Breaking Analysis: Databricks faces critical strategic decisions…here’s why
>> From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante. >> Spark became a top level Apache project in 2014, and then shortly thereafter, burst onto the big data scene. Spark, along with the cloud, transformed, and in many ways disrupted, the big data market. Databricks optimized its tech stack for Spark and took advantage of the cloud to really cleverly deliver a managed service that has become a leading AI and data platform among data scientists and data engineers. However, emerging customer data requirements are shifting in a direction that will cause modern data platform players generally, and Databricks specifically, we think, to make some key directional decisions and perhaps even reinvent themselves. Hello and welcome to this week's Wikibon theCUBE Insights, powered by ETR. In this Breaking Analysis, we're going to do a deep dive into Databricks. We'll explore its current impressive market momentum, using some ETR survey data to show that, and then we'll lay out how customer data requirements are changing and what the ideal data platform will look like in the midterm future. We'll then evaluate core elements of the Databricks portfolio against that vision, and then we'll close with some strategic decisions that we think the company faces. And to do so, we welcome in our good friend, George Gilbert, former equities analyst, market analyst, and current Principal at TechAlpha Partners. George, good to see you. Thanks for coming on. >> Good to see you, Dave. >> All right, let me set this up. We're going to start by taking a look at where Databricks sits in the market in terms of how customers perceive the company and what its momentum looks like. And this chart that we're showing here is data from the ETS, ETR's Emerging Technology Survey of private companies. The N is 1,421. What we did is we cut the data on three sectors: analytics, database-data warehouse, and AI/ML. The vertical axis is a measure of customer sentiment, which evaluates an IT decision maker's awareness of the firm and the likelihood of engaging and/or purchase intent. The horizontal axis shows mindshare in the dataset, and we've highlighted Databricks, which has been a consistent high performer in this survey over the last several quarters. By the way, just as an aside, as we previously reported, OpenAI, which burst onto the scene this past quarter, leads all names, but Databricks is still prominent. You can see that ETR shows some open source tools for reference, but as far as firms go, Databricks is very impressively positioned. Now, let's see how they stack up to some mainstream cohorts in the data space, against some bigger companies and sometimes public companies. This chart shows net score on the vertical axis, which is a measure of spending momentum, and pervasiveness in the data set on the horizontal axis. You can see the chart insert in the upper right that informs how the dots are plotted: net score against Shared N. And that red dotted line at 40% indicates a highly elevated net score; anything above that we think is really, really impressive. And here we're just comparing Databricks with Snowflake, Cloudera, and Oracle. And that squiggly line leading to Databricks shows their path since 2021 by quarter. And you can see it's performing extremely well, maintaining an elevated net score in that range.
Now it's comparable on the vertical axis to Snowflake, and it is consistently moving to the right and gaining share. Now, why did we choose to show Cloudera and Oracle? The reason is that Cloudera got the whole big data era started and was disrupted by Spark, and of course by the cloud and Databricks. And Oracle, in many ways, was the target of early big data players like Cloudera. Take a listen to Cloudera's CEO at the time, Mike Olson. This is back in 2010, the first year of theCUBE, play the clip. >> Look, back in the day, if you had a data problem, if you needed to run business analytics, you wrote the biggest check you could to Sun Microsystems, and you bought a great big, single box, central server, and any money that was left over, you handed to Oracle for database licenses, and you installed that database on that box, and that was where you went for data. That was your temple of information. >> Okay, so Mike Olson implied that monolithic model was too expensive and inflexible, and Cloudera set out to fix that. But the best laid plans, as they say. George, what do you make of the data that we just shared? >> So where Databricks really came up out of, sort of, Cloudera's tailpipe was that they took big data processing, made it coherent, and made it a managed service so it could run in the cloud. So it relieved customers of the operational burden. Where they're really strong, where their traditional meat and potatoes or bread and butter is, is predictive and prescriptive analytics, that is, building and training and serving machine learning models. They've tried to move into traditional business intelligence, the more traditional descriptive and diagnostic analytics, but they're less mature there. So what that means is, the reason you see Databricks and Snowflake kind of side by side is that there are many, many accounts that have both: Snowflake for business intelligence, Databricks for AI and machine learning. Where Databricks also did really well was in core data engineering, refining the data, the old ETL process, which kind of turned into ELT, where you load data into the analytic repository in raw form and refine it there. And so people have really used both, and each is trying to get into the other. >> Yeah, absolutely. We've reported on this quite a bit: Snowflake kind of moving into the domain of Databricks and vice versa. And the last bit of ETR evidence that we want to share in terms of the company's momentum comes from ETR's Round Tables. They're run by Erik Bradley and by Daren Brabham, now a former Gartner analyst and, George, your colleague back at Gartner. And what we're going to show here is some direct quotes of IT pros in those Round Tables. There's a data science head and a CIO as well. I'll just make a few call outs here, we won't spend too much time on it, but starting at the top: like all of us, we can't talk about Databricks without mentioning Snowflake; those two get us excited. The second comment zeros in on the flexibility and the robustness of Databricks from a data warehouse perspective. And then the last point is, despite competition from cloud players, Databricks has reinvented itself a couple of times over the years. And George, we're going to lay out today a scenario that perhaps calls for Databricks to do that once again. >> Their big opportunity, and the big challenge for every tech company, is managing a technology transition. The transition that we're talking about is something that's been bubbling up, but it's really epochal.
For the first time in 60 years, we're moving from an application-centric view of the world to a data-centric view, because decisions are becoming more important than automating processes. So let me have you sort of develop that. >> Yeah, so let's talk about that here. We're going to put up some bullets on precisely that point and the changing customer environment. So you've got IT stacks shifting, as George just said, from application-centric silos to data-centric stacks, where the priority is shifting from automating processes to automating decisions. You know, look at RPA, there's still a lot of automation going on, but that focus on application centricity, with the data locked into those apps, is changing. Data has historically been on the outskirts in silos, but organizations, think of Amazon, think Uber, Airbnb, are putting data at the core, and logic is increasingly being embedded in the data instead of the reverse. In other words, today the data's locked inside the app, which is why you need to extract that data and stick it in a data warehouse. The point, George, is we're putting forth this new vision for how data is going to be used. And you've used this Uber example to underscore the future state. Please explain. >> Okay, so this is hopefully an example everyone can relate to. The idea is first, you're automating things that are happening in the real world and decisions that make those things happen autonomously, without humans in the loop all the time. So to use the Uber example, on your phone, you call a car, you call a driver. Automatically, the Uber app then looks at what drivers are in the vicinity, what drivers are free, matches one, calculates an ETA to you, calculates a price, calculates an ETA to your destination, and then directs the driver once they're there. The point of this is that that cannot happen very easily in an application-centric world, because all these little pieces, the drivers, the riders, the routes, the fares, call on data locked up in many different apps, but they have to sit on a layer that makes it all coherent. >> But George, so if Uber's doing this, doesn't this tech already exist? Isn't there a tech platform that does this already? >> Yes, and the mission of the entire tech industry is to build services that make it possible to compose and operate similar platforms and tools, but with the skills of mainstream developers in mainstream corporations, not the rocket scientists at Uber and Amazon. >> Okay, so we're talking about horizontally scaling across the industry, and actually giving a lot more organizations access to this technology. So by way of review, let's summarize the trend that's going on today in terms of the modern data stack that is propelling the likes of Databricks and Snowflake, which we just showed you in the ETR data, and is really a tailwind for them. So the trend is toward this common repository for analytic data. That could be multiple virtual data warehouses inside of Snowflake, but you're in that Snowflake environment, or Lakehouses from Databricks, or multiple data lakes. And we've talked about what JP Morgan Chase is doing with the data mesh and gluing data lakes together. You've got various public clouds playing in this game, and then the data is annotated to have a common meaning. In other words, there's a semantic layer that enables applications to talk to the data elements and know that they have common and coherent meaning.
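To make that idea of annotating data with a common meaning a bit more concrete, here is a minimal Python sketch. Everything in it, the `SemanticBinding` structure and the source paths, is hypothetical and invented for illustration; real semantic layers, such as dbt metric definitions or knowledge graphs, are far richer than this.

```python
from dataclasses import dataclass

# Hypothetical semantic-layer entries: each physical column from an
# operational app is mapped to a shared business concept, so every
# downstream tool reads the same meaning from different sources.
@dataclass
class SemanticBinding:
    source: str       # physical location, e.g. "rides_db.trips.fare_usd"
    concept: str      # shared business concept, e.g. "Fare"
    unit: str         # common unit across all sources
    description: str

bindings = [
    SemanticBinding("rides_db.trips.fare_usd", "Fare", "USD",
                    "Price quoted to the rider at booking time"),
    SemanticBinding("billing_db.invoices.amount", "Fare", "USD",
                    "Amount actually charged to the rider"),
    SemanticBinding("driver_app.sessions.driver_id", "Driver", "id",
                    "Driver identity shared across all apps"),
]

# Any application can now ask for "Fare" without knowing which app owns it.
for b in [b for b in bindings if b.concept == "Fare"]:
    print(f"{b.source} -> {b.concept} ({b.unit}): {b.description}")
```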
So George, the good news is this approach is more effective than the legacy monolithic models that Mike Olson was talking about, so what's the problem with this in your view? >> So today's data platforms added immense value 'cause they connected the data that was previously locked up in these monolithic apps or in all these different microservices, and that supported traditional BI and AI/ML use cases. But now we want to build apps like Uber or Amazon.com, where they've got essentially an autonomously running supply chain and e-commerce app that humans only care for and feed, while the thing itself figures out what to buy, when to buy, where to deploy it, when to ship it. For that, we need a semantic layer on top of the data, so that, as you were saying, the data coming from all those different apps is integrated, not just connected, it means the same thing. And the issue is, whenever you add a new layer to a stack to support new applications, there are implications for the already existing layers, like, can they support the new layer and its use cases? So for instance, if you add a semantic layer that embeds app logic with the data rather than vice versa, which is what we've been talking about and has been the case for 60 years, then the new data layer faces challenges: the way you manage that data and the way you analyze that data are not supported by today's tools. >> Okay, so actually, Alex, bring me up that last slide if you would. I mean, you're basically saying at the bottom here, today's repositories don't really do joins at scale. The future you're talking about is hundreds or thousands or millions of data connections, and today's systems, we're talking about, I don't know, 6, 8, 10 joins. And that is the fundamental problem, you're saying: a new data era is coming, and existing systems won't be able to handle it? >> Yeah, one way of thinking about it is that even though we call them relational databases, when we actually want to do lots of joins or when we want to analyze data from lots of different tables, we created a whole new industry for analytic databases where you sort of munge the data together into fewer tables, so you didn't have to do as many joins, because the joins are difficult and slow. And when you're going to arbitrarily join thousands, hundreds of thousands, or millions of elements, you need a new type of database. We have them, they're called graph databases, but to query them, you go back to the prerelational era in terms of their usability. >> Okay, so we're going to come back to that and talk about how you get around that problem. But let's first lay out what we think the ideal data platform of the future looks like. And again, we're going to come back to this Uber example. In this graphic that George put together, awesome job, we've got three layers. The application layer is where the data products reside. The example here is drivers, rides, maps, routes, ETA, et cetera, the digital version of what we were talking about in the previous slide: people, places, and things. The next layer is the data layer. That breaks down the silos and connects the data elements through semantics, and everything is coherent. And then at the bottom layer, the legacy operational systems feed that data layer. George, explain what's different here, the graph database element, you talk about the relational query capabilities, and why can't I just throw memory at solving this problem?
>> Some of the graph databases do throw memory at the problem, and maybe without naming names, some of them live entirely in memory. And what you're dealing with is a prerelational, in-memory database system where you navigate between elements. And the issue with that is we've had SQL for 50 years, so we don't have to navigate; we can say what we want without specifying how to get it. That's the core of the problem. >> Okay. So if I may, I just want to drill into this a little bit. So you're talking about the expressiveness of a graph. Alex, if you'd bring that back out, the fourth bullet: expressiveness of a graph database with the relational ease of query. Can you explain what you mean by that? >> Yeah, so graphs are great because you can describe anything with a graph; that's why they're becoming so popular. Expressive means you can represent anything easily. They're conducive to, you might say, a world where we now want something like the metaverse, like a 3D world, and I don't mean the Facebook metaverse, I mean the business metaverse, where we want to capture data about everything, but we want it in context. We want to build a set of digital twins that represent everything going on in the world, and Uber is a tiny example of that. Uber built a graph to represent all the drivers and riders and maps and routes. But what you need out of a database isn't just a way to store stuff and update stuff; you need to be able to ask questions of it, you need to be able to query it. And if you go back to prerelational days, you had to know how to find your way to the data. It's sort of like giving directions to someone before GPS and mapping systems: you had to give them turn-by-turn directions. Whereas with a GPS and a mapping system, which is like the relational thing, you just say where you want to go, and it spits out the turn-by-turn directions, which, let's say, the car might follow, or whoever you're directing would follow. The point is, it's much easier in a relational database to say, "I just want to get these results. You figure out how to get it." Graph databases have not taken over the world because, in some ways, querying them is taking a 50-year leap backwards. >> Alright, got it. Okay. Let's take a look at how the current Databricks offerings map to that ideal state that we just laid out. So to do that, we put together this chart that looks at the key elements of the Databricks portfolio: the core capability, the weakness, and the threat that may loom. Start with Delta Lake, that's the storage layer, which is great for files and tables. It's got true separation of compute and storage as independent elements, I want you to double-click on that, George, but it's weaker for the type of low-latency ingest that we see coming in the future. And some of the threats are highlighted here: AWS could add transactional tables to S3, and Iceberg adoption is picking up and could accelerate, which could disrupt Databricks. George, add some color here, please. >> Okay, so this is sort of a classic competitive forces analysis, where you want to look at what customers are demanding, what the competitive pressure is, what the substitutes are, even what your suppliers might be pushing. Here, Delta Lake is, at its core, a set of transactional tables that sit on an object store. So think of it as, in a database system, this is the storage engine. And since S3 has been getting stronger for 15 years, you could see a scenario where they add transactional tables.
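To ground what a set of transactional tables sitting on an object store looks like in practice, here is a minimal sketch using the open source `deltalake` Python package (the delta-rs bindings). The S3 bucket path is a hypothetical placeholder, real access would also need credentials, and exact APIs vary by version; this illustrates the mechanics, not Databricks' managed service.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake  # pip install deltalake pandas

# A Delta table is Parquet data files plus a transaction log, so it can
# sit on any object store. The path below is a hypothetical placeholder;
# real S3 usage would pass credentials via storage_options.
path = "s3://example-bucket/rides_delta"

rides = pd.DataFrame({"ride_id": [1, 2], "fare_usd": [12.50, 8.75]})
write_deltalake(path, rides)                # commit 0: creates the table

more = pd.DataFrame({"ride_id": [3], "fare_usd": [21.00]})
write_deltalake(path, more, mode="append")  # commit 1: an ACID append

dt = DeltaTable(path)
print(dt.version())    # -> 1
print(dt.to_pandas())  # readers always see a consistent snapshot
```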
We have an open source alternative in Iceberg, which Snowflake and others support. But at the same time, Databricks has built an ecosystem out of tools, their own and others', that read and write to Delta tables; that's what makes Delta Lake an ecosystem. So they have a catalog, and the whole machine learning tool chain talks directly to the data here. That was their great advantage, because in the past with Snowflake, you had to pull all the data out of the database before the machine learning tools could work with it; that was a major shortcoming. They fixed that. But the point here is that even before we get to the semantic layer, the core foundation is under threat. >> Yep. Got it. Okay. We got a lot of ground to cover, so we're going to take a look at the Spark execution engine next. Think of that as the refinery that runs really efficient batch processing. That's kind of what disrupted Hadoop in a large way, but it's not Python friendly, and that's an issue because the data science and data engineering crowds are moving in that direction, and/or they're using DBT. George, we had Tristan Handy on at Supercloud, really interesting discussion that you and I did. Explain why this is an issue for Databricks. >> So once the data lake was in place, what people did was they refined their data in batch, and Spark has always had streaming support, and it's gotten better. The underlying storage, as we've talked about, is an issue. But basically, they took raw data, then they refined it into tables that were like customers and products and partners, and then they refined that again into what were like gold artifacts, which might be business intelligence metrics or dashboards, which were collections of metrics. But they were running it on the Spark execution engine, which is a Java-based engine running on a Java-based virtual machine, which means all the data scientists and data engineers who want to work with Python are really working with sort of oil and water. Like, if you get an error in Python, you can't tell whether the problem is in Python or in Spark; there's just an impedance mismatch between the two. And at the same time, the whole world is now gravitating toward DBT, because it's a very nice and simple way to compose these data processing pipelines, and people are using either SQL in DBT or Python in DBT, and that kind of is a substitute for doing it all in Spark. So it's under threat even before we get to that semantic layer. It so happens that DBT itself is becoming the authoring environment for the semantic layer, with business intelligence metrics. But again, this is the second element that's under direct substitution and competitive threat. >> Okay, let's now move down to the third element, which is Photon. Photon is Databricks' BI lakehouse engine, which has integration with the Databricks tooling, which is very rich, and it's newer. It's also not well suited for the high-concurrency and low-latency use cases, which we think are going to increasingly become the norm over time. George, the call-out threat here is customers want to connect everything to a semantic layer. Explain your thinking here and why this is a potential threat to Databricks. >> Okay, so two issues here. What you were touching on, the high concurrency and low latency: when people are running like thousands of dashboards and data is streaming in, that's a problem, because a SQL data warehouse query engine, something like that, matures over five to 10 years.
It's one of these things, like the joke Andy Jassy makes in general, he's really talking about Azure, but there's no compression algorithm for experience. The Snowflake guys started more than five years earlier, and for a bunch of reasons, that lead is not something that Databricks can shrink. They'll always be behind. So that's why Snowflake has transactional tables now, and we can get into that in another show. But the key point is, near term, it's struggling to keep up with the use cases that are core to business intelligence, which are highly concurrent, lots of users doing interactive queries. But then when you get to a semantic layer, that's when you need to be able to query data that might have thousands or tens of thousands or hundreds of thousands of joins. And a traditional SQL query engine is just not built for that. That's the core problem of traditional relational databases. >> Now, this is a quick aside. We always talk about Snowflake and Databricks in sort of the same context. We're not necessarily saying that Snowflake is in a position to tackle all these problems; we'll deal with that separately. So we don't mean to imply that, but we're just sort of laying out some of the things that, we think, Databricks customers need to be thinking about and having conversations with Databricks about, and we hope to have those conversations as well. We'll come back to that in terms of strategic options. But finally, coming back to the table, we have Databricks' AI/ML tool chain, which has been an awesome capability for the data science crowd. It's comprehensive, it's a one-stop-shop solution, but the kicker here is that it's optimized for supervised model building. And the concern is that foundation models like GPT could cannibalize the current Databricks tooling. But George, can't Databricks, like other software companies, integrate foundation model capabilities into its platform? >> Okay, so the sound bite answer to that is sure, IBM 3270 terminal apps could call out to a graphical user interface when they were running in an emulator on the PC, but they're not exactly good citizens in that world. The core issue is Databricks has this wonderful end-to-end tool chain for training, deploying, monitoring, and running inference on supervised models. But the paradigm there is that the customer builds and trains and deploys each model for each feature or application. In a world of foundation models, which are pre-trained and unsupervised, the entire tool chain is different. So it's not like Databricks can junk everything they've done and start over with all their engineers. They have to keep maintaining what they've done in the old world, but they have to build something new that's optimized for the new world. It's a classic technology transition, and their mentality appears to be, "Oh, we'll support the new stuff from our old stuff," which is suboptimal. And as we'll talk about, their biggest patron and the company that put them on the map, Microsoft, really stopped working on their old stuff three years ago so that they could build a new tool chain optimized for this new world. >> Yeah, and so let's close with what we think the options are and the decisions that Databricks faces for its future architecture. They're smart people. I mean, we've had Ali Ghodsi on many times, super impressive. I think they've got to be keenly aware of the limitations and of what's going on with foundation models. But at any rate, here in this chart, we lay out sort of three scenarios.
One is to re-architect the platform by incrementally adopting new technologies; an example might be to layer a graph query engine on top of its stack. Two, they could license key technologies like a graph database. Three, they could get aggressive on M&A and buy in relational knowledge graphs, semantic technologies, vector database technologies. George, as David Floyer always says, "A lot of ways to skin a cat." We've seen companies, even think about EMC, maintain their relevance through M&A for many, many years. George, give us your thoughts on each of these strategic options. >> Okay, I find this question the most challenging, 'cause remember, I used to be an equity research analyst. I worked for Frank Quattrone, and we were one of the top tech shops in the banking industry, although this was 20 years ago. But the M&A team was the top team in the industry, and everyone wanted them on their side. And I remember going to meetings with these CEOs, where Frank and the bankers would say, "You want us for your M&A work because we can do better." And they really could do better. But in software, it's not like with EMC in hardware, because with hardware it's easier to connect different boxes. With software, the whole point of a software company is to integrate and architect the components so they fit together and reinforce each other, and that makes M&A harder. You can do it, but it takes a long time to fit the pieces together. Let me give you examples. If they put a graph query engine, let's say something like TinkerPop, on top of, I don't even know if it's possible, but let's say they put it on top of Delta Lake, then you have this graph query engine talking to their storage layer, Delta Lake. But if you want to do analysis, you've got to put the data in Photon, which is not really ideal for highly connected data. If you license a graph database, then most of your data is in the Delta Lake, and how do you sync it with the graph database? If you do sync it, you've got data in two places, which kind of defeats the purpose of having a unified repository. I find the semantic layer option in number three actually more promising, because that's something you can layer on top of the storage layer you already have. You just have to figure out how to have your query engines talk to it. What I'm trying to highlight is, it's easy as an analyst to say, "You can buy this company or license that technology." But the really hard work is making it all work together, and that is where the challenge is. >> Yeah, and well, look, I thank you for laying that out. We've seen it, certainly with Microsoft and Oracle. I guess you might argue that, well, Microsoft had a monopoly in its desktop software and was able to throw off cash for a decade-plus while its stock was going sideways, and Oracle had won the database wars and had amazing margins and cash flow to be able to do that. Databricks hasn't even gone public yet. But I want to close with some of the players to watch. Alex, if you'd bring that back up, number four here. AWS, we talked about some of their options with S3, and it's not just AWS, it's blob storage, object storage generally. Microsoft, as you sort of alluded to, was an early go-to-market channel for Databricks; we didn't really address that, so maybe in the closing comments we can. Google obviously; Snowflake, of course, we're going to dissect their options in a future Breaking Analysis. DBT Labs, where do they fit? Bob Muglia's company, Relational.ai. Why are these players to watch, George, in your opinion?
>> So everyone is trying to assemble and integrate the pieces that would make building data applications, data products, easy. And the critical part isn't just assembling a bunch of pieces, which is traditionally what AWS did. That's a Unix ethos: we give you the tools, you put 'em together, 'cause you then have the maximum choice and maximum power. So what the hyperscalers are doing is they're taking their key value stores, in the case of AWS it's DynamoDB, in the case of Azure it's Cosmos DB, and each is putting a graph query engine on top of those. So they have unified storage and a graph database engine, like all the data would be collected in the key value store, and then you have a graph database; that's how they're going to present a foundation for building these data apps. DBT Labs is putting a semantic layer on top of data lakes and data warehouses, and as we'll talk about, I'm sure, in the future, that makes it easier to swap out the underlying data platform or swap in new ones for specialized use cases. Snowflake, what they're doing, they're so strong in data management, and with their transactional tables, what they're trying to do is take in the operational data that used to be the province of state stores like MongoDB and say, "If you manage that data with us, it'll be connected to your analytic data without having to send it through a pipeline." And that's hugely valuable. Relational.ai is the wildcard, 'cause what they're trying to do is almost like a holy grail, where you take the expressiveness of connecting all your data in a graph, but make it as easy to query as you've always had it in a SQL database, or I should say, in a relational database. And if they do that, it'll be as easy to program these data apps as a spreadsheet was compared to procedural languages like BASIC or Pascal. That's the implication of Relational.ai. >> Yeah, and again, we talked before, why can't you just throw this all in memory? We're talking, in that example, about really getting down to differences in how you lay the data out on disk, really a new database architecture, correct? >> Yes. And that's why it's not clear that you could take a data lake, or even a Snowflake, and put a relational knowledge graph on those. You could potentially put a graph database on them, but it'll be compromised, because to really do what Relational.ai has done, which is the ease of relational on top of the power of graph, you actually need to change how you're storing your data on disk or even in memory. In other words, it's not like, "Oh, we can just add graph support to Snowflake," because if you did that, in Snowflake or in your data lake, you'd have to change how the data is physically laid out, and that would break all the tools that currently talk to it. >> What, in your estimation, is the timeframe where this becomes critical for a Databricks, and potentially a Snowflake and others? I mentioned earlier midterm: are we talking three to five years here? Are we talking end of decade? What does your radar say? >> I think something surprising is going on that's going to sort of come up the tailpipe and take everyone by storm.
There's all this hype around business intelligence metrics, which are what we used to put in our dashboards: bookings, billings, revenue, customers, those things. Those were the key artifacts that used to live as definitions in your BI tools, and DBT has basically created a standard for defining them so they live in your data pipeline, defined in the pipeline and executed in the data warehouse or data lake in a shared way, so that all tools can use them. This sounds like a digression; it's not. All this stuff about data mesh, data fabric, what's really going on is that we need a semantic layer, and the business intelligence metrics are defining common semantics for your data. And I think we're going to find, by the end of this year, that metrics are how we annotate all our analytic data to start adding common semantics to it. And we're going to find this semantic layer, it's not three to five years off, it's going to be staring us in the face by the end of this year. >> Interesting. And of course, SVB today was shut down. We're seeing serious tech headwinds, and oftentimes in these sorts of downturns, or flat turns, which feels like this could be going on for a while, we emerge with a lot of new players and a lot of new technology. George, we've got to leave it there. Thank you to George Gilbert for excellent insights and input for today's episode. I want to thank Alex Myerson, who's on production and manages the podcast, and of course Ken Schiffman as well. Kristin Martin and Cheryl Knight help get the word out on social media and in our newsletters. And Rob Hof is our EIC over at SiliconANGLE.com; he does some great editing. Remember, all these episodes are available as podcasts. Wherever you listen, all you've got to do is search Breaking Analysis Podcast. We publish each week on wikibon.com and siliconangle.com, or you can email me at David.Vellante@siliconangle.com, or DM me @DVellante, or comment on our LinkedIn posts. And please do check out ETR.ai, great survey data, enterprise tech focus, phenomenal. This is Dave Vellante for theCUBE Insights, powered by ETR. Thanks for watching, and we'll see you next time on Breaking Analysis.
Robert Nishihara, Anyscale | AWS Startup Showcase S3 E1
(upbeat music) >> Hello everyone. Welcome to theCUBE's presentation of the "AWS Startup Showcase." The topic this episode is AI and machine learning: top startups building foundational model infrastructure. This is season three, episode one of the ongoing series covering exciting startups from the AWS ecosystem. And this time we're talking about AI and machine learning. I'm your host, John Furrier. I'm excited to be joined today by Robert Nishihara, who's the co-founder and CEO of a hot startup called Anyscale. He's here to talk about Ray, the open source project, and Anyscale's infrastructure for foundation models as well. Robert, thank you for joining us today. >> Yeah, thanks so much. >> I've been following your company since the founding, pre-pandemic, and you guys really had a great vision, scaled up, and are in a perfect position for this big wave that we all see with ChatGPT and OpenAI. Finally, AI has broken out through the ropes and gone mainstream, so I think you guys are really well positioned. I'm looking forward to talking with you today. But before we get into it, introduce the core mission for Anyscale. Why do you guys exist? What is the North Star for Anyscale? >> Yeah, like you mentioned, there's a tremendous amount of excitement about AI right now. You know, I think a lot of us believe that AI can transform just about every industry. So one of the things that was clear to us when we started this company was that the amount of compute needed to do AI was just exploding. To actually succeed with AI, companies like OpenAI or Google, you know, these companies getting a lot of value from AI, were not just running these machine learning models on their laptops or on a single machine. They were scaling these applications across hundreds or thousands or more machines and GPUs and other resources in the Cloud. And so to actually succeed with AI, and this has been one of the biggest trends in computing, maybe the biggest trend in computing in recent history, the amount of compute has been exploding. And so to actually succeed with that AI, to actually build these scalable applications and scale the AI applications, there's a tremendous software engineering lift to build the infrastructure to actually run them. And that's very hard to do. So one of the reasons many AI projects and initiatives fail, or don't make it to production, is the need for this scale, the infrastructure lift, to actually make it happen. So our goal here with Anyscale and Ray is to make that easy, to make scalable computing easy. So that as a developer or as a business, if you want to do AI, if you want to get value out of AI, all you need to know is how to program on your laptop. Like, all you need to know is how to program in Python. And if you can do that, then you're good to go. Then you can do what companies like OpenAI or Google do and get value out of machine learning. >> That programming example of how easy it is with Python reminds me of the early days of Cloud, when infrastructure as code was talked about: it was making the infrastructure programmable, just code. That's super important. That's what AI people want: to just program AI. That's the new trend. And I want to understand, if you don't mind explaining, the relationship that Anyscale has to these foundation models, and in particular the large language models, also called LLMs, as seen with OpenAI and ChatGPT.
Before you get into the relationship you have with them, can you explain why the hype around foundation models? Why are people going crazy over foundation models? What are they, and why are they so important? >> Yeah, so foundation models are incredibly important because they enable businesses and developers to get value out of machine learning, to use machine learning off the shelf with these large models that have been trained on tons of data and that are useful out of the box. And then, of course, as a business or as a developer, you can take those foundation models and repurpose them or fine-tune them or adapt them to your specific use case and what you want to achieve. But it's much easier to do that than to train them from scratch. And for people to actually use foundation models, there are three main types of workloads or problems that need to be solved. One is training these foundation models in the first place, like actually creating them. The second is fine-tuning them and adapting them to your use case. And the third is serving them and actually deploying them. Okay, so Ray and Anyscale are used for all three of these workloads. Companies like OpenAI or Cohere that train large language models, and open source versions like GPT-J, do that on top of Ray. There are many startups and other businesses that don't want to train the large underlying foundation models, but do want to fine-tune them, adapt them to their purposes, build products around them, and serve them; those are also using Ray and Anyscale for that fine-tuning and that serving. And so the reason that Ray and Anyscale are important here is that, you know, building and using foundation models requires huge scale. It requires a lot of data. It requires a lot of compute: GPUs, TPUs, other resources. And to actually take advantage of that and actually build these scalable applications, there's a lot of infrastructure that needs to happen under the hood. And so you can either use Ray and Anyscale to take care of that, manage the infrastructure, and solve those infrastructure problems, or you can build and manage the infrastructure yourself, which you can do, but it's going to slow your team down. You know, many of the businesses we work with simply don't want to be in the business of managing infrastructure and building infrastructure. They want to focus on product development and move faster. >> I know you got a keynote presentation we're going to go to in a second, but I think you hit on something I think is the real tipping point: doing it yourself is hard to do. These are things where opportunities are, and the Cloud did that with data centers. It took a data center and made it an API. The heavy lifting went away and went to the Cloud, so people could be more creative and build their product. In this case, build with their creativity. Is that kind of the big deal? Is that kind of a big deal happening, that you guys are taking the learnings and making that available so people don't have to do that? >> That's exactly right. So today, if you want to succeed with AI, if you want to use AI in your business, infrastructure work is on the critical path for doing that. To do AI, you have to build infrastructure. You have to figure out how to scale your applications. That's going to change.
We're going to get to the point, you know, with Ray and Anyscale, where we remove the infrastructure from the critical path, so that as a developer or as a business, all you need to focus on is your application logic: what you want the program to do, what you want your application to do, how you want the AI to actually interface with the rest of your product. Now, the way that will happen is that the infrastructure work will still happen; it'll just be under the hood and taken care of by Ray and Anyscale. And so I think something like this is really necessary for AI to reach its potential, for AI to have the impact and the reach that we think it will; you have to make it easier to do. >> And just for clarification, if you don't mind, explain the relationship of Ray and Anyscale real quick before we get into the presentation. >> So Ray is an open source project. We created it when we were at Berkeley doing machine learning. We started Ray in order to provide a simple open source tool for building and running scalable applications. And Anyscale is the managed version of Ray; basically, we will run Ray for you in the Cloud, provide a lot of tools around the developer experience, manage the infrastructure, and provide more performance and superior infrastructure. >> Awesome. I know you got a presentation on Ray and Anyscale, and you guys are positioning yourselves as the infrastructure for foundation models. So I'll let you take it away, and then when you're done presenting, we'll come back, I'll probably grill you with a few questions, and then we'll close it out. So take it away. >> Robert: Sounds great. So I'll say a little bit about how companies are using Ray and Anyscale for foundation models. The first thing I want to mention is just why we're doing this in the first place. And the underlying observation, the underlying trend here, and this is a plot from OpenAI, is that the amount of compute needed to do machine learning has been exploding. It's been growing at something like 35 times every 18 months. This is absolutely enormous. And other people have written papers measuring this trend, and you get different numbers. But the point is, no matter how you slice and dice it, it's an astronomical rate. Now, if you compare that to something we're all familiar with, like Moore's Law, which says that processor performance doubles roughly every 18 months, you can see that there's just a tremendous gap between the compute needs of machine learning applications and what you can do with a single chip, right. So even if Moore's Law were continuing strong, doing what it used to be doing, there would still be a tremendous gap between what you can do with a chip and what you need in order to do machine learning. And so given this graph, what we've seen, and what has been clear to us since we started this company, is that doing AI requires scaling. There's no way around it. It's not a nice-to-have, it's really a requirement. And so that led us to start Ray, which is the open source project that we started to make it easy to build these scalable Python applications and scalable machine learning applications. And since we started the project, it's been adopted by a tremendous number of companies.
Companies like OpenAI, which use Ray to train their large models like ChatGPT; companies like Uber, which run all of their deep learning and classical machine learning on top of Ray; companies like Shopify or Spotify or Instacart or Lyft or Netflix or ByteDance, which use Ray for their machine learning infrastructure. Companies like Ant Group, which makes Alipay, use Ray across the board for fraud detection, for online learning, for detecting money laundering, for graph processing, stream processing. Companies like Amazon run Ray at a tremendous scale, processing petabytes of data every single day. And so the project has seen just enormous adoption over the past few years. And one of the most exciting use cases is really providing the infrastructure for building, training, fine-tuning, and serving foundation models. So I'll say a little bit about that. Here are some examples of companies using Ray for foundation models. Cohere trains large language models. OpenAI also trains large language models. The workloads required there are things like supervised pre-training, and also reinforcement learning from human feedback. So this is not only the regular supervised learning, but actually more complex reinforcement learning workloads that take human input about which response to a particular question is better than another response, and incorporate that into the learning. There are open source versions as well, like GPT-J, also built on top of Ray, as well as projects like Alpa coming out of UC Berkeley. So these are some examples of exciting projects and organizations training and creating these large language models and serving them using Ray. Okay, so what actually is Ray? Well, there are two layers to Ray. At the lowest level, there's the core Ray system. This is essentially low-level primitives for building scalable Python applications, things like taking a Python function or a Python class and executing them in the cluster setting. So Ray core is extremely flexible, and you can build arbitrary scalable applications on top of Ray. On top of the core system, what really gives Ray a lot of its power is this ecosystem of scalable libraries. So on top of the core system, you have scalable libraries for ingesting and pre-processing data, for training your models, for fine-tuning those models, for hyperparameter tuning, for doing batch processing and batch inference, for doing model serving and deployment, right. And a lot of the Ray users, the reason they like Ray is that they want to run multiple workloads. They want to train and serve their models, right. They want to load their data and feed that into training. And Ray provides common infrastructure for all of these different workloads. So this is a little overview of the different components of Ray. So why do people choose to go with Ray? I think there are three main reasons. The first is the unified nature, the fact that it is common infrastructure for scaling arbitrary workloads, from data ingest to pre-processing to training to inference and serving, right. This also includes the fact that it's future-proof. AI is incredibly fast moving. And so many companies that have built their own machine learning infrastructure and standardized on particular workflows for doing machine learning have found that their workflows are too rigid to enable new capabilities.
If they want to do reinforcement learning, if they want to use graph neural networks, they don't have a way of doing that with their standard tooling. And so Ray, being future-proof and being flexible and general, gives them that ability. Another reason people choose Ray and Anyscale is the scalability. This is really our bread and butter. This is the whole point of Ray: making it easy to go from your laptop to running on thousands of GPUs, making it easy to scale your development workloads and run them in production, making it easy to scale training, to scale data ingest, pre-processing, and so on. So scalability and performance are critical for doing machine learning, and that is something that Ray provides out of the box. And lastly, Ray is an open ecosystem. You can run it anywhere. You can run it on any Cloud provider: Google Cloud, AWS, Azure. You can run it on your Kubernetes cluster. You can run it on your laptop. It's extremely portable. And not only that, it's framework agnostic. You can use Ray to scale arbitrary Python workloads, and it integrates with libraries like TensorFlow or PyTorch or JAX or XGBoost or Hugging Face or PyTorch Lightning, right, or scikit-learn, or just your own arbitrary Python code. It's open source. And in addition to integrating with the rest of the machine learning ecosystem and these machine learning frameworks, you can use Ray along with all of the other tooling in the machine learning ecosystem. That's things like Weights & Biases or MLflow, right, or different data platforms like Databricks, you know, Delta Lake or Snowflake, or tools for model monitoring, or feature stores; all of these integrate with Ray. And so Ray provides that kind of flexibility so that you can integrate it into the rest of your workflow. And then Anyscale is the scalable compute platform that's built on top, that provides Ray. So Anyscale is a managed Ray service that runs in the Cloud. And what Anyscale does is it offers the best way to run Ray. And if you think about what you get with Anyscale, there are fundamentally two things. One is about moving faster, accelerating the time to market. And you get that by having the managed service, so that as a developer you don't have to worry about managing infrastructure, you don't have to worry about configuring infrastructure. It also provides optimized developer workflows: things like easily moving from development to production, things like having the observability tooling, the debuggability, to actually easily diagnose what's going wrong in a distributed application. So things like the dashboards and the other kinds of tooling for collaboration, for monitoring, and so on. So that's the first bucket: developer productivity, moving faster, faster experimentation and iteration. The second reason that people choose Anyscale is superior infrastructure. So this is things like cost efficiency, being able to easily take advantage of spot instances, being able to get higher GPU utilization, things like faster cluster startup times and auto scaling, things like just overall better performance and faster scheduling. And so these are the kinds of things that Anyscale provides on top of Ray: the managed infrastructure, the developer productivity and velocity, as well as performance. So this is what I wanted to share about Ray and Anyscale.
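Since the presentation leans on Ray core's "take a Python function or class and execute it in the cluster" primitives, here is a minimal sketch of what that looks like. It follows Ray's documented core API, the @ray.remote decorator, though exact behavior can vary across Ray versions.

```python
import ray

ray.init()  # connects to an existing cluster if one exists, else starts Ray locally

# A plain Python function becomes a distributed task.
@ray.remote
def square(x):
    return x * x

# A plain Python class becomes a stateful actor running in the cluster.
@ray.remote
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total

# .remote() schedules work across the cluster and returns futures immediately;
# ray.get() blocks until the results are ready.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]

counter = Counter.remote()
print(ray.get(counter.add.remote(10)))  # 10
```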
>> John: Awesome. >> I just wanted to provide that context. But John, I'm curious what you think. >> I love it. So first of all, it's a platform, because that's the platform architecture right there. So just to clarify, this is an Anyscale platform, not- >> That's right. >> Tools. So you got tools in the platform. Okay, that's key. Love that managed service. Just curious, you mentioned Python multiple times. Is that because of PyTorch and TensorFlow, or because Python's the most friendly with machine learning, or because it's very common amongst all developers? >> That's a great question. Python is the language that people are using to do machine learning, so it's the natural starting point. Now, of course, Ray is actually designed in a language-agnostic way, and there are companies out there that use Ray to build scalable Java applications. But for the most part right now, we're focused on Python and being the best way to build these scalable Python and machine learning applications. But, of course, down the road there always is that potential. >> So if you're slinging Python code out there and you're watching this video, get on the Anyscale bus quickly. Also, while you were giving the presentation, I couldn't help myself, since you mentioned OpenAI. By the way, congratulations, 'cause they've had great scale. I've noticed their rapid growth; they got to their number of users faster than anyone in the history of the computer industry. So major success, OpenAI and ChatGPT, huge fan. I'm not a skeptic at all. I think it's just the beginning, so congratulations. But I actually typed into ChatGPT, "What are the top three benefits of Anyscale?" and it came up with scalability, flexibility, and ease of use. Obviously, scalability is right in what you guys are called. >> That's pretty good. >> So that's what it came up with. So they nailed it. Did you have some inside prompt training there, by chance? Only kidding. (Robert laughs) >> Yeah, we hard coded that one. >> But that's the kind of thing that came up really, really quickly. If I asked it to write a sales document, it probably would. But this is the future interface. This is why people are getting excited about the foundation models and the large language models: because it's allowing the interface with the user, the consumer, to be more human, more natural. And this clearly will be in every application in the future. >> Absolutely. This is how people are going to interface with software, how they're going to interface with products in the future. It's not just a chat bot that you talk to. This is going to be how you get things done, right. How you use your web browser, or how you use, you know, Photoshop, or how you use other products. Like, you're not going to spend hours learning all the APIs and how to use them. You're going to talk to it and tell it what you want it to do. And of course, if it doesn't understand, it's going to ask clarifying questions. You're going to have a conversation, and then it'll figure it out. >> This is going to be one of those things where we're going to look back at this time, Robert, and say, "Yeah, that company, that was the beginning of that wave." And just like AWS and Cloud Computing, the folks who got in early really were in position when, say, the pandemic came.
So getting in early is a good thing, and that's what everyone's talking about, getting in early and playing around, maybe replatforming or even picking one or a few apps to refactor with some staff and managed services. So people are definitely jumping in. So I have to ask you the ROI cost question. You mentioned some of that earlier, Moore's Law versus what's going on in the industry. When you look at that kind of scale, the first thing that jumps out at people is, "Okay, I love it. Let's go play around. But what's it going to cost me? Am I going to be tied to certain GPUs?" What's the landscape look like from an operational standpoint, from the customer? Are they locked in? And the benefit was flexibility, are you flexible enough to handle any Cloud? Basically, that's my question. What's the customer looking at? >> Cost is super important here, and many of the companies, I mean, companies are spending a huge amount on their Cloud computing, on AWS, and on doing AI, right. And I think a lot of the advantage of Anyscale, what we can provide here, is not only better performance but cost efficiency. Because if we can run something faster and more efficiently, it can also use less resources, and you can lower your Cloud spending, right. We've seen companies go from 20% GPU utilization with their current setup and the current tools they're using, to running on Anyscale and getting more like 95, even 100% GPU utilization. That's something like a 5x improvement right there. So depending on the kind of application you're running, it's a significant cost savings. We've seen companies that are processing petabytes of data every single day with Ray get order-of-magnitude cost savings by switching from what they were previously doing to running their application on Ray. And when you have applications that are spending potentially $100 million a year, getting a 10x cost savings is just absolutely enormous. So these are some of the kinds of- >> Data infrastructure is super important. Again, if you're a prospect for this and thinking about going in here, just like the Cloud, you got infrastructure, you got the platform, you got SaaS. The same kind of thing's going to go on in AI. So I want to get into that ROI discussion and some of the impact with your customers that are leveraging the platform. But first, I hear you got a demo. >> Robert: Yeah, so let me give you a quick run through here. So what I have open here is the Anyscale UI. I've started a little Anyscale Workspace. Workspaces are the Anyscale concept for interactive development, right. So here, imagine you want to have a familiar experience, like you're developing on your laptop. And here I have a terminal. It's not on my laptop. It's actually in the cloud, running on Anyscale. And I'm just going to kick this off. This is going to train a large language model, OPT. And it's doing this on 32 GPUs. We've got a cluster here with a bunch of CPU cores, a bunch of memory. And as that's running, and by the way, if I wanted to run this on 64 or 128 GPUs instead of 32, that's just a one line change when I launch the Workspace. And what I can do is I can pull up VS Code, right. Remember, this is the interactive development experience. I can look at the actual code. Here it's using Ray Train to train the Torch model. 
We've got the training loop, and we're saying that each worker gets access to one GPU and four CPU cores. And, of course, as I make the model larger, this is using DeepSpeed, as I make the model larger, I could increase the number of GPUs that each worker gets access to, right, and how that is distributed across the cluster. And if I wanted to run on CPUs instead of GPUs, or a different accelerator type, again, this is just a one line change. And here we're using Ray Train to train the model, just taking my vanilla PyTorch model using Hugging Face and then scaling that across a bunch of GPUs. And, of course, if I want to look at the dashboard, I can go to the Ray dashboard. There are a bunch of different visualizations I can look at. I can look at the GPU utilization. I can look at the CPU utilization here, where I think we're currently loading the model and running that actual application to start the training. And some of the things that are really convenient here about Anyscale: I can get that interactive development experience with VS Code, I can look at the dashboards, I can monitor what's going on. I have a terminal, and it feels like my laptop, but it's actually running on a large cluster, with however many GPUs or other resources I want. And so it's really trying to combine the best of having the familiar experience of programming on your laptop with the benefits of being able to take advantage of all the resources in the Cloud to scale. And when you're talking about cost efficiency, one of the biggest reasons that people waste money, one of the silly reasons for wasting money, is just forgetting to turn off your GPUs. And what you can do here is, of course, things will auto terminate if they're idle. But imagine you go to sleep and I have this big cluster. You can turn it off, shut off the cluster, come back tomorrow, restart the Workspace, and your big cluster is back up and all of your code changes are still there, all of your local file edits. It's like you just closed your laptop and came back and opened it up again. And so this is the kind of experience we want to provide for our users. So that's what I wanted to share with you. >> Well, I think that whole, a couple of things: lines of code change, a single line of code change, that's game changing. And then the cost thing, I mean, human error is a big deal. People pass out at their computer. They've been coding all night or they just forget about it. And then it's just like leaving the lights on or your water running in your house. At the scale that it is, the numbers will add up. That's a huge deal. Compute back in the old days, if there was no work, okay, it was just compute sitting there idle. But now you've got data cranking through the models, and that's a big point. >> Another thing I want to add there about cost efficiency is that we make it really easy, if you're running on Anyscale, to use spot instances, these preemptible instances that can just be significantly cheaper than the on-demand instances. And so when we see our customers go from what they were doing before to using Anyscale, they go from not using these spot instances, 'cause they don't have the infrastructure around it, the fault tolerance to handle the preemption and things like that, to being able to just check a box, use spot instances, and save a bunch of money. 
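For readers following along, here is a sketch of the Ray Train pattern the demo walks through: a vanilla PyTorch training loop handed to a `TorchTrainer`, with each worker getting one GPU and four CPU cores, and scaling out being a one-line change to `num_workers`. The tiny linear model is a stand-in for the OPT/Hugging Face model in the demo, and the code follows the Ray 2.x API as we understand it; treat it as an illustrative sketch, not the demo's actual source.

```python
import torch
import torch.nn as nn
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # ordinary PyTorch; Ray wraps the model for distributed data parallel
    model = ray.train.torch.prepare_model(nn.Linear(8, 1))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(config["epochs"]):
        x = torch.randn(64, 8)            # toy batch in place of real data
        loss = (model(x) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 3},
    scaling_config=ScalingConfig(
        num_workers=32,                   # the one line you change for 64, 128, ...
        use_gpu=True,
        resources_per_worker={"CPU": 4, "GPU": 1},  # one GPU, four cores per worker
    ),
)
result = trainer.fit()
```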
>> You know, this was my whole feature article at re:Invent last year when I met with Adam Selipsky. This next gen Cloud is here. I mean, it's not auto scale, it's infrastructure scale. It's agility. It's flexibility. I think this is where the world needs to go. Almost what DevOps did for Cloud. And what you were showing me in that demo had this whole SRE vibe. Remember, Google had site reliability engineers to manage all those servers. This is kind of like an SRE vibe for data at scale. I mean, a similar kind of order of magnitude. I might be a little bit off base there, but how would you explain it? >> It's a nice analogy. I mean, what we are trying to do here is get to the point where developers don't think about infrastructure, where developers only think about their application logic, and where businesses can do AI, can succeed with AI, and build these scalable applications, but they don't have to build an infrastructure team. They don't have to develop that expertise. They don't have to invest years in building their internal machine learning infrastructure. They can just focus on the Python code, on their application logic, and run the stuff out of the box. >> Awesome. Well, I appreciate the time. Before we wrap up here, give a plug for the company. I know you got a couple websites. Ray's got its own website, you got Anyscale, you got an event coming up. Give a plug for the company, and I know you're looking to hire. >> Yeah, absolutely. Thank you. So first of all, we think AI is really going to transform every industry, and the opportunity is there, right. We can be the infrastructure that enables all of that to happen, that makes it easy for companies to succeed with AI and get value out of AI. If you're interested in learning more about Ray: Ray has been emerging as the standard way to build scalable applications. Our adoption has been exploding. I mentioned companies like OpenAI using Ray to train their models, but really it's across the board, companies like Netflix and Cruise and Instacart and Lyft and Uber, and that's just among tech companies. It's across every industry: gaming companies, agriculture, farming, robotics, drug discovery, FinTech, we see it across the board. And all of these companies can get value out of AI, can really use AI to improve their businesses. So if you're interested in learning more about Ray and Anyscale, we have our Ray Summit coming up in September. This is going to highlight a lot of the most impressive use cases and stories across the industry. And if your business wants to use LLMs, you want to train these large language models, you want to fine tune them with your data, you want to deploy them, serve them, and build applications and products around them, give us a call, talk to us. We can really take the infrastructure piece off the critical path and make that easy for you. So that's what I would say. And, like you mentioned, we're hiring across the board: engineering, product, go-to-market. It's an exciting time. >> Robert Nishihara, co-founder and CEO of Anyscale, congratulations on a great company you've built and are continuing to iterate on. You've got growth ahead of you, you've got a tailwind. I mean, the AI wave is here. 
I think OpenAI and ChatGPT, a customer of yours, have really opened up the mainstream visibility into this new generation of applications, user interfaces, the role of data, large scale, how to make that programmable, so we're going to need that infrastructure. So thanks for coming on this season three, episode one of our ongoing series on the hot startups. In this case, this episode features the top startups building foundational model infrastructure for AI and ML. I'm John Furrier, your host. Thanks for watching. (upbeat music)
Kirk Haslbeck, Collibra, Data Citizens 22
(atmospheric music) >> Welcome to theCUBE coverage of Data Citizens 2022, Collibra's customer event. My name is Dave Vellante. With us is Kirk Haslbeck, who's the Vice President of Data Quality at Collibra. Kirk, good to see you, welcome. >> Thanks for having me, Dave. Excited to be here. >> You bet. Okay, we're going to discuss data quality, observability. It's a hot trend right now. You founded a data quality company, OwlDQ, and it was acquired by Collibra last year. Congratulations. And now you lead data quality at Collibra. So we're hearing a lot about data quality right now. Why is it such a priority? Take us through your thoughts on that. >> Yeah, absolutely. It's definitely exciting times for data quality, which, you're right, has been around for a long time. So why now? And why is it so much more exciting than it used to be? I think it's a bit stale, but we all know that companies use more data than ever before, and the variety has changed and the volume has grown. And while I think that remains true, there are a couple other hidden factors at play that everyone's so interested in as to why this is becoming so important now. And I guess you could kind of break this down simply and think about it. If Dave, you and I were going to build a new healthcare application and monitor the heartbeat of individuals, imagine if we get that wrong, what the ramifications could be, what those incidents would look like. Or maybe better yet, we try to build a new trading algorithm with a crossover strategy where the 50-day crosses the 10-day average. And imagine if the data underlying the inputs to that is incorrect. We will probably have major financial ramifications in that sense. So it kind of starts there, where everybody's realizing that we're all data companies, and if we are using bad data we're likely making incorrect business decisions. But I think there are kind of two other things at play. I bought a car not too long ago and my dad called and said, "How many cylinders does it have?" And I realized in that moment, I might have failed him 'cause I didn't know. And I used to ask those types of questions about anti-lock brakes and cylinders, and if it's manual or automatic. And I realized, I now just buy a car that I hope works. And it's so complicated with all the computer chips, I really don't know that much about it. And that's what's happening with data. We're just loading so much of it. And it's so complex that the way companies consume it in the IT function is that they bring in a lot of data and then they syndicate it out to the business. And it turns out that the individuals loading and consuming all of this data for the company actually may not know that much about the data itself, and that's not even their job anymore. So, we'll talk more about that in a minute, but that's really what's setting the foreground for this observability play and why everybody's so interested. It's because we're becoming less close to the intricacies of the data, and we just expect it to always be there and be correct. >> You know, the other thing too about data quality: for years we did the MIT CDOIQ event. We didn't do it last year; COVID messed everything up. But the observation I would make there, and I'd like your thoughts on this, is that data quality, it used to be information quality, used to be this back office function, and then it became sort of front office with financial services, and government and healthcare, these highly regulated industries. 
And then the whole chief data officer thing happened, and people were realizing, well, they sort of flipped the bit from data as a risk to data as an asset. And now, as we say, we're going to talk about observability. And so it's really become front and center, just the whole quality issue, because data's so fundamental, hasn't it? >> Yeah, absolutely. I mean, let's imagine we pull up our phones right now and I go to my favorite stock ticker app, and I check out the Nasdaq market cap. I really have no idea if that's the correct number. I know it's a number, it looks large, it's in a numeric field. And that's kind of what's going on. There are so many numbers, and they're coming from all of these different sources and data providers, and they're getting consumed and passed along. But there isn't really a way to tactically put controls on every number and metric across every field we plan to monitor. But with the scale that we've achieved, even in the early days before Collibra, what's been so exciting is we have these types of observation techniques, these data monitors that can actually track past performance of every field at scale. And why that's so interesting, and why I think the CDO is listening intently nowadays to this topic, is: maybe we could surface all of these problems with the right solution of data observability and with the right scale, and then just be alerted on breaking trends. So we're sort of shifting away from this world of "must write a condition, and then when that condition breaks," that was always known as a break record. But what about breaking trends and root cause analysis? And is it possible to do that with less human intervention? And so I think most people are seeing now that it's going to have to be a software tool and a computer system. It's not ever going to be based on one or two domain experts anymore. >> So how does data observability relate to data quality? Are they sort of two sides of the same coin? Are they cousins? What's your perspective on that? >> Yeah, it's super interesting. It's an emerging market, so the language is changing, a lot of the topics and areas are changing. The way that I like to break it down, because the lingo is constantly moving as a target in this space, is really breaking records versus breaking trends. I could write a condition: when this thing happens it's wrong, and when it doesn't it's correct. Or I could look for a trend, and I'll give you a good example. Everybody's talking about fresh data and stale data, and why would that matter? Well, if your data never arrived, or only part of it arrived, or it didn't arrive on time, it's likely stale, and there will not be a condition that you could write that would show you all the goods and the bads. That was kind of your traditional approach of data quality break records. But your modern day approach is: you lost a significant portion of your data, or it did not arrive on time to make that decision accurately, on time. And that's a hidden concern. Some people call this freshness, we call it stale data. But it all points to the same idea: the thing that you're observing may not be a data quality condition anymore. It may be a breakdown in the data pipeline. And with thousands of data pipelines in play for every company out there, there's more than a couple of these happening every day. 
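A toy illustration of the distinction Kirk is drawing, using nothing but daily row counts: a break record needs someone to guess a static threshold, while a breaking-trend monitor learns the normal band from the field's own history. The numbers and the three-sigma cutoff are assumptions for illustration, not Collibra's actual algorithm.

```python
import statistics

# ten prior daily loads of the same table (made-up history)
daily_row_counts = [10_120, 9_980, 10_340, 10_055, 10_210,
                    10_180, 9_905, 10_260, 10_330, 10_045]
today = 6_450  # today's load arrived short: likely a stale or partial load

mean = statistics.mean(daily_row_counts)
stdev = statistics.stdev(daily_row_counts)
z = (today - mean) / stdev

# A break record would need a hand-written rule like "alert if < 9,000".
# A breaking-trend monitor derives the normal band from history instead:
if abs(z) > 3:
    print(f"ALERT: row count {today} is {z:.1f} sigma from normal "
          f"({mean:.0f} +/- {stdev:.0f}); possible stale or partial data")
```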
>> So what's the Collibra angle on all this stuff? You made the acquisition, you've got data quality and observability coming together. You guys have a lot of expertise in this area, but you hear provenance of data, you just talked about stale data, the whole trend toward realtime. How is Collibra approaching the problem and what's unique about your approach? >> Well, I think where we're fortunate is with our background. Myself and the team, we sort of lived this problem for a long time in the Wall Street days, about a decade ago, and we saw it from many different angles. And what we came up with, before it was called data observability or reliability, was basically the underpinnings of that. So we're a little bit ahead of the curve there when most people evaluate our solution. It's more advanced than some of the observation techniques that currently exist. But we've also always covered data quality, and we believe that people want to know more, they need more insights, and they want to see break records and breaking trends together so they can correlate the root cause. And we hear that all the time: "I have so many things going wrong, just show me the big picture. Help me find the thing that, if I were to fix it today, would make the most impact." So we're really focused on root cause analysis, business impact, connecting it with lineage and catalog metadata. And as that grows, you can actually achieve total data governance. At this point, with the acquisition of what was a lineage company years ago, and then my company OwlDQ, now Collibra Data Quality, Collibra may be the best positioned for total data governance and intelligence in the space. >> Well, you mentioned financial services a couple of times, and some examples. Remember the flash crash in 2010? Nobody had any idea what that was. They would just say, "Oh, it's a glitch." So they didn't understand the root cause of it. So this is a really interesting topic to me. So we know at Data Citizens 22 that you're announcing, well, you've got to announce new products, right? It is your yearly event. What's new? Give us a sense as to what products are coming out, but specifically around data quality and observability. >> Absolutely. There's always a next thing on the forefront, and the one right now is these hyperscalers in the cloud. So you have databases like Snowflake and BigQuery, and Databricks with Delta Lake and SQL pushdown. And ultimately what that means is a lot of people are storing and loading data even faster, in a SaaS-like model. And we've started to hook into these databases, and while we've always worked with the same databases in the past, they're supported today, we're doing something called native database pushdown, where the entire compute and data activity happens in the database. And why is that so interesting and powerful now? Everyone's concerned with something called egress. Did my data, that I've spent all this time and money with my security team securing, ever leave my hands? Did it ever leave my secure VPC, as they call it? And with these native integrations that we're building, and about to unveil here as kind of a sneak peek for next week at Data Citizens, we're now doing all compute and data operations in databases like Snowflake. And what that means is, with no install and no configuration, you could log into the Collibra data quality app and have all of your data quality running inside the database that you've probably already picked as your go-forward, secured database of choice. So we're really excited about that. 
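To make the pushdown idea concrete: the quality check is expressed as SQL and executed by the warehouse itself, so the rows never leave the platform and there is effectively no egress. The sketch below shows that shape generically; the connection details are placeholders, the profiling query is hand-written for illustration, and Collibra's product generates its own SQL internally.

```python
import snowflake.connector  # assumes the Snowflake Python connector is installed

# the whole computation runs inside the warehouse; only one tiny result
# row ever leaves the platform
profile_sql = """
    SELECT COUNT(*)                      AS row_count,
           COUNT_IF(customer_id IS NULL) AS null_ids,
           COUNT(DISTINCT customer_id)   AS distinct_ids,
           MAX(load_ts)                  AS last_loaded_at
    FROM analytics.public.customers
"""

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
    warehouse="dq_wh",
)
row = conn.cursor().execute(profile_sql).fetchone()
print(dict(zip(["row_count", "null_ids", "distinct_ids", "last_loaded_at"], row)))
```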
And I think if you look at the whole landscape of network cost, egress cost, data storage and compute, what people are realizing is it's extremely efficient to do it in the way that we're about to release here next week. >> So this is interesting, because what you just described, you mentioned Snowflake, you mentioned Google, and actually you mentioned Databricks too. You know, Snowflake has the data cloud. If you put everything in the data cloud, okay, you're cool. But then Google's got the open data cloud, if you watched Google Next. And now Databricks doesn't call it the data cloud, but they have like the open source data cloud. So you have all these different approaches, and there's really no way, up until now I'm hearing, to really understand the relationships between all those and have confidence across them. It's like Zhamak Dehghani says, you should just be a node on the mesh. I don't care if it's a data warehouse or a data lake, or where it comes from, but it's a point on that mesh, and I need tooling to be able to have confidence that my data is governed and has the proper lineage, provenance. And that's what you're bringing to the table. Is that right? Did I get that right? >> Yeah, that's right. And for us, it's not that we haven't been working with those great cloud databases, but it's the fact that we can send them the instructions now. We can send them the operating ability to crunch all of the calculations, the governance, the quality, and get the answers. And what that's doing is basically zero network cost, zero egress cost, zero latency of time. And so when you log into BigQuery tomorrow using our tool, or say Snowflake, for example, you have instant data quality metrics, instant profiling, instant lineage and access privacy controls, things of that nature that just become less onerous. What we're seeing is there's so much technology out there, just like all of the major brands that you mentioned, but how do we make it easier? The future is about less clicks, faster time to value, faster scale, and eventually lower cost. And we think that this positions us to be the leader there. >> I love this example because we've talked about how the cloud guys were going to own the world. And of course now we're seeing that the ecosystem is finding so much white space to add value and connect across clouds. Sometimes we call it supercloud, or interclouding. Alright, Kirk, give us your final thoughts on the trends that we've talked about and Data Citizens 22. >> Absolutely. Well, I think one big trend is discovery and classification. We're seeing that across the board. People used to know a field was a zip code, and nowadays, with the amount of data that's out there, they want to know where everything is, where their sensitive data is, if it's redundant. Tell me everything, inside of three to five seconds. And with that comes wanting to know how fast they can get controls and insights out of their tools in all of these hyperscale databases. So I think we're going to see more one click solutions, more SaaS based solutions, and solutions that hopefully prove faster time to value on all of these modern cloud platforms. >> Excellent. All right, Kirk Haslbeck, thanks so much for coming on theCUBE and previewing Data Citizens 22. Appreciate it. >> Thanks for having me, Dave. >> You're welcome. All right. And thank you for watching. Keep it right there for more coverage from theCUBE. (atmospheric music)
Collibra Data Citizens 22
>> Collibra is a company that was founded in 2008, right before the so-called modern big data era kicked into high gear. The company was one of the first to focus its business on data governance. Now, historically, data governance and data quality initiatives were back office functions, and they were largely confined to regulated industries that had to comply with public policy mandates. But as the cloud went mainstream, the tech giants showed us how valuable data could become, and the value proposition for data quality and trust evolved from primarily a compliance driven issue to becoming a lynchpin of competitive advantage. But data in the decade of the 2010s was largely about getting the technology to work. You had these highly centralized technical teams that were formed, and they had hyper specialized skills to develop data architectures and processes to serve the myriad data needs of organizations. And it resulted in a lot of frustration with data initiatives for most organizations that didn't have the resources of the cloud guys and the social media giants to really attack their data problems and turn data into gold. This is why today, for example, there's quite a bit of momentum toward rethinking monolithic data architectures. You hear about initiatives like data mesh and the idea of data as a product. They're gaining traction as a way to better serve the data needs of decentralized business users, and you hear a lot about data democratization. So these decentralization efforts around data, they're great, but they create a new set of problems. Specifically, how do you deliver a self-service infrastructure to business users and domain experts? Now the cloud is definitely helping with that, but also, how do you automate governance? This becomes especially tricky as protecting data privacy has become more and more important. >> In other words, while it's enticing to experiment and run fast and loose with data initiatives, kinda like the Wild West, to find new veins of gold, it has to be done responsibly. As such, the idea of data governance has had to evolve to become more automated, and intelligent governance and data lineage are still fundamental to ensuring trust as data moves like water through an organization. No one is gonna use data that isn't trusted. Metadata has become increasingly important for data discovery and data classification. As data flows through an organization, the ability to continuously check for data flaws and automate data quality has become a functional requirement of any modern data management platform. And finally, data privacy has become a critical adjacency to cyber security. So you can see how data governance has evolved into a much richer set of capabilities than it was 10 or 15 years ago. >> Hello and welcome to theCUBE's coverage of Data Citizens, made possible by Collibra, a leader in so-called data intelligence and the host of Data Citizens 2022, which is taking place in San Diego. My name is Dave Vellante and I'm one of the hosts of our program, which is running in parallel to Data Citizens. Now at theCUBE we like to say we extract the signal from the noise, and over the next couple of days we're gonna feature some of the themes from the keynote speakers at Data Citizens, and we'll hear from several of the executives. Felix Van de Maele, who is the co-founder and CEO of Collibra, will join us, along with one of the other founders of Collibra, Stan Christiaens, who's gonna join my colleague Lisa Martin. 
I'm gonna also sit down with Laura Sellers. She's the Chief Product Officer at Collibra. We'll talk about some of the announcements and innovations they're making at the event, and then we'll dig in further to data quality with Kirk Haslbeck. >> He's the Vice President of Data Quality at Collibra. He's an amazingly smart dude who founded OwlDQ, a company that he sold to Collibra last year. Now many companies, they didn't make it through the Hadoop era, you know; they missed the industry waves and they became driftwood. Collibra, on the other hand, has evolved its business. They've leveraged the cloud, expanded their product portfolio, and leaned in heavily to some major partnerships with cloud providers, as well as receiving a strategic investment from Snowflake earlier this year. So it's a really interesting story that we're thrilled to be sharing with you. Thanks for watching, and I hope you enjoy the program. >> Last year, theCUBE covered Data Citizens, Collibra's customer event. And the premise that we put forth prior to that event was that despite all the innovation that's gone on over the last decade or more with data, you know, starting with the Hadoop movement, we had data lakes, we had Spark, the ascendancy of programming languages like Python, the introduction of frameworks like TensorFlow, the rise of AI, low code, no code, et cetera, businesses still find it's too difficult to get more value from their data initiatives. And we said at the time, you know, maybe it's time to rethink data innovation. While a lot of the effort has been focused on more efficiently storing and processing data, perhaps more energy needs to go into thinking about the people and the process side of the equation, meaning making it easier for domain experts to both gain insights from data, trust the data, and begin to use that data in new ways, fueling data products, monetization and insights. Data Citizens 2022 is back, and we're pleased to have Felix Van de Maele, who is the founder and CEO of Collibra. He's on theCUBE, and we're excited to have you, Felix. Good to see you again. >> Likewise, Dave. Thanks for having me again. >> You bet. All right, we're gonna get the update from Felix on the current data landscape, how he sees it, why data intelligence is more important now than ever, and get current on what Collibra has been up to over the past year and what's changed since Data Citizens 2021. And we may even touch on some of the product news. So Felix, we're living in a very different world today, with businesses and consumers struggling with things like supply chains and uncertain economic trends, and we're not just snapping back to the 2010s. That's clear, and that's really true as well in the world of data. So what's different in your mind in the data landscape of the 2020s from the previous decade, and what challenges does that bring for your customers? >> Yeah, absolutely. And I think you said it well, Dave, in the intro: that rising complexity and fragmentation in the broader data landscape hasn't gotten any better over the last couple of years. When we talk to our customers, that level of fragmentation, the complexity, how do we find data that we can trust, that we know we can use, has only gotten kinda more difficult. So that trend is continuing. I think what is changing is that the trend has become much more acute. 
Well, the other thing we've seen over the last couple of years is that the level of scrutiny that organizations are under with respect to data, as data becomes more mission critical, as data becomes more impactful and important, the level of scrutiny with respect to privacy, security, regulatory compliance, is only increasing as well. Which, again, is really difficult in this environment of continuous innovation, continuous change, continuous growing complexity and fragmentation. >> So it's become much more acute. And, to your earlier point, we do live in a different world, and in the past couple of years we could probably just kind of brute-force it, right? We could focus on the top line. There were enough investments to be had. I think nowadays organizations are in a very different environment, where there's much more focus on cost control, productivity, efficiency. How do we truly get value from that data? So again, I think it's just another incentive for organizations to now truly look at data and to scale data, not just from a technology and infrastructure perspective, but how do you actually scale data from an organizational perspective, right? You said it: the people and process. How do we do that at scale? And that's only becoming much more important. And we do believe that the economic environment that we find ourselves in today is gonna be a catalyst for organizations to really dig in more seriously, if you will, than they maybe have in the past. >> You know, I don't know when you guys founded Collibra, if you had a sense as to how complicated it was gonna get, but you've been on a mission to really address these problems from the beginning. How would you describe your mission, and what are you doing to address these challenges? >> Yeah, absolutely. We started Collibra in 2008, so in some sense in the last financial crisis, and that was really the start of Collibra, where we found product-market fit working with large financial institutions to help them cope with the increasing compliance requirements that they were faced with because of the financial crisis. And kind of here we are again, in a very different environment of course, almost 15 years later, but with data only becoming more important. But our mission, to deliver trusted data for every user, every use case, and across every source, frankly, has only become more important. So while it has been an incredible journey over the last 14, 15 years, I think we're still relatively early in our mission to, again, be able to provide everyone, and that's why we call it data citizens, we truly believe that everyone in the organization should be able to use trusted data in an easy manner. That mission is only becoming more important, more relevant. We definitely have a lot more work ahead of us, because we are still relatively early in that journey. >> Well, that's interesting, because, you know, in my observation it takes seven to 10 years to actually build a company, and so the fact that you're still in the early days is kind of interesting. I mean, Collibra's had a good 12 months or so since we last spoke at Data Citizens. Give us the latest update on your business. What do people need to know about your current momentum? >> Yeah, absolutely. 
Again, there's a lot of tailwind. Organizations are only now maturing their data practices, and we've seen that transform, or influence, a lot of the business growth that we've seen: broader adoption of the platform. We work with some of the largest organizations in the world, like Adobe, Heineken, Bank of America, and many more. We have now over 600 enterprise customers, all industry leaders, in every single vertical. So it's really exciting to see that, and to continue to partner with those organizations. On the partnership side, again, a lot of momentum in the market with some of the cloud partners like Google, Amazon, Snowflake, Databricks and others, right? As those kind of new modern data infrastructures, modern data architectures, that are definitely all moving to the cloud, there's a great opportunity for us, our partners, and of course our customers, to help them kind of transition to the cloud even faster. And so we see a lot of excitement and momentum there. We did an acquisition about 18 months ago around data quality, data observability, which we believe is an enormous opportunity. Of course data quality isn't new, but I think there are a lot of reasons why we're so excited about quality and observability now. One is around leveraging AI, machine learning, again, to drive more automation. And the second is that those data pipelines that are now being created in the cloud, in these modern data architectures, they've become mission critical. They've become real time. And so monitoring, observing those data pipelines continuously has become absolutely critical, so we're really excited about that as well. And on the organizational side, I'm sure you've heard the term data mesh, something that's gaining a lot of momentum, rightfully so. It's really the type of governance that we've always believed in: federated, focused on domains, giving a lot of ownership to different teams. I think that's the way to scale data organizations. And so that aligns really well with our vision, and from a product perspective we've seen a lot of momentum with our customers there as well. >> Yeah, you know, a couple things there. I mean, the acquisition of OwlDQ, you know, Kirk Haslbeck and their team. It's interesting: the whole data quality thing used to be this back office function, really confined to highly regulated industries. It's come to the front office. It's top of mind for chief data officers. Data mesh, which you mentioned: you guys are a connective tissue for all these different nodes on the data mesh. That's key. And of course we see you at all the shows. You're a critical part of many ecosystems, and you're developing your own ecosystem. So let's chat a little bit about the products. We're gonna go deeper into products later on at Data Citizens 22, but we know you're debuting some new innovations, you know, whether it's under the covers in security, or making data more accessible for people, or dealing with workflows and processes, as you talked about earlier. Tell us a little bit about what you're introducing. >> Yeah, absolutely. We're super excited. A ton of innovation. 
And if we think about the big theme, like I said, we're still relatively early in this journey towards kind of that mission of data intelligence. It's a really bold and compelling mission, and our customers are just starting on that journey. We wanna make it as easy as possible for organizations to actually get started, because we know how important it is that they do. And for customers that have been with us for some time, there's still a tremendous amount of opportunity to expand the platform further. And again, to make it easier to accomplish that mission and vision around the data citizen: that everyone has access to trustworthy data in a very easy way. So that's really the theme of a lot of the innovation that we're driving: a lot of ease of adoption, ease of use, but also, how do we make sure that Collibra becomes this kind of mission critical enterprise platform, from a security, performance, architecture, scale and supportability perspective, so that we're truly able to deliver that kind of enterprise mission critical platform. And so that's the big theme. From a product perspective, there's a lot of new innovation that we're really excited about. A couple of highlights. One is around the data marketplace. Again, a lot of our customers have plans in that direction. How do we make it easy? How do we make available a true kind of shopping experience, so that anybody in your organization can, in a very easy, search-first way, find the right data product, find the right data set, and then consume that data? Usage analytics: how do we help organizations drive adoption, tell them where things are working really well and where they have opportunities? Homepages, again, to make things easy for anyone in your organization to kind of get started with the platform. You mentioned the workflow designer. Again, we have a very powerful enterprise platform, and one of our key differentiators is the ability to really drive a lot of automation through workflows. And now we've provided a new low code, no code kind of workflow designer experience, so customers can really take it to the next level. There's a lot more new product around Collibra Protect, which, in partnership with Snowflake, which has been a strategic investor in Collibra, is focused on how we make access governance easier. How are we able to make sure that, as you move to the cloud, things like access management and masking around sensitive data, PII data, are managed in a much more effective way? Really excited about that product. There's more around data quality. Again, how do we get that deployed as easily and quickly and widely as we can? Moving that to the cloud has been a big part of our strategy. So we launched more data quality cloud product, as well as making use of those native compute capabilities in platforms like Snowflake, Databricks, Google, Amazon, and others. And so we're introducing a capability that we call pushdown: actually pushing down the compute for data quality, the monitoring, into the underlying platform, which again, from a scale, performance and ease of use perspective, is gonna make a massive difference. And then more broadly, we talked a little bit about the ecosystem. Again, integrations: we talk about being able to connect to every source. 
Integrations are absolutely critical, and we're really excited to deliver new integrations with Snowflake, Azure and Google Cloud storage as well. So there's a lot coming out. The team has been hard at work, and we're really excited about what we're bringing to market. >> Yeah, a lot going on there. I wonder if you could give us your closing thoughts. I mean, you talked about the marketplace, you think about data mesh, you think of data as product, one of the key principles, you think about monetization. This is really different than what we've been used to in data, where just getting the technology to work has been so hard. So how do you see the future? Give us your closing thoughts, please. >> Yeah, absolutely. And I think we're really at this pivotal moment, and I think you said it well. We all know the constraints and the challenges with data, how to actually do data at scale. And while we've seen a ton of innovation on the infrastructure side, we fundamentally believe that just getting a faster database is important, but it's not gonna fully solve the challenges and truly kind of deliver on the opportunity. And that's why now is really the time to deliver this data intelligence vision, this data intelligence platform. We are still early. Making it as easy as we can is kind of our mission. And so I'm really excited to see how the market's gonna evolve over the next few quarters and years. I think the trend is clearly there. When we talk about data mesh, this kind of federated approach focused on data products, it's just another signal that we believe a lot of organizations are now at the point where they understand the need to go beyond just the technology, to really think about how we actually scale data as a business function, just like we've done with IT, with HR, with sales and marketing, with finance. That's how we need to think about data. I think now is the time, given the economic environment that we are in, with much more focus on control, much more focus on productivity and efficiency. Now's the time we need to look beyond just the technology and infrastructure, to think about how to scale data, how to manage data at scale. >> Yeah, it's a new era. The next 10 years of data won't be like the last, as I always say. Felix, thanks so much, and good luck in San Diego. I know you're gonna crush it out there. >> Thank you, Dave. >> Yeah, it's a great spot for an in-person event, and of course the content post-event is gonna be available at collibra.com, and you can of course catch theCUBE coverage at thecube.net and all the news at siliconangle.com. This is Dave Vellante for theCUBE, your leader in enterprise and emerging tech coverage. >> Hi, I'm Jay from Collibra's Data Office. Today I want to talk to you about Collibra's data intelligence cloud. We often say Collibra is a single system of engagement for all of your data. Now, when I say data, I mean data in the broadest sense of the word, including reference data and metadata. Think of metrics, reports, APIs, systems, policies, and even business processes that produce or consume data. Now, the beauty of this platform is that it ensures all of your users have an easy way to find, understand, trust, and access data. But how do you get started? Well, here are seven steps to help you get going. One, start with the data. What's data intelligence without data?
Leverage the Collibra data catalog to automatically profile and classify your enterprise data wherever that data lives: databases, data lakes or data warehouses, whether in the cloud or on premise. Two, you'll then wanna organize the data, and you'll do that with data communities. This can be by department, by line of business or functional team, however your organization organizes work and accountability. And for that, you'll establish community owners. Communities make it easy for people to navigate through the platform and find the data, and they'll help create a sense of belonging for users. An important and related side note here: we find it's typical in many organizations that data is thought of as just an asset, and IT and data offices are viewed as the owners of it, with those central teams performing analytics as a service provider to the enterprise. We believe data is more than an asset. It's a true product that can be converted to value. And that also means establishing business ownership of data, where that strategy and ROI come together with subject matter expertise. Okay, three. Next, back to those communities. There, the data owners should explain and define their data, not just the tables and columns, but also the related business terms, metrics and KPIs. These objects, which we call assets, are typically organized into business glossaries and data dictionaries. I definitely recommend starting with the topics that are most important to the business. Four, this is the step that enables you and your users to have some fun with it. Linking everything together builds your knowledge graph, also known as a metadata graph, by relating these assets to one another. For example, linking a data set to a KPI to a report now enables your users to see what we call the lineage diagram, which visualizes where the data in your dashboards actually came from, what the data means, and who's responsible for it. Speaking of which, here's five. Leverage the Collibra trusted business reporting solution on the marketplace, which comes with workflows for those owners to certify their reports, KPIs, and data sets. This helps them foster trust in their data. Six, easy-to-navigate dashboards or landing pages, right in your platform, for your company's business processes are the most effective way for everyone to better understand and take action on data. Here's a pro tip: use the dashboard design kit on the marketplace to help you build compelling dashboards. Finally, seven, promote the value of this to your users, and be sure to schedule enablement office hours and new employee onboarding sessions to get folks excited about what you've built and implemented. Better yet, invite all of those community and data owners to these sessions so that they can show off the value that they've created. Those are my seven tips to get going with Collibra. I hope these have been useful. For more information, be sure to visit collibra.com.
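As a generic illustration of step four, here is the knowledge-graph idea reduced to a few lines of Python: relate a report to the KPI it displays and the data sets that feed it, then walk the links to answer where a dashboard's data came from. The asset names are made up, and Collibra of course models this inside its own platform, not with a dict like this.

```python
# each asset maps to the assets it is derived from (made-up names)
lineage = {
    "Quarterly Revenue Report": ["Revenue KPI"],
    "Revenue KPI": ["orders_curated"],
    "orders_curated": ["orders_raw"],
    "orders_raw": [],
}

def upstream(asset, graph):
    """Depth-first walk of everything an asset ultimately depends on."""
    for parent in graph.get(asset, []):
        yield parent
        yield from upstream(parent, graph)

print(list(upstream("Quarterly Revenue Report", lineage)))
# -> ['Revenue KPI', 'orders_curated', 'orders_raw']
```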
Why is it such a priority? Take us through your thoughts on that. >>Yeah, absolutely. It's, it's definitely exciting times for data quality, which you're right, has been around for a long time. So why now and why is it so much more exciting than it used to be? I think it's a bit stale, but we all know that companies use more data than ever before and the variety has changed and the volume has grown. And, and while I think that remains true, there are a couple other hidden factors at play that everyone's so interested in as, as to why this is becoming so important now. And, and I guess you could kind of break this down simply and think about if Dave, you and I were gonna build, you know, a new healthcare application and monitor the heartbeat of individuals, imagine if we get that wrong, you know, what the ramifications could be, what, what those incidents would look like, or maybe better yet, we try to build a, a new trading algorithm with a crossover strategy where the 50 day crosses the, the 10 day average. >>And imagine if the data underlying the inputs to that is incorrect. We will probably have major financial ramifications in that sense. So, you know, it kind of starts there where everybody's realizing that we're all data companies and if we are using bad data, we're likely making incorrect business decisions. But I think there's kind of two other things at play. You know, I, I bought a car not too long ago and my dad called and said, How many cylinders does it have? And I realized in that moment, you know, I might have failed him because, cause I didn't know. And, and I used to ask those types of questions about any lock brakes and cylinders and, and you know, if it's manual or, or automatic and, and I realized I now just buy a car that I hope works. And it's so complicated with all the computer chips, I, I really don't know that much about it. >>And, and that's what's happening with data. We're just loading so much of it. And it's so complex that the way companies consume them in the IT function is that they bring in a lot of data and then they syndicate it out to the business. And it turns out that the, the individuals loading and consuming all of this data for the company actually may not know that much about the data itself, and that's not even their job anymore. So we'll talk more about that in a minute, but that's really what's setting the foreground for this observability play and why everybody's so interested. It, it's because we're becoming less close to the intricacies of the data and we just expect it to always be there and be correct. >>You know, the other thing too about data quality, and for years we did the MIT CDO IQ event, we didn't do it last year, Covid messed everything up. But the observation I would make there thoughts is, is it data quality? Used to be information quality used to be this back office function, and then it became sort of front office with financial services and government and healthcare, these highly regulated industries. And then the whole chief data officer thing happened and people were realizing, well, they sort of flipped the bit from sort of a data as a, a risk to data as a, as an asset. And now as we say, we're gonna talk about observability. And so it's really become front and center just the whole quality issue because data's so fundamental, hasn't it? >>Yeah, absolutely. I mean, let's imagine we pull up our phones right now and I go to my, my favorite stock ticker app and I check out the NASDAQ market cap. 
I really have no idea if that's the correct number. I know it's a number; it looks large, it's in a numeric field. And that's kind of what's going on. There are so many numbers, and they're coming from all of these different sources and data providers, and they're getting consumed and passed along. But there isn't really a way to tactically put controls on every number and metric across every field we plan to monitor. With the scale that we've achieved in early days, even before Collibra, what's been so exciting is that we have these types of observation techniques, these data monitors, that can actually track past performance of every field at scale. And why that's so interesting, and why I think the CDO is listening intently nowadays to this topic, is that maybe we could surface all of these problems with the right data observability solution, at the right scale, and then just be alerted on breaking trends. So we're shifting away from this world of "you must write a condition, and when that condition breaks, that's a break record." But what about breaking trends, and root cause analysis? And is it possible to do that with less human intervention? I think most people are seeing now that it's going to have to be a software tool and a computer system. It's never going to be based on one or two domain experts anymore. >> So how does data observability relate to data quality? Are they sort of two sides of the same coin? Are they cousins? What's your perspective on that? >> Yeah, it's super interesting. It's an emerging market, so the language is changing and a lot of the topics and areas are changing. The way I like to break it down, because the lingo is constantly moving as a target in this space, is really breaking records versus breaking trends. I could write a condition: when this thing happens, it's wrong, and when it doesn't, it's correct. Or I could look for a trend, and I'll give you a good example. Everybody's talking about fresh data and stale data, and why would that matter? Well, if your data never arrived, or only part of it arrived, or it didn't arrive on time, it's likely stale, and there is no condition that you could write that would show you all the goods and the bads. That was your traditional approach to data quality: break records. But the modern-day approach is: you lost a significant portion of your data, or it did not arrive on time to make that decision accurately, on time. And that's a hidden concern. Some people call this freshness, we call it stale data, but it all points to the same idea: the thing that you're observing may not be a data quality condition anymore; it may be a breakdown in the data pipeline. And with thousands of data pipelines in play for every company out there, there are more than a couple of these happening every day.
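Kirk's distinction between break records and breaking trends is easy to see in a sketch. The version below assumes a history of daily row counts and a simple three-sigma band; the numbers, threshold and function names are invented for illustration, and real observability tools use far richer statistics.

```python
# Sketch: rule-based "break record" vs. history-based "breaking trend".
# The trend check learns normal behavior from history instead of a fixed rule.
from statistics import mean, stdev

def break_record(value: float, rule) -> bool:
    """Classic data-quality condition: someone wrote the rule by hand."""
    return not rule(value)

def breaking_trend(history: list, value: float, sigmas: float = 3.0) -> bool:
    """Alert when today's value drifts outside the historical band."""
    mu, sd = mean(history), stdev(history)
    return abs(value - mu) > sigmas * sd

daily_row_counts = [10_120, 9_980, 10_340, 10_050, 10_210, 9_870, 10_150]

# A hand-written rule like "row count must be positive" still passes when
# half the load silently went missing...
print(break_record(5_200, rule=lambda v: v > 0))   # False -> no alert
# ...while the trend monitor flags it as a likely stale or partial load.
print(breaking_trend(daily_row_counts, 5_200))     # True  -> alert
```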
>> So what's the Collibra angle on all this? You made the acquisition, you've got data quality and observability coming together, and you guys have a lot of expertise in this area. But you hear about provenance of data, you just talked about stale data, the whole trend toward real time. How is Collibra approaching the problem, and what's unique about your approach? >> Well, I think where we're fortunate is that, with our background, myself and the team sort of lived this problem for a long time, in the Wall Street days about a decade ago. And we saw it from many different angles. And what we came up with, before it was called data observability or reliability, was basically the underpinnings of that. So we're a little bit ahead of the curve there when most people evaluate our solution; it's more advanced than some of the observation techniques that currently exist. But we've also always covered data quality, and we believe that people want to know more, they need more insights, and they want to see break records and breaking trends together so they can correlate the root cause. And we hear that all the time: "I have so many things going wrong; just show me the big picture. Help me find the thing that, if I were to fix it today, would make the most impact." So we're really focused on root cause analysis, business impact, and connecting it with lineage and catalog metadata. And as that grows, you can actually achieve total data governance. At this point, with the acquisition of what was a lineage company years ago, and then my company OwlDQ, now Collibra Data Quality, Collibra may be the best positioned for total data governance and intelligence in the space. >> Well, you mentioned financial services a couple of times, and some examples. Remember the flash crash in 2010? Nobody had any idea what that was; they just said, "Oh, it's a glitch," so they didn't understand the root cause of it. So this is a really interesting topic to me. Now, we know at Data Citizens '22 that you're announcing (you've got to announce new products at your yearly event, right?): what's new? Give us a sense as to what products are coming out, but specifically around data quality and observability. >> Absolutely. There's always a next thing on the forefront, and the one right now is these hyperscalers in the cloud. So you have databases like Snowflake and BigQuery, and Databricks' Delta Lake, and SQL pushdown. And ultimately what that means is a lot of people are storing and loading data even faster, in a SaaS-like model. And we've started to hook into these databases. While we've always worked with the same databases in the past, and they're supported today, we're now doing something called native database pushdown, where the entire compute and data activity happens in the database. And why that is so interesting and powerful now is that everyone's concerned with something called egress. Did my data, which I've spent all this time and money with my security team securing, ever leave my hands? Did it ever leave my secure VPC, as they call it? With these native integrations that we're building and about to unveil (here's a sneak peek for next week at Data Citizens), we're now doing all compute and data operations in databases like Snowflake. And what that means is that, with no install and no configuration, you could log into the Collibra Data Quality app and have all of your data quality running inside the database that you've probably already picked as your go-forward, secured database of choice. So we're really excited about that. And I think if you look at the whole landscape of network cost, egress cost, data storage and compute, what people are realizing is that it's extremely efficient to do it in the way that we're about to release here next week.
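The pushdown idea is that the profiling work runs where the data lives and only tiny aggregates cross the wire, so there is effectively no egress. A minimal sketch follows, with sqlite3 standing in for a warehouse like Snowflake or BigQuery; the profiling query is a hypothetical example, not Collibra's actual pushdown SQL.

```python
# Sketch of SQL pushdown: profiling runs inside the database engine, and
# only a few aggregate numbers come back over the network. sqlite3 is an
# in-process stand-in here for a cloud warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [("AAPL", 189.5), ("AAPL", None), ("MSFT", 402.1)])

# The whole table scan happens in the engine; we fetch three numbers back.
profile_sql = """
SELECT COUNT(*) AS total,
       SUM(CASE WHEN price IS NULL THEN 1 ELSE 0 END) AS null_prices,
       MIN(price) AS min_price
FROM trades
"""
total, null_prices, min_price = conn.execute(profile_sql).fetchone()
print(f"null rate: {null_prices / total:.1%}, min price: {min_price}")
```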
>> So this is interesting, because what you just described: you mentioned Snowflake, you mentioned Google, and yeah, Databricks. You know, Snowflake has the Data Cloud; if you put everything in the Data Cloud, okay, you're cool. But then Google's got the open data cloud, if you watched Google Next. And now Databricks doesn't call it the data cloud, but they have, like, the open source data cloud. So you have all these different approaches, and there's really been no way, up until now I'm hearing, to really understand the relationships between all of them and have confidence across them. It's like Zhamak Dehghani says: you should just be a node on the mesh. I don't care if it's a data warehouse or a data lake or where it comes from; it's a point on that mesh, and I need tooling to be able to have confidence that my data is governed and has the proper lineage and provenance. And that's what you're bringing to the table. Is that right? Did I get that right? >> Yeah, that's right. And for us, it's not that we haven't been working with those great cloud databases; it's the fact that we can send them the instructions now. We can send them the operating ability to crunch all of the calculations, the governance, the quality, and get the answers. And what that's doing is basically zero network cost, zero egress cost, zero latency of time. So when you log into BigQuery tomorrow using our tool, or, say, Snowflake for example, you have instant data quality metrics, instant profiling, instant lineage, and access privacy controls; things of that nature just become less onerous. What we're seeing is that there's so much technology out there, just like all of the major brands that you mentioned, but how do we make it easier? The future is about fewer clicks, faster time to value, faster scale, and eventually lower cost. And we think that this positions us to be the leader there. >> I love this example because, you know, Barry talks about how the cloud guys are going to own the world, and of course now we're seeing that the ecosystem is finding so much white space to add value and connect across clouds. Sometimes we call it supercloud, or inter-clouding. All right, Kirk, give us your final thoughts on the trends that we've talked about and Data Citizens '22. >> Absolutely. Well, I think one big trend is discovery and classification; we're seeing that across the board. People used to know, say, that a field was a zip code; nowadays, with the amount of data that's out there, they want to know where everything is, where their sensitive data is, whether it's redundant: tell me everything, inside of three to five seconds. And with that, they want to know how fast they can get controls and insights out of their tools in all of these hyperscale databases. So I think we're going to see more one-click solutions, more SaaS-based solutions, and solutions that hopefully prove faster time to value on all of these modern cloud platforms. >> Excellent. All right, Kirk Haslbeck, thanks so much for coming on theCUBE and previewing Data Citizens '22. Appreciate it. >> Thanks for having me, Dave. >> You're welcome. All right, and thank you for watching. Keep it right there for more coverage from theCUBE.
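As a toy illustration of the discovery-and-classification trend Kirk closes on, the sketch below samples column values and tags columns whose values look like a known sensitive type. The regex patterns and the 80% threshold are deliberately simplistic placeholders; production classifiers are far more sophisticated.

```python
# Toy discovery/classification pass: sample column values and tag columns
# where most values match a known sensitive pattern. Real classifiers use
# far richer models; these regexes are deliberately simplistic placeholders.
import re

PATTERNS = {
    "email":    re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "us_ssn":   re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def classify_column(values, threshold=0.8):
    """Return the sensitive type most sampled values match, if any."""
    for label, pattern in PATTERNS.items():
        hits = sum(bool(pattern.match(v)) for v in values)
        if values and hits / len(values) >= threshold:
            return label
    return None

print(classify_column(["alice@example.com", "bob@example.org"]))  # email
print(classify_column(["02134", "94305", "10001"]))               # zip_code
```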
>> Welcome to theCUBE's virtual coverage of Data Citizens 2022. My name is Dave Vellante, and I'm here with Laura Sellers, who's the Chief Product Officer at Collibra, the host of Data Citizens. Laura, welcome. Good to see you. >> Thank you. Nice to be here. >> Yeah, your keynote at Data Citizens this year focused on your mission to drive ease of use and scale. Now, historically, fast access to the right data at the right time, in a form that's easily consumable, has been kind of challenging, especially for business users. Can you explain to our audience why this matters so much, and what's actually different today in the data ecosystem to make this a reality? >> Yeah, definitely. So I think what we really need, and what I hear from customers every single day, is a new approach to data management. What inspired me to come to Collibra a little over a year ago was really the fact that they're very focused on bringing trusted data to more users, across more sources, for more use cases. And so as we look at what we're announcing with these innovations around ease of use and scale, it's really about making teams more productive in getting started with, and being able to manage, data across the entire organization. So we've been very focused on richer experiences, a broader ecosystem of partners, as well as a platform that delivers the performance, scale and security that our users and teams need and demand. So as we look at... oh, go ahead. >> I was going to say, when I look back at, like, the last 10 years, it was all about getting the technology to work, and it was just so complicated. But please, carry on. I'd love to hear more about this. >> Yeah. Collibra is a system of engagement for data, and we really are working on bringing that entire system of engagement to life, for everyone to leverage, here and now. So what we're announcing on the ease-of-use side of the world is, first, our Data Marketplace. This is the ability for all users to discover and access data quickly and easily: to shop for it, if you will. The next thing we're introducing is the new homepage. It's really about the ability to drive adoption and have users find data more quickly. And then there are two more areas on the ease-of-use side. One is our world of usage analytics. One of the big pushes and passions we have at Collibra is to help with the data-driven culture that all companies are trying to create, and also to help with data literacy. With something like usage analytics, it's really about driving adoption of the Collibra platform: understanding what's working, who's accessing it, and what's not. And then finally we're also introducing what's called Workflow Designer. We love our workflows at Collibra; it's a big differentiator to be able to automate business processes. The designer is really a way for more people to be able to create those workflows and collaborate on them, as well as for people to be able to easily interact with them. So a lot of exciting things when it comes to ease of use, to make it easier for all users to find data. >> Yes, there's definitely a lot to unpack there. You mentioned this idea of shopping for the data. That's interesting to me. Why this analogy (metaphor or analogy, I always get those confused; let's go with analogy)? Why is it so important to data consumers? >> I think when you look at the world of data, and I talked about this system of engagement, it's really about making it more accessible to the masses. And what users are used to is a shopping experience like Amazon, if you will.
And so having a consumer-grade experience, where users can quickly go in and find the data, trust that data, understand where the data's coming from, and then be able to quickly access it: that's the idea of being able to shop for it. It's about making it as simple as possible and really speeding the time to value for any of the business analysts and data analysts out there. >> Yeah. You see a lot of discussion about rethinking data architectures, putting data in the hands of the users and business people, decentralizing data, and of course that's awesome; I love it. But then you have to have self-service infrastructure, and you have to have governance, and those are really challenging. I think so many organizations are facing adoption challenges when it comes to enabling teams generally, and especially domain experts, to adopt new data technologies. The tech comes fast and furious, you've got all these open source projects, and it gets really confusing; of course it risks security, governance and all that good stuff, and you've got all this jargon. So where do you see the friction in adopting new data technologies? What's your point of view, and how can organizations overcome these challenges? >> You're dead on. There's so much technology, and there's so much to stay on top of, which is part of the friction, right? It's just being able to stay ahead of, and understand, all the technologies that are coming. You also look at how there are so many more sources of data, and people are migrating data to the cloud and to new sources. Where the friction comes in is really the ability to understand where the data came from and where it's moving to, and then also to be able to put the access controls on top of it, so people are only getting access to the data that they should be getting access to. So one of the other things we're announcing, with all of the innovations that are coming, is what we're doing around performance and scale. With all of the data movement, and all of the data that's out there, the first thing we're launching in the world of performance and scale is our world of data quality. It's something that Collibra has been working on for the past year and a half: we're launching the ability to have data quality in the cloud. It's currently an on-premise offering, but we'll now be able to carry it over into the cloud, for us to manage that way. We're also introducing the ability to push down data quality into Snowflake. Again, one of the challenges is making sure that the data you have is high quality as you move forward, and so we're just reducing friction. You already have Snowflake stood up; it's not another machine for you to manage. It's just pushdown capabilities into Snowflake, to be able to track that quality. Another thing that we're launching with that is what we call Collibra Protect. This is the ability for users to ingest metadata, understand where the PII data is, and then set policies on top of it. So you can very quickly set policies and have them enforced at the data level, so that anybody in the organization is only getting access to the data they should have access to.
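To picture what "policies enforced at the data level" might look like, here is a minimal sketch of classification-based masking: a policy grants a group access to a classification, and anything not granted comes back masked. It only mimics the shape of such a rule; the structures and function here are invented for illustration and are not Collibra Protect's actual API.

```python
# Sketch of policy enforcement at the data level: each group is granted a
# set of classifications, and any column outside that grant comes back
# masked. All names and structures are invented for illustration.
POLICIES = {
    "analysts": {"public"},
    "hr":       {"public", "pii"},
}

COLUMN_CLASSIFICATION = {"name": "pii", "email": "pii", "region": "public"}

def read_row(row: dict, group: str) -> dict:
    """Return the row with columns the group may not see masked out."""
    granted = POLICIES.get(group, set())
    return {col: (val if COLUMN_CLASSIFICATION[col] in granted else "***")
            for col, val in row.items()}

row = {"name": "Ada Lovelace", "email": "ada@example.com", "region": "EMEA"}
print(read_row(row, "analysts"))  # name and email come back masked
print(read_row(row, "hr"))        # hr is granted pii, so it sees everything
```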
>> This topic of data quality is interesting. It's something that I've followed for a number of years. It used to be a back-office function, really confined to highly regulated industries like financial services, healthcare and government. If you look back over a decade ago, you didn't have this worry about personal information; GDPR and the California Consumer Privacy Act have made it so much more important. The cloud has really changed things in terms of performance and scale, and of course, partnering with Snowflake, it's all about sharing data and monetization: anything but a back-office function. So it was kind of smart that you guys were early on there, and of course attracting them as an investor as well was very strong validation. What can you tell us about the nature of the relationship with Snowflake? I'm specifically interested in joint engineering or product innovation efforts, beyond the standard go-to-market stuff. >> Definitely. So you mentioned they became a strategic investor in Collibra about a year ago, or a little less than that, I guess. We've been working with them, though, for over a year, really tightly with their product and engineering teams, to make sure that Collibra is adding real value. All the pieces of our unified platform are touching Snowflake. And when I say that, what I mean is: first, we're able to ingest data with Snowflake, which has always existed; we're able to profile and classify that data; and we're announcing with Collibra Protect this week that you're now able to create those policies on top of Snowflake and have them enforced. So again, people can get more value out of their Snowflake more quickly, as far as time to value, with our policies available for all business users to create. >> We're also announcing Snowflake Lineage 2.0. This is the ability to take stored procedures in Snowflake and understand the lineage: where did the data come from, and how was it transformed within Snowflake, as well as the data quality pushdown. As I mentioned, data quality (you brought it up) is a big industry push, and I think Gartner mentioned that people are losing up to $15 million by not having great data quality. So this pushdown capability for Snowflake really is, again, a big ease-of-use push for us at Collibra: the ability to push the work into Snowflake, take advantage of the data source and the engine that already lives there, and make sure you have the right quality.
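Lineage 2.0, as described here, derives lineage from stored procedures rather than from declared metadata. A toy version of that idea, statically pulling the target and source tables out of SQL text, might look like the following; real lineage parsers handle full SQL grammars, so treat this regex pass as illustrative only.

```python
# Toy lineage extraction: pull target and source tables out of SQL text.
# Production lineage parsers handle complete SQL grammars; this regex pass
# only shows the idea of deriving lineage from code instead of metadata.
import re

sql = """
INSERT INTO finance.quarterly_revenue
SELECT o.region, SUM(o.amount)
FROM sales.orders o
JOIN sales.refunds r ON r.order_id = o.id
GROUP BY o.region
"""

target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.I).group(1)
sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)

print(f"{', '.join(sources)} -> {target}")
# sales.orders, sales.refunds -> finance.quarterly_revenue
```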
>> I mean, the nice thing about Snowflake is that if you play in the Snowflake sandbox, you can get a high degree of confidence that data sharing can be done in a safe way. Bringing Collibra into the story allows me to have the data quality and the governance that I need. We've said many times on theCUBE that one of the notable differences in cloud this decade versus last decade (there are obvious differences just in terms of scale and scope) is that it's shaping up to be about the strength of the ecosystems. That's really a hallmark of these big cloud players: it's a key factor for innovating, accelerating product delivery, and filling gaps in the hyperscale offerings, because you've got more mature stack capabilities, and it creates this flywheel momentum, as we often say. So my question is: how do you work with the hyperscalers, whether it's AWS or Google or whomever? What do you see as your role, and what's the Collibra sweet spot? >> Yeah, definitely. So, one of the things I mentioned early on is that the broader ecosystem of partners is what it's all about. We have that strong partnership with Snowflake. We're also doing more with Google, around GCP and Collibra Protect there, but also tighter Dataplex integration. So, similar to what you've seen with our strategic moves around Snowflake, really covering the broad ecosystem of what Collibra can do on top of that data source, we're extending that to the world of Google as well, and the world of Dataplex. We also have great partners in SIs: Infosys is somebody we spoke with at the conference who's done a lot of great work with Levi's, and they're really important in helping people with their whole data strategy and driving that data-driven culture, with Collibra at the core of it. >> Laura, we're going to end it there, but I wonder if you could put a bow on this year, the event, and your perspectives. Just give us your closing thoughts. >> Yeah, definitely. I want to say this is one of the biggest releases Collibra's ever had; definitely the biggest one since I've been with the company, a little over a year. We have all these great new product innovations coming, to really drive ease of use and to make data more valuable for users everywhere and companies everywhere. And so it's all about everybody being able to easily find, understand, trust and get access to that data going forward. >> Well, congratulations on all the progress. It was great to have you on theCUBE for the first time, I believe, and I really appreciate you taking the time with us. >> Yes, thank you for your time. >> You're very welcome. Okay, you're watching the coverage of Data Citizens 2022 on theCUBE, your leader in enterprise and emerging tech coverage. >> So, data modernization oftentimes means moving some of your storage and compute to the cloud, where you get the benefit of scale and security and so on. But ultimately it doesn't take away the silos that you have. We have more locations, more tools and more processes with which we try to get value from this data. To do that at scale in an organization, the people involved in this process have to understand each other. So you need to unite those people across those tools, processes and systems with a shared language. When I say "customer," do you understand the same thing as when you're hearing "customer"? Are we counting them in the same way? That shared language unites us, and it gives the organization as a whole the opportunity to get the maximum value out of its data assets. Then it can democratize data, so everyone can properly use that shared language to find, understand and trust the data assets that are available. >> And that's where Collibra comes in. We provide a centralized system of engagement that works across all of those locations and combines all of those different user types across the whole business. At Collibra, we say we're United by Data, and that also means that we're united by data with our customers. So here is some data about some of our customers. There was the case of an online do-it-yourself platform that grew its revenue almost three times from a marketing campaign that put the right product in the hands of the right people.
Another case that comes to mind is a financial services organization that saved over $800K every year, because they were able to reuse the same data in different kinds of reports. Before, that data was spread out over different tools, processes and silos; the platform brought them together, so they realized, "Oh, we're actually using the same data; let's find a way to make this more efficient." And the last example that comes to mind is a large home mortgage loan provider with a very complex landscape, a very complex architecture, legacy systems, the cloud, et cetera. They're using our software, our platform, to unite all the people, processes and tools, to get a common view of data and manage their compliance at scale. >> Hey everyone, I'm Lisa Martin covering Data Citizens '22, brought to you by Collibra. This next conversation is going to focus on the importance of data culture. One of our CUBE alumni is back: Stan Christiaens is Collibra's co-founder and its Chief Data Citizen. Stan, it's great to have you back on theCUBE. >> Hey Lisa, nice to be here. >> So we're going to be talking about the importance of data culture, data intelligence, maturity, all those great things. When we think about the data revolution that every business is going through, it's so much more than technology innovation; it also really requires cultural transformation and community transformation, and those are challenging for customers to undertake. Talk to us about what you mean by data citizenship, and the role that creating a data culture plays in that journey. >> Right. So as you know, our event is called Data Citizens, because we believe that, in the end, a data citizen is anyone who uses data to do their job. And in today's organizations, most of the employees are somehow going to be a data citizen, right? So you need to make sure that these people are aware of it, and that they have the skills and competencies to do what's necessary with data. So what does it mean to have a good data culture? It means that if you're building a beautiful dashboard to try and convince your boss that we need to make this decision, your boss is also open to, and able to, interpret the data presented in the dashboard, to actually make that decision and take that action. And once you have that throughout the organization, that's when you have a good data culture. Now, that's a continuous effort for most organizations, because they're always moving; somehow they're always hiring new people. And it has to be a continuous effort, because we've seen that, on the one hand, organizations are continuously challenged to keep track of their data sources and where all the data is flowing, which in itself creates a lot of risk. But on the other hand of the equation, you have the benefits. You might look at regulatory drivers, as in "we have to do this," but it's much better right now to consider the competitive drivers, for example. And we did an IDC study earlier this year, quite interesting; I can recommend it to anyone. One of the conclusions they found, as they surveyed over a thousand people across organizations worldwide, is that the ones who are higher in maturity,
the organizations that really look at data as an asset, look at data as a product, and actively try to be better at it, have three times as good a business outcome as the ones who are lower on the maturity scale, right? So you can say: I'm doing this data culture work for everyone, waking them up as data citizens, and I'm doing it for competitive reasons. You're trying to bring both of those together, and the ones that get data intelligence right are successful and competitive. That's what we're seeing out there in the market. >> Absolutely. We know, Stan, that organizations that are really creating a data culture, and enabling everybody within the organization to become data citizens, are, we know in theory, more competitive and more successful. But the IDC study that you just mentioned demonstrates they're three times more successful and competitive than their peers. Talk about how Collibra advises customers to create that community, that culture of data, when it might be challenging for an organization to adapt culturally. >> Of course it's difficult for an organization to adapt, but it's also necessary, as you just said. Imagine that you're a modern-day organization and you're not using laptops, right? Or you're delivering them throughout the organization, but not enabling your colleagues to actually do something with that asset. The same thing is true with data today: if you're not properly using the data asset, and competitors are, they're going to get more of an advantage. So, as to how you get this done and establish this, there are a couple of angles to look at, Lisa. One angle is obviously the leadership: whoever is the boss of data in the organization. You typically have multiple bosses there, like chief data officers; sometimes there are multiple, and they may have different titles, so I'm just going to summarize them as a data leader for a second. >> So whoever that is, they need to make sure that there's a clear vision and a clear strategy for data. And that strategy needs to include the monetization aspect: how are you going to get value from data? Now, that's one part, because then you have leadership in the organization aligned around the business value. And that's important, because those people's job, in essence, really is to make everyone in the organization think about data as an asset. And I think the second part of the equation of getting it right is that it's not enough to just have that leadership out there; you also have to win the hearts and minds of the data champions across the organization. You really have to win them over. And if you have those two combined, and obviously a good technology to connect those people and have them execute on their responsibilities, such as a data intelligence platform like ours, then the pieces are in place to really start upgrading that culture, inch by inch, if you will. >> Yes, I like that: the recipe for success. So, you are the co-founder of Collibra; you've worn many different hats along this journey, and now you're building Collibra's own data office. I like how, before we went live, we were talking about Collibra drinking its own champagne. I always love to hear stories about that. You're speaking at Data Citizens 2022.
Talk to us about how you are building a data culture within Collibra, and what maybe some of the specific projects are that Collibra's data office is working on. >> Yes, and it is indeed Data Citizens. There are a ton of speakers here; I'm very excited. You know, we have Barb from MIT speaking about data monetization; we have Dilla at the last minute. So, a really exciting agenda; I can't wait to get back out there, essentially. So, over the years at Collibra, and we've been doing this since 2008, so a good number of years, I think we have another decade of work ahead in the market, just to be very clear. Data is here to stick around, as are we. And myself, you know, when you start a company, we were four people, so everybody's wearing all sorts of hats at that time. But over the years I've run presales, sales, partnerships, product, et cetera. And as our company got a little bit bigger (we're now a thousand-something people in the company), systems and processes become a lot more important. So we said: Collibra is now the size of our customers; we're getting there in terms of organization structure, processes, systems, et cetera. So it's really time for us to put our money where our mouth is and set up our own data office, which is what we were seeing at customer organizations worldwide. Organizations have HR units, they have a finance unit, and over time they'll all have a data department, if you will, that is responsible somehow for the data. So we said, okay, let's try to set an example that other people can take away from. We set up a data strategy, we started building data products, we took care of the data infrastructure, all that good stuff. And in doing all of that, Lisa, exactly as you said, we said: we need to also use our own product and our own practices, and from that use, learn how we can make the product better, learn how we can make the practice better, and share that learning with the world. On Monday mornings, we sometimes refer to that as eating our own dog food;
on Friday evenings, we refer to it as drinking our own champagne. >> I like it. >> So we had a driver to do this; there's a clear business reason. So we included that in the data strategy, and that's a little bit of our origin. Now, how do we organize this? We have three pillars, and by no means is this a template that everyone should follow; this is just the organization that works at our company, but it can serve as an inspiration. So we have a pillar which is data science: the data product builders, if you will, or the people who help the business build data products. We have the data engineers, who help keep the lights on for that data platform, to make sure that the data products can run, the data can flow, and the quality can be checked. And then we have the data intelligence, or data governance, builders: the data governance and data intelligence stakeholders who help the business as a sort of data partner to the business stakeholders. So that's how we've organized it. And then we started following the Collibra approach, which is: well, what are the challenges that our business stakeholders have, in HR, finance, sales, marketing, all over? And how can data help overcome those challenges? From those use cases, we then just started to build a map and started executing use case by use case. And the important ones are very simple; we see them with our customers as well. People talk about the catalog, right? The catalog for the data scientists, so they know what's in their data lake, for example; and for the people in privacy, so they have their process registry and can see how the data flows. >> So that's a starting place, and it turns into a marketplace, so that when new analysts and data citizens join Collibra, they immediately have a place to go to, to look at and see: okay, what data is out there for me, as an analyst or a data scientist or whatever, to do my job, right? So they can immediately get access to data. Another one that we do is around trusted business reporting. We're seeing that, ever since self-service BI allowed everyone to make beautiful dashboards with pie charts (my pet peeve is the pie chart, because I love pie and you shouldn't always be using pie charts), there's been a proliferation of those reports. And now executives don't really know: okay, should I trust this report or that report? They're reporting on the same thing, but the numbers seem different, right? So that's why we have trusted business reporting: so that when a dashboard, a data product essentially, is built, we know that all the right steps are being followed, and whoever is consuming it can be quite confident in the result. >> Right, okay. >> Exactly, yes. >> Absolutely. Talk a little bit about some of the key performance indicators that you're using to measure the success of the data office. What are some of those KPIs? >> KPIs and measurement are a big topic in the chief data officer profession, I would say, and again, it always varies with your organization, but there are a few that we use that might be of interest. Use those pillars, right? We have metrics across those pillars. So, for example, a pillar on the data engineering side is going to be more related to uptime: is the data platform up and running? Are the data products up and running? Is the quality in them good enough? Is it going up? Is it going down? What's the usage? But also, especially if you're in the cloud and consumption's a big thing, you have metrics around cost, for example. So that's one set of examples. Another one is around the data science and products: are people using them? Are they getting value from them? Can we calculate that value in monetary terms, so that we can continue to say to the rest of the business: we're tracking all those numbers, those numbers indicate that value is being generated, and this is roughly how much? And then you have some data intelligence and data governance metrics. For example, you have a number of domains in a data mesh; people talk about being the owner of a data domain, for example, like product or customer. So how many of those domains do you have covered? How many of them are already part of the program? How many of them have owners assigned? How well are these owners organized and executing on their responsibilities? How many tickets are open and closed? How many data products are built according to process? And so on and so forth. So those are a set of examples of KPIs. There are a lot more, but hopefully those can already inspire the audience.
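The governance KPIs Stan lists (domain coverage, owner assignment, ticket throughput) lend themselves to a simple scorecard. A minimal sketch follows; the domain records and field names are invented for illustration.

```python
# Sketch of a data-office scorecard over the governance KPIs described
# above: domain coverage, owner assignment, and ticket throughput.
# The domain records below are hypothetical examples.
domains = [
    {"name": "customer", "in_program": True,  "owner": "jane"},
    {"name": "product",  "in_program": True,  "owner": None},
    {"name": "finance",  "in_program": False, "owner": None},
]
tickets = {"open": 12, "closed": 48}

covered = sum(d["in_program"] for d in domains) / len(domains)
owned = sum(d["owner"] is not None for d in domains) / len(domains)
closure = tickets["closed"] / (tickets["open"] + tickets["closed"])

print(f"domain coverage: {covered:.0%}, owners assigned: {owned:.0%}, "
      f"ticket closure: {closure:.0%}")
# domain coverage: 67%, owners assigned: 33%, ticket closure: 80%
```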
>> Absolutely. So, we've talked about the rise of chief data offices, and it's only accelerating. You mentioned this is like a 10-year journey. So if you were to look into a crystal ball, what do you see in terms of the maturation of data offices over the next decade? >> So, we've seen the role sort of grow up. I think in 2010 there may have been something like ten chief data officers; Gartner has exact numbers on them. But then the role grew across industries, and the number is estimated to be about 20,000 right now. >> Wow. >> And they evolved through a sort of stack of competencies: defensive data strategy, because the first chief data officers were more regulatory driven; then offensive data strategy; then support for the digital program; and now it's all about data products, right? So as a data leader, you now need all of those competencies, and you need to include them in your strategy. How is that going to evolve over the next couple of years? I wish I had one of those crystal balls, right? But essentially, I think for the next couple of years there are going to be a lot of people still moving along those four levels of the stack; a lot of the people I see are still in version one and version two of the chief data office. So you'll see that evolve over the years toward more digital and more data products. For the next years, my prediction is that it's all about products, because that's an immediate link between the data and the value, essentially. So that's going to be important, and quite likely some new things will be added on which nobody can predict yet; we'll see those pop up in a few years. I also think there's going to be a continued challenge for the chief data officer role to become a real executive role, as opposed to somebody who claims to be an executive but then isn't, right? >> So the real reporting level, into the board or into the CEO, for example, will continue to be a challenging point. But the ones who do get that done will be the ones that are successful, and the ones who get it done will do it on the basis of data monetization: connecting value to the data and making that value clear to all the data citizens in the organization. And in that sense, they'll need to have both technical and non-technical audiences aligned, of course, and they'll need to focus on adoption. Again, it's not enough to just have your data office involved in this; it's really important that you're waking up data citizens across the organization and making everyone think about data as an asset. >> Absolutely, because there's so much value that can be extracted when organizations really strategically build that data office and democratize access across all those data citizens. Stan, this is an exciting arena, and we're definitely going to keep our eyes on it. It sounds like a lot of evolution and maturation is coming, from the data office perspective and from the data citizen perspective. And as the data show, in that IDC study you mentioned (you mentioned Gartner as well), those organizations have so much more likelihood of being successful and competitive. So we're going to watch this space. Stan, thank you so much for joining me on theCUBE at Data Citizens '22. We appreciate it. >> Thanks for having me over. >> From Data Citizens '22, I'm Lisa Martin. You're watching theCUBE, the leader in live tech coverage. >> Okay, this concludes our coverage of Data Citizens 2022, brought to you by Collibra. Remember, all these videos are available on demand at thecube.net. And don't forget to check out siliconangle.com for all the news, and wikibon.com for our weekly Breaking Analysis series, where we cover many data topics and share survey research from our partner ETR, Enterprise Technology Research.
If you want more information on the products announced at Data Citizens, go to collibra.com. There are tons of resources there; you'll find analyst reports and product demos. It's really worthwhile to check those out. Thanks for watching our program and digging into Data Citizens 2022 on theCUBE, your leader in enterprise and emerging tech coverage. We'll see you soon.
Kirk Haslbeck, Collibra | Data Citizens '22
(bright upbeat music) >> Welcome to theCUBE's Coverage of Data Citizens 2022 Collibra's Customer event. My name is Dave Vellante. With us is Kirk Hasselbeck, who's the Vice President of Data Quality of Collibra. Kirk, good to see you. Welcome. >> Thanks for having me, Dave. Excited to be here. >> You bet. Okay, we're going to discuss data quality, observability. It's a hot trend right now. You founded a data quality company, OwlDQ and it was acquired by Collibra last year. Congratulations! And now you lead data quality at Collibra. So we're hearing a lot about data quality right now. Why is it such a priority? Take us through your thoughts on that. >> Yeah, absolutely. It's definitely exciting times for data quality which you're right, has been around for a long time. So why now, and why is it so much more exciting than it used to be? I think it's a bit stale, but we all know that companies use more data than ever before and the variety has changed and the volume has grown. And while I think that remains true, there are a couple other hidden factors at play that everyone's so interested in as to why this is becoming so important now. And I guess you could kind of break this down simply and think about if Dave, you and I were going to build, you know a new healthcare application and monitor the heartbeat of individuals, imagine if we get that wrong, what the ramifications could be? What those incidents would look like? Or maybe better yet, we try to build a new trading algorithm with a crossover strategy where the 50 day crosses the 10 day average. And imagine if the data underlying the inputs to that is incorrect. We'll probably have major financial ramifications in that sense. So, it kind of starts there where everybody's realizing that we're all data companies and if we are using bad data, we're likely making incorrect business decisions. But I think there's kind of two other things at play. I bought a car not too long ago and my dad called and said, "How many cylinders does it have?" And I realized in that moment, I might have failed him because 'cause I didn't know. And I used to ask those types of questions about any lock brakes and cylinders and if it's manual or automatic and I realized I now just buy a car that I hope works. And it's so complicated with all the computer chips. I really don't know that much about it. And that's what's happening with data. We're just loading so much of it. And it's so complex that the way companies consume them in the IT function is that they bring in a lot of data and then they syndicate it out to the business. And it turns out that the individuals loading and consuming all of this data for the company actually may not know that much about the data itself and that's not even their job anymore. So, we'll talk more about that in a minute but that's really what's setting the foreground for this observability play and why everybody's so interested, it's because we're becoming less close to the intricacies of the data and we just expect it to always be there and be correct. >> You know, the other thing too about data quality and for years we did the MIT CDOIQ event we didn't do it last year at COVID, messed everything up. But the observation I would make there love thoughts is it data quality used to be information quality used to be this back office function, and then it became sort of front office with financial services and government and healthcare, these highly regulated industries. 
And then the whole chief data officer thing happened and people were realizing, well, they sort of flipped the bit from sort of a data as a a risk to data as an asset. And now, as we say, we're going to talk about observability. And so it's really become front and center, just the whole quality issue because data's fundamental, hasn't it? >> Yeah, absolutely. I mean, let's imagine we pull up our phones right now and I go to my favorite stock ticker app and I check out the NASDAQ market cap. I really have no idea if that's the correct number. I know it's a number, it looks large, it's in a numeric field. And that's kind of what's going on. There's so many numbers and they're coming from all of these different sources and data providers and they're getting consumed and passed along. But there isn't really a way to tactically put controls on every number and metric across every field we plan to monitor. But with the scale that we've achieved in early days, even before Collibra. And what's been so exciting is we have these types of observation techniques, these data monitors that can actually track past performance of every field at scale. And why that's so interesting and why I think the CDO is listening right intently nowadays to this topic is so maybe we could surface all of these problems with the right solution of data observability and with the right scale and then just be alerted on breaking trends. So we're sort of shifting away from this world of must write a condition and then when that condition breaks, that was always known as a break record. But what about breaking trends and root cause analysis? And is it possible to do that, with less human intervention? And so I think most people are seeing now that it's going to have to be a software tool and a computer system. It's not ever going to be based on one or two domain experts anymore. >> So, how does data observability relate to data quality? Are they sort of two sides of the same coin? Are they cousins? What's your perspective on that? >> Yeah, it's super interesting. It's an emerging market. So the language is changing a lot of the topic and areas changing the way that I like to say it or break it down because the lingo is constantly moving as a target on this space is really breaking records versus breaking trends. And I could write a condition when this thing happens it's wrong and when it doesn't, it's correct. Or I could look for a trend and I'll give you a good example. Everybody's talking about fresh data and stale data and why would that matter? Well, if your data never arrived or only part of it arrived or didn't arrive on time, it's likely stale and there will not be a condition that you could write that would show you all the good and the bads. That was kind of your traditional approach of data quality break records. But your modern day approach is you lost a significant portion of your data, or it did not arrive on time to make that decision accurately on time. And that's a hidden concern. Some people call this freshness, we call it stale data but it all points to the same idea of the thing that you're observing may not be a data quality condition anymore. It may be a breakdown in the data pipeline. And with thousands of data pipelines in play for every company out there there, there's more than a couple of these happening every day. 
>> So what's the Collibra angle on all this? You made the acquisition, you've got data quality and observability coming together, you guys have a lot of expertise in this area. But you hear about provenance of data, you just talked about stale data, the whole trend toward real time. How is Collibra approaching the problem, and what's unique about your approach? >> Well, I think where we're fortunate is our background. Myself and the team sort of lived this problem for a long time in the Wall Street days, about a decade ago, and we saw it from many different angles. And what we came up with, before it was called data observability or reliability, was basically the underpinnings of that. So we're a little bit ahead of the curve there when most people evaluate our solution. It's more advanced than some of the observation techniques that currently exist. But we've also always covered data quality, and we believe that people want to know more, they need more insights, and they want to see break records and breaking trends together so they can correlate the root cause. We hear that all the time: I have so many things going wrong, just show me the big picture. Help me find the thing that, if I were to fix it today, would make the most impact. So we're really focused on root cause analysis and business impact, connecting it with lineage and catalog metadata. And as that grows, you can actually achieve total data governance. At this point, with the acquisition of what was a lineage company years ago, and then my company OwlDQ, now Collibra Data Quality, Collibra may be the best positioned for total data governance and intelligence in the space. >> Well, you mentioned financial services a couple of times, and some examples. Remember the flash crash in 2010? Nobody had any idea what that was, they just said, "Oh, it's a glitch." They didn't understand the root cause of it. So this is a really interesting topic to me. So we know at Data Citizens '22 that you're announcing, you've got to announce new products, right? It's your yearly event. What's new? Give us a sense as to what products are coming out, specifically around data quality and observability. >> Absolutely. There's always a next thing on the forefront, and the one right now is the hyperscalers in the cloud. So you have platforms like Snowflake and BigQuery, and Databricks with Delta Lake, and SQL pushdown. And ultimately what that means is a lot of people are storing and loading data even faster, in a SaaS-like model. And we've started to hook in to these databases. And while we've always worked with those same databases in the past, they're supported today, we're now doing something called native database pushdown, where the entire compute and data activity happens in the database. And why that is so interesting and powerful now is that everyone's concerned with something called egress. Did my data, that I've spent all this time and money with my security team securing, ever leave my hands? Did it ever leave my secure VPC, as they call it? With these native integrations that we're building, and about to unveil here as a sneak peek for next week at Data Citizens, we're now doing all compute and data operations in databases like Snowflake. And what that means is, with no install and no configuration, you could log into the Collibra Data Quality app and have all of your data quality running inside the database that you've probably already picked as your go-forward, secured database of choice. So we're really excited about that.
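As a rough sketch of the pushdown idea, and not Collibra's actual implementation, the snippet below uses Snowflake's Python connector to compute quality metrics with SQL inside the warehouse, so only a few aggregate values cross the network and the underlying rows never leave the customer's environment. The credentials, table, and column names are placeholders.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials; in practice these come from a secrets manager.
conn = snowflake.connector.connect(
    account="my_account", user="dq_service", password="***",
    warehouse="DQ_WH", database="SALES", schema="PUBLIC",
)

# All heavy lifting is pushed down into the warehouse: the database scans
# the table, and only three aggregate values are returned to the caller.
PUSHDOWN_SQL = """
SELECT
    COUNT(*)                                        AS row_count,
    AVG(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_rate,
    MAX(updated_at)                                 AS last_arrival
FROM orders
"""

cur = conn.cursor()
try:
    cur.execute(PUSHDOWN_SQL)
    row_count, null_rate, last_arrival = cur.fetchone()
    print(f"rows={row_count}, null_rate={null_rate:.4f}, freshest={last_arrival}")
finally:
    cur.close()
    conn.close()
```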
And I think if you look at the whole landscape of network cost, egress cost, data storage, and compute, what people are realizing is that it's extremely efficient to do it in the way we're about to release here next week. >> So this is interesting, because what you just described, you mentioned Snowflake, you mentioned Google, and actually you mentioned Databricks. Snowflake has the data cloud. If you put everything in the data cloud, okay, you're cool. But then Google's got the open data cloud, if you heard that at Google Next. And Databricks doesn't call it the data cloud, but they have, like, the open source data cloud. So you have all these different approaches, and there's really no way, up until now I'm hearing, to really understand the relationships between all of them and have confidence across them. It's like, (indistinct) you should just be a node on the mesh. And I don't care if it's a data warehouse or a data lake or where it comes from; it's a point on that mesh, and I need tooling to be able to have confidence that my data is governed and has the proper lineage and provenance. And that's what you're bringing to the table. Is that right? Did I get that right? >> Yeah, that's right. And for us, it's not that we haven't been working with those great cloud databases; it's the fact that we can send them the instructions now. We can send them the operating ability to crunch all of the calculations, the governance, the quality, and get the answers. And what that's doing is basically zero network cost, zero egress cost, zero latency of time. And so when you log into BigQuery tomorrow using our tool, or, let's say, Snowflake, for example, you have instant data quality metrics, instant profiling, instant lineage, and access privacy controls, things of that nature that just become less onerous. What we're seeing is there's so much technology out there, just like all of the major brands that you mentioned, but how do we make it easier? The future is about fewer clicks, faster time to value, faster scale, and eventually lower cost. And we think that this positions us to be the leader there. >> I love this example, because everyone talks about how the cloud guys are going to own the world, and of course now we're seeing that the ecosystem is finding so much white space to add value and connect across clouds. Sometimes we call it supercloud, or interclouding. All right, Kirk, give us your final thoughts on the trends that we've talked about and Data Citizens '22. >> Absolutely. Well, I think one big trend is discovery and classification. Seeing that across the board: people used to know if it was a zip code, and nowadays, with the amount of data that's out there, they want to know where everything is, where their sensitive data is, whether it's redundant, tell me everything, inside of three to five seconds. And with that, they want to know how fast they can get controls and insights out of their tools in all of these hyperscale databases. So I think we're going to see more one-click solutions, more SaaS-based solutions, and solutions that hopefully prove faster time to value on all of these modern cloud platforms. >> Excellent, all right. Kirk Hasselbeck, thanks so much for coming on theCUBE and previewing Data Citizens '22. Appreciate it. >> Thanks for having me, Dave. >> You're welcome. All right, and thank you for watching. Keep it right there for more coverage from theCUBE.
Jack Andersen & Joel Minnick, Databricks | AWS Marketplace Seller Conference 2022
(upbeat music) >> Welcome back, everyone, to theCUBE's coverage here in Seattle, Washington, for AWS's Marketplace Seller Conference. It's the big news within the Amazon partner network, combining with marketplace, forming the Amazon partner organization. Part of a big reorg as they grow to the next level, NextGen cloud, mid-game on the chessboard. The Cube's got it covered. I'm John Furrier, your host at theCUBE. Great guests here from Databricks, both Cube alumni: Jack Andersen, GM and VP of the Databricks partnership team for AWS, you handle that relationship, and Joel Minnick, Vice President of product and partner marketing. You guys have the keys to the kingdom with Databricks and AWS. Thanks for joining. Good to see you again. >> Thanks for having us back. >> Yeah, John, great to be here. >> So I feel like we're at Reinvent 2013. Small event, no stage, but there's a real shift happening with procurement. Obviously it's a no-brainer on the micro, you know, people should be buying online, self-service, cloud scale. But Amazon's got billions being sold through their marketplace. They've reorganized their partner network. You can see kind of what's going on. They've kind of figured it out: let's put everything together, simplify, and make it less of a website and more of a marketplace. Merge our partner organizations, have more synergy and frictionless experiences, so everyone can make more money and customers are going to be happier. >> Yeah, that's right. >> I mean, you run the relationship. You're in the middle of it. >> Well, Amazon's mental model here is that they want the world's best ISVs to operate on AWS so that we can collaborate and co-architect on behalf of customers. And that's exactly what the APO and marketplace allow us to do: work with Amazon on these really, you know, unique use cases. >> You know, I interviewed Ali many times over the years. I remember many years ago, maybe six, seven years ago, we were talking. He's like, "we're all in on AWS." Obviously now, with the success of Databricks, you've got multiple clouds. Customers have choice. But I remember the strategy early on. It was like, we're going to go deep. So this speaks volumes to the relationship you have. Years. Jack, take us through the relationship that Databricks has with AWS from a partner perspective, and Joel, from a product perspective. Because it's not like you guys are Johnny-come-latelies, new to the scene. >> Right. >> You've been there, almost present at the creation of this wave. What's the relationship, and how does it relate to what's going on today? >> So most people may not know that Databricks was born on AWS. We actually did our first $100 million of revenue on Amazon. And today we're obviously available on multiple clouds, but we're very fond of our Amazon relationship. And when you look at what the APN allows us to do, you know, we're able to expand our reach and co-sell with Amazon, and marketplace broadens our reach. And so we think of marketplace in three different aspects. We've got the marketplace private offer business, which we've been doing for a number of years. Matter of fact, we're driving well over a hundred percent year-over-year growth in private offers, and we have a nine-figure business. So it's a very significant business. And when a customer uses a private offer, that private offer counts against their private pricing agreement with AWS. So they get pricing power against their private pricing. So it's really important; it goes on their Amazon bill.
In May we launched our pay-as-you-go, on-demand offering. And in five short months, we have well over a thousand subscribers. And what this does is really reduce the barriers to entry. It's low friction. So anybody in an enterprise or startup or public sector company can start to use Databricks on AWS, in a consumption-based model, and have it go against their monthly bill. And so we see customers, you know, doing rapid experimentation, pilots, POCs. They're really learning the value of that first use case, and then we see rapid use case expansion. And the third aspect is the consulting partner private offer, CPPO. Super important in how we involve our partner ecosystem of consulting partners and resellers that are able to work with Databricks on behalf of customers. >> So you've got the big contracts with the private offer. You've got the product-market fit, people iterating with data, coming in with the buyers you get. And obviously the integration piece all fitting in there. >> Exactly. >> Okay, so those are the offers. That's current, what's in marketplace today. Is that the products... What are people buying? >> Yeah. >> I mean, I guess what's the... Joel, what are people buying in the marketplace? And what does it mean for them? >> So fundamentally, what they're buying is the ability to take silos out of their organization. And that is the problem that Databricks is out there to solve, which is: when you look across your data landscape today, you've got unstructured data, structured data, and real-time streaming data, and your teams are trying to use all of this data to solve really complicated problems. And as Databricks, as the Lakehouse company, what we're helping customers do is get into the new world, move to a place where they can use all of that data across all of their teams. And so we allow them to begin to find, through the marketplace, those rapid adoption use cases where they can get rid of the data warehousing and data lake silos they've had in the past. Get their unstructured and structured data onto one data platform, an open data platform that is not beholden to any proprietary formats and standards, and something they can very easily integrate into the rest of their data environment. Apply one common data governance layer on top of that, so that from the time they ingest that data, to the time they use that data, to the time they share that data, inside and outside of their organization, they know exactly how it's flowing. They know where it came from. They know who's using it. They know who has access to it. They know how it's changing. And then with that common data platform and that common governance solution, they're able to bring all of those use cases together, across their real-time streaming, their data engineering, their BI, their AI, all of their teams working on one set of data. And that lets them move really, really fast. It also lets them solve challenges they just couldn't solve before. A good example of this: one of the world's now largest data streaming platforms runs on Databricks with AWS. And if you think about what it takes to set that up, well, they've got all this customer data that was historically inside of data warehouses, that they have to understand who their customers are. They have all this unstructured data they've built their data science models on, so they can do the right kinds of recommendation engines and forecasting.
And then they've got all this streaming data going back and forth: click-stream data from what the customers are doing with their platform, and the recommendations they want to push back out. And if those teams were all working in individual silos, building these kinds of platforms would be extraordinarily slow and complex. But by building it on Databricks, they were able to release it in record time and have grown at a record pace to now be the number one platform. >> And this is impacting product development. >> Absolutely. >> I mean, this is like the difference between product development lagging months and taking, like, days. >> Yes. >> Pretty much what you're getting at. >> Yes. >> So total agility. >> Mm-hmm. >> I got that. Okay, now, I'm a customer, I want to buy in the marketplace, but you've got a direct sales force up there too. So how do you guys look at this? Is there channel conflict? Are there comp programs? Because one of the things I heard today on the stage from AWS's leadership, Chris was up there speaking, and Mona, was the, hey, it's a CRO, chief revenue officer, conversation. Which means someone's getting compensated. So, if I'm the sales rep at Databricks, what's my motion to the customer? Do I get paid? Does Amazon sell it? Take us through that. Is there channel conflict? Or how do you handle it? >> Well, I'd add to what Joel just talked about with the value of the solution: our entire offering is available on AWS Marketplace. So it's not a subset, it's the entire Databricks offering. And- >> The flagship, all the top stuff. >> Everything, the flagship, the complete offering. So it's not segmented. It's not a sub-segment. >> Okay. >> You know, you can use all of our different offerings. Now, when it comes to seller compensation, we view this in two different ways, right? One is that AWS is also incented, versus selling a native service, to recommend Databricks for the right situation. Same thing with Databricks: our sales force wants to do the right thing for the customer, if the customer wants to use marketplace as their procurement vehicle. And that really helps customers, because if you get Databricks and five other ISVs together, and let's say you're spending a million dollars with each ISV, you have $5 million of spend. You put that spend through the flywheel with AWS Marketplace, and then you can use that in your negotiations with AWS to get better pricing overall. So that's how we view it. >> So customers are driving this, it sounds like. >> Correct. For sure. >> So they're looking at this as saying, hey, I'm going to get purchasing power with all my relationships. Because it's a solution architecture market, right? >> Yeah, it makes sense. Because most customers will have a primary and secondary cloud provider. If they can consolidate, you know, multiple ISV spend through that same primary provider, you get pricing power. >> Okay, Joel, we're going to date ourselves. At least I will. So back in the old days, (group laughter) it used to be, you'd do a Barney deal with someone: hey, let's go to market together. You've got to get paper, you do a biz dev deal. And then you've got to say, okay, now let's coordinate our sales teams, a lot of moving parts. So what you're getting at here is that the alternative for Databricks, or any company, is to go find those partners and do deals, versus now Amazon is the center point for the customer.
So you can still do those joint deals, but this seems to be flipping the script a little bit. >> Well, it is, but we still have VARs and consulting partners that are doing implementation work, very valuable work, advisory work, and that can actually work with marketplace through the CPPO offering. So the marketplace allows multiple ways to procure your solution. >> So it doesn't change your business structure. It just makes it more efficient. >> That's correct. >> That's a great way to say it. >> Yeah, that's great. >> Okay. So that's it. It just makes it more efficient. So you guys are actually incented to point customers to the marketplace. >> Yes. >> Absolutely. >> Economically. >> Economically, it's the right thing to do for the customer. It's the right thing to do for our relationship with Amazon, especially when it comes back to co-selling, right? Because Amazon now is leaning in with ISVs and making recommendations for, you know, an ISV solution, and our teams are working backwards from those use cases, you know, to collaborate and land them. >> Yeah. I want to get that out there. Go ahead, Joel. >> So one of the other things I might add to that, you know, on why this is advantageous for companies like Databricks to work through the marketplace, is it makes it so much easier for customers to deploy a solution. It's literally one click through the marketplace to get Databricks stood up inside of your environment. And so if you're looking at how to help customers most rapidly adopt these solutions in the AWS cloud, the marketplace is a fantastic accelerator for that. >> You know, it's interesting. I want to bring this up and get your reaction to it, because to me, I think this is the future of procurement. So from a procurement standpoint, I mean, again, dating myself, EDI back in the old days, you know, all that craziness. Now this is all the internet, basically through the console. I get the infrastructure side, you know, spin up and provision some servers; that's all been good. You guys have played well there in the marketplace. But now, as we get into more of what I call the business apps, and they brought this up on stage, a little nuance: most enterprises aren't there yet on integrating tech, on the business apps side, into the stack. This is where I think you guys are a use case of success, where you guys have been successful with data integration. It's an integrator's dilemma, not an innovator's dilemma. So like, I want to integrate. So now I have integration points with Databricks, but I want to put an app in there. I want to provision an application, but it has to be built. You don't buy it. You've got to build stuff. And this is the nuance. What's your reaction to that? Am I getting this right? Or am I off? Because no one's going to be buying software like they used to. They buy software to integrate it. >> Yeah, no- >> Because everything's integrated. >> I think AWS has done a great job at creating a partner ecosystem, right, to give customers the right tools for the right jobs. And those might be with third parties. Databricks is doing the same thing with our Partner Connect program, right? We've got partners like Fivetran and dbt that, you know, augment and enhance our platform. And so you're looking at multi-ISV architectures, and all of that can be procured through the AWS Marketplace. >> Yeah. It's almost like, you know, bundling and unbundling. I was talking about this with Dave Vellante about Supercloud.
Which is: why wouldn't a customer want the best-in-class solution in their architecture? Period. If someone's got API security or an API gateway, well, you know, I don't want to be forced to buy something because it's part of a suite. And that's where you see things get suboptimized, where someone dominates a category and says, oh, you've got to buy my version of this. >> Joel and I were talking, we were actually saying, what's really important about Databricks is that customers control the data, right? You want to comment on that? >> Yeah. I was going to say, you know, what you're pushing on there, we think, is the way the market is going to go. Customers want a lot of control over how they build their data stack, and everyone's unique in what tools are the right ones for them. And so one of the places where, philosophically, I think Databricks and AWS have really lined up is that we both take the approach that you should be able to have maximum flexibility on the platform. And as we think about the Lakehouse, one thing we've always been extremely committed to, as a company, is building the data platform on an open foundation. And we do that primarily through Delta Lake, and making sure that, to Jack's point, with Databricks, the data is always in your control, and it's always stored in a completely open format. And that is one of the things that's allowed Databricks to have the breadth of integrations that it has with all the other data tools out there: you're not tied into any proprietary format, but instead are able to take advantage of all the innovation that's happening out there in the open source ecosystem.
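For a sense of what that open foundation looks like in code, here is a minimal PySpark sketch assuming a Spark session configured with the open source Delta Lake package; the path and schema are hypothetical. The data lands in the open Delta format on storage the customer controls, readable by any Delta-compatible engine rather than only Databricks.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is available
# (e.g., pip install delta-spark, or a Databricks runtime).
spark = (
    SparkSession.builder.appName("open-format-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a hypothetical events table in the open Delta format on storage you control.
events = spark.createDataFrame(
    [(1, "click"), (2, "purchase")], ["user_id", "action"]
)
events.write.format("delta").mode("overwrite").save("/tmp/lakehouse/events")

# Any Delta-compatible engine can read it back; no proprietary lock-in.
spark.read.format("delta").load("/tmp/lakehouse/events").show()
```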
And so by investing in open ecosystems, that means you are always going to be at the forefront of what is the latest. >> You know, not to date myself again, but you look back at the eighties and nineties: the protocol stacks were proprietary. >> Yeah. >> You know, SNA was IBM, DECnet was Digital. You know the rest. And then TCP/IP came out of the open systems interconnect era. >> Mm-hmm. >> Revolutionary, (indistinct) a big part of that, as well as my school. But it didn't standardize the whole stack. It stopped at TCP and IP. >> Yeah. >> But that helped interoperability; that created a nice de facto standard. So this is a big part of this mid-game. I call it the chessboard, you know, you've got the opening game and mid-game, then you get the end game. We're not at the end game yet in cloud. But cloud- >> There's always some form of lock-in, right? Andy Jassy will address it, you know, when making a decision. But if you're going to make a decision, you want to reduce... you don't want to be limited, right? So I would advise a customer that there could be limitations with a proprietary architecture. And if you look at what every customer's trying to become right now, it's an AI-driven business, right? And so it has to do with: can you get that data out of silos? Can you organize it and secure it? And then can you work with data scientists to feed those models? >> Yeah. >> In a very consistent manner. And so the tools of tomorrow, to Joel's point, will be open, and we want interoperability with those tools. >> And choice matters too. And I would say that, you know, the argument for why I think Amazon is not as locked in as maybe some other clouds is that they have to compete directly too. Redshift competes directly with a lot of other stuff. But they can't play the bundling game, because the customers are getting savvy to the fact that if you try to bundle an inferior product with something else, it may not work great at all. And they're onto it. This is the- >> To Amazon's credit, by having these solutions that may compete with native services in marketplace, they are providing customers with choice, low price- >> And access to the core value, which is the hardware- >> Exactly. >> Which is their platform. Okay. So I want to get you guys' thoughts on something else I see emerging. This is, again, kind of a Cube rumination moment. So on stage, Chris unpacked a lot of stuff. I mean, this marketplace, they're touching a lot of hot buttons here, you know: pricing, compensation, workflows, services behind the curtain. And one of the things he mentioned was resellers, or channel partners, depending on what you call them. We believe, Dave and I believe on the Cube, that the entire indirect sales channel of the industry is going to be disrupted radically, because those players were selling hardware in the old days, and software. That game is going to change. You mentioned you guys have a program; let me get your thoughts on this. We believe that once this gets set up, they can play in this game and bring their services in, which means that the old reseller channels are going to be rewritten. They're going to be refactored with these new kinds of access, because you've got scale, you've got money, you've got product, and you've got customers coming into the marketplace. So if you're like a reseller that sold computers to data centers, or software, you know, a value-added reseller, a VAR business- >> You've got to evolve.
>> You've got to be here. >> Yes. >> Yeah. >> How are you guys working with those partners? Because you said you have a program in your marketplace there. How do I make money if I'm a reseller, with Databricks, with Amazon? Take me through that use case. >> Well, I'll let Joel comment, but I think it's pretty straightforward, right? Customers need expertise. They need know-how. When we're seeing customers do mass migrations to the cloud, or Hadoop-specific migrations, or data transformation implementations, they need expertise from consulting and SI partners. If those consulting and SI partners happen to resell the solution as well, well, that's another aspect of their business. But I really think it is the expertise that the partners bring to help customers get outcomes. >> Joel, channel: big opportunity for Amazon to reimagine this. >> For sure. Yeah. And I think, you know, to your comment about how resellers take advantage of that, what Jack was pushing on is spot on. It's becoming more and more about the expertise you bring to the table, and not just transacting the software, but actually helping customers make the right choices. And we're seeing, you know, SIs begin to be able to resell solutions and finding a lot of opportunity in that. >> Yeah. And I think we're seeing traditional resellers begin to move into that SI model as well. And that's going to be the evolution this goes through. >> At the end of the day, it's about services, right? >> For sure. Yeah. >> I mean... >> You've got a great service, you're going to have high gross profits. >> Yeah. >> The managed service provider business is alive and well, right? Because there are a number of customers that want that type of a service. >> I think that's going to be a really hot button for you guys. I think the way you guys are open, this channel partner services model coming into the fold, really makes for kind of that supercloud-like experience, where you guys now have an ecosystem. And that's my next question. You guys have an ecosystem going on within Databricks. >> For sure. >> On top of this ecosystem. How does that work? This is kind of like... it hasn't been written up in business school case studies yet. This is new. What is this? >> I think, you know, what it comes down to is you're seeing ecosystems begin to evolve around the data platforms, and that's going to be one of the big, kind of, new horizons for us as we think about what drives ecosystems. It's going to be around: what's the data platform that I'm using, and then all the tools that have to encircle that to get my business done. And so I think there are, you know, absolutely ecosystems inside of the AWS business, on all of AWS's services across data, analytics, and AI. And then, to your point, you are seeing ecosystems now arise around Databricks and its Lakehouse platform as well, as customers are looking at, well, if I'm standing these lakehouses up and I'm beginning to invest in this, then I need a whole set of tools that help me get that done as well. >> I mean, you think about ecosystem theory, we're living a whole other dream. And I'm not kidding. It hasn't yet been written up in business school case studies: we're now in a whole other connective tissue, ecology thing happening, where you have dependencies and value propositions, economics, connectedness. So you have relationships in these ecosystems.
>> And I think one of the great things about the relationships in these ecosystems is that there's a high degree of overlap. >> Yeah. >> So you're seeing that, you know, the way the cloud business is evolving, the ecosystem partners of Databricks are the same ecosystem partners of AWS. And so as you build these platforms out into the cloud, you're able to really take advantage of best of breed, the broadest set of solutions out there for you. >> Joel, Jack, I love it, because you know what it means? The best ecosystem will win, if you keep it open. >> Sure, sure. >> You can see everything. If you're going to do it in the dark, you know, you don't know the outcome. I mean, this is really kind of what we're talking about. >> And John, can I just add that when I was at Amazon, we had a theory that there are buyers and builders, right? There are very innovative companies that want to build things themselves. We're seeing now that builders want to buy a platform, right? >> Yeah. >> And so there's a platform decision being made, and that ecosystem is going to evolve around the platform. >> Yeah, and I totally agree. And the word innovation gets kicked around. That's why, you know, when we had our Supercloud panel, it was called the innovator's dilemma, with a slash through it, called the integrator's dilemma. Innovation is the digital transformation. So- >> Absolutely. >> Like, that becomes cliche in a way, but it really becomes more of: are you open? Are you integrating? If APIs are the connective tissue, what does automation look like, what do the service meshes look like? I mean, a whole other set of thinking goes on in these new ecosystems and these new products. >> And that thinking has been born in Delta Sharing, right? So the idea that you can have a multi-cloud implementation of Databricks, and actually share data between those two different clouds, that is the next layer on top of the native cloud solution. >> Well, Databricks has done a good job of building on top of the goodness of, and the CapEx gift from, AWS. But you guys have done a great job taking that and building differentiation into the product. You guys have a great customer base, a great growing ecosystem, and again, I think a shining example of what every enterprise is going to do: build on top of something, get that operating model driving revenue. >> Mm-hmm. >> Yeah. >> Whether you're Goldman Sachs or Capital One or XYZ Corporation. >> S&P Global, NASDAQ. >> Yeah.
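For readers curious what the Delta Sharing idea Jack raises looks like from the recipient's side, here is a minimal sketch using the open source delta-sharing Python client. The profile file and the share, schema, and table names are hypothetical placeholders a provider would supply; the recipient can sit in a different account or cloud entirely.

```python
import delta_sharing  # pip install delta-sharing

# A provider hands the recipient a small JSON profile file with an endpoint
# and bearer token; the names below are illustrative placeholders.
profile = "config.share"
table_url = profile + "#sales_share.transactions.orders"

# List everything the provider has shared with us.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Pull the shared table into a local pandas DataFrame, across cloud and
# account boundaries, via the open Delta Sharing protocol.
orders = delta_sharing.load_as_pandas(table_url)
print(orders.head())
```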
The Cube's got it covered. This should be a very big, growing ecosystem as this continues. Billions of being sold through the marketplace. And of course the buyers are happy as well. So we've got it all covered. I'm John Furry. your host of the cube. Thanks for watching. (upbeat music)
Jack Andersen & Joel Minnick, Databricks | AWS Marketplace Seller Conference 2022
>>Welcome back everyone to the cubes coverage here in Seattle, Washington, AWS's marketplace seller conference. It's the big news within the Amazon partner network, combining with marketplaces, forming the Amazon partner organization, part of a big reorg as they grow the next level NextGen cloud mid-game on the chessboard. Cube's got cover. I'm John fur, host of Cub, a great guests here from data bricks, both cube alumnis, Jack Anderson, GM of the and VP of the data bricks partnership team. For ADOS, you handle that relationship and Joel Minick vice president of product and partner marketing. You guys are the, have the keys to the kingdom with data, bricks, and AWS. Thanks for joining. Thanks for good to see you again. Thanks for >>Having us back. Yeah, John, great to be here. >>So I feel like we're at reinvent 2013 small event, no stage, but there's a real shift happening with procurement. Obviously it makes it's a no brainer on the micro, you know, people should be buying online self-service cloud scale, but Amazon's got billions being sold to their marketplace. They've reorganized their partner network. You can see kind of what's going on. They've kind of figured it out. Like let's put everything together and simplify and make it less of a website marketplace merge our partner to have more synergy and friction, less experiences so everyone can make more money and customer's gonna be happier. >>Yeah, that's right. >>I mean, you're run relationship. You're in the middle of it. >>Well, Amazon's mental model here is that they want the world's best ISVs to operate on AWS so that we can collaborate and co architect on behalf of customers. And that's exactly what the APO and marketplace allow us to do is to work with Amazon on these really, you know, unique use cases. >>You know, I interviewed Ali many times over the years. I remember many years ago, I think six, maybe six, seven years ago, we were talking. He's like, we're all in ons. Obviously. Now the success of data bricks, you've got multiple clouds. See that customers have choice, but I remember the strategy early on. It was like, we're gonna be deep. So this is speaks volumes to the, the relationship you have years. Jack take us through the relationship that data bricks has with AWS from a, from a partner perspective, Joel, and from a product perspective, because it's not like you got to Johnny come lately new to the new, to the scene, right? We've been there almost president creation of this wave. What's the relationship and has it relate to what's going on today? >>So, so most people may not know that data bricks was born on AWS. We actually did our first 100 million of revenue on Amazon. And today we're obviously available on multiple clouds, but we're very fond of our Amazon relationship. And when you look at what the APN allows us to do, you know, we're able to expand our reach and co-sell with Amazon and marketplace broadens our reach. And so we think of marketplace in three different aspects. We've got the marketplace, private offer business, which we've been doing for a number of years. Matter of fact, we we're driving well over a hundred percent year over year growth in private offers and we have a nine figure business. So it's a very significant business. And when a customer uses a private offer that private offer counts against their private pricing agreement with AWS. So they get pricing power against their, their private pricing. >>So it's really important. It goes on their Amazon bill in may. 
We launched our pay as you go on demand offering. And in five short months, we have well over a thousand subscribers. And what this does is it really reduces the barriers to entry it's low friction. So anybody in an enterprise or startup or public sector company can start to use data bricks on AWS and pay consumption based model and have it go against their monthly bill. And so we see customers, you know, doing rapid experimentation pilots, POCs, they're, they're really learning the value of that first use case. And then we see rapid use case expansion. And the third aspect is the consulting partner, private offers C P O super important in how we involve our partner ecosystem of our consulting partners and our resellers that are able to work with data bricks on behalf of customers. >>So you got the big contracts with the private offer. You got the product market fit, kind of people iterating with data coming in with, with the buyers you go. And obviously the integration piece all fitting in there. Exactly. Exactly. Okay. So that's that those are the offers that's current and what's in marketplace today. Is that the products, what are, what are people buying? I mean, I guess what's the Joel, what are, what are people buying in the marketplace and what does it mean for >>Them? So fundamentally what they're buying is the ability to take silos out of their organization. And that's, that is the problem that data bricks is out there to solve, which is when you look across your data landscape today, you've got unstructured data, you've got structured data, you've got real time streaming data, and your teams are trying to use all of this data to solve really complicated problems. And as data bricks as the lake house company, what we're helping customers do is how do they get into the new world? How do they move to a place where they can use all of that data across all of their teams? And so we allow them to begin to find through the marketplace, those rapid adoption use cases where they can get rid of these data, warehousing data lake silos they've had in the past, get their unstructured and structured data onto one data platform and open data platform that is no longer adherent to any proprietary formats and standards and something. >>They can very much, very easily integrate into the rest of their data environment, apply one common data governance layer on top of that. So that from the time they ingest that data to the time they use that data to the time they share that data inside and outside of their organization, they know exactly how it's flowing. They know where it came from. They know who's using it. They know who has access to it. They know how it's changing. And then with that common data platform with that common governance solution, they'd being able to bring all of those use cases together across their real time, streaming their data engineering, their BI, their AI, all of their teams working on one set of data. And that lets them move really, really fast. And it also lets them solve challenges. They just couldn't solve before a good example of this, you know, one of the world's now largest data streaming platforms runs on data bricks with AWS. >>And if you think about what does it take to set that up? Well, they've got all this customer data that was historically inside of data warehouses, that they have to understand who their customers are. 
They have all this unstructured data, they've built their data science model, so they can do the right kinds of recommendation engines and forecasting around. And then they've got all this streaming data going back and forth between click stream data from what the customers are doing with their platform and the recommendations they wanna push back out. And if those teams were all working in individual silos, building these kinds of platforms would be extraordinarily slow and complex, but by building it on data bricks, they were able to release it in record time and have grown at, at record pace >>To not be that's product platform that's impacting product development. Absolutely. I mean, this is like the difference between lagging months of product development to like days. Yes. Pretty much what you're getting at. Yeah. So total agility. I got that. Okay. Now I'm a customer I wanna buy in the marketplace, but I also, you got direct Salesforce up there. So how do you guys look at this? Is there channel conflict? Are there comp programs? Because one of the things I heard today in on the stage from a Davis's leadership, Chris was up there speaking and, and, and moment I was, Hey, he's a CRO conference, chief revenue officer conversation, which means someone's getting compensated. So if I'm the sales rep at data bricks, what's my motion to the customer. Do I get paid? Does Amazon sell it? Take us through that. Is there channel conflict? Is there or an audio lift? >>Well, I I'd add what Joel just talked about with, with, you know, what the solution, the value of the solution our entire offering is available on AWS marketplace. So it's not a subset, the entire data bricks offering and >>The flagship, all the, the top, >>Everything, the flagship, the complete offering. So it's not, it's not segmented. It's not a sub segment. It's it's, you know, you can use all of our different offerings. Now when it comes to seller compensation, we, we, we view this two, two different ways, right? One is that AWS is also incented, right? Versus selling a native service to recommend data bricks for the right situation. Same thing with data bricks. Our Salesforce wants to do the right thing for the customer. If the customer wants to use marketplace as their procurement vehicle. And that really helps customers because if you get data bricks and five other ISVs together, and let's say each ISV is spending, you're spending a million dollars, you have $5 million of spend, you put that spend through the flywheel with AWS marketplace. And then you can use that in your negotiations with AWS to get better pricing overall. So that's how we, >>We do it. So customers are driving. This sounds like, correct. For sure. So they're looking at this as saying, Hey, I'm gonna just get purchasing power with all my relationships because it's a solution architectural market, right? >>Yeah. It makes sense. Because if most customers will have a primary and secondary cloud provider, if they can consolidate, you know, multiple ISV spend through that same primary provider, you get pricing >>Power, okay, Jill, we're gonna date ourselves. At least I will. So back in the old days, it used to be, do a Barney deal with someone, Hey, let's go to market together. You gotta get paper, you do a biz dev deal. And then you gotta say, okay, now let's coordinate our sales teams, a lot of moving parts. 
So what you're getting at here is that the alternative for data bricks or any company is to go find those partners and do deals versus now Amazon is the center point for the customer so that you can still do those joint deals. But this seems to be flipping the script a little bit. >>Well, it is, but we still have VAs and consulting partners that are doing implementation work very valuable work advisory work that can actually work with marketplace through the C PPO offering. So the marketplace allows multiple ways to procure your >>Solution. So it doesn't change your business structure. It just makes it more efficient. That's >>Correct. >>That's a great way to say it. Yeah, >>That's great. So that's so that's it. So that's just makes it more efficient. So you guys are actually incented to point customers to the marketplace. >>Yes, >>Absolutely. Economically. Yeah. >>E economically it's the right thing to do for the customer. It's the right thing to do for our relationship with Amazon, especially when it comes back to co-selling right? Because Amazon now is leaning in with ISVs and making recommendations for, you know, an ISV solution and our teams are working backwards from those use cases, you know, to collaborate, land them. >>Yeah. I want, I wanna get that out there. Go ahead, Joel. >>So one of the other things I might add to that too, you know, and why this is advantageous for, for companies like data bricks to, to work through the marketplace, is it makes it so much easier for customers to deploy a solution. It's, it's very, literally one click through the marketplace to get data bricks stood up inside of your environment. And so if you're looking at how do I help customers most rapidly adopt these solutions in the AWS cloud, the marketplace is a fantastic accelerator to that. You >>Know, it's interesting. I wanna bring this up and get your reaction to it because to me, I think this is the future of procurement. So from a procurement standpoint, I mean, again, dating myself EDI back in the old days, you know, all that craziness. Now this is all the, all the internet, basically through the console, I get the infrastructure side, you know, spin up and provision. Some servers, all been good. You guys have played well there in the marketplace. But now as we get into more of what I call the business apps, and they brought this up on stage little nuance, most enterprises aren't yet there of integrating tech on the business apps, into the stack. This is where I think you guys are a use case of success where you guys have been successful with data integration. It's an integrator's dilemma, not an innovator's dilemma. So like, I want to integrate, so now I have integration points with data bricks, but I want to put an app in there. I want to provision an application, but it has to be built. It's not, you don't buy it. You build, you gotta build stuff. And this is the nuance. What's your reaction to that? Am I getting this right? Or, or am I off because no, one's gonna be buying software. Like they used to, they buy software to integrate it. >>Yeah, >>No, I, cause everything's integrated. >>I think AWS has done a great job at creating a partner ecosystem, right. To give customers the right tools for the right jobs. And those might be with third parties, data bricks is doing the same thing with our partner connect program. Right. We've got customer, customer partners like five tra and D V T that, you know, augment and enhance our platform. 
And so you, you're looking at multi ISV architectures and all of that can be procured through the AWS marketplace. >>Yeah. It's almost like, you know, bundling and unbundling. I was talking about this with, with Dave ante about Supercloud, which is why wouldn't a customer want the best solution in their architecture period. And it's class. If someone's got API security or an API gateway. Well, you know, I don't wanna be forced to buy something because it's part of a suite and that's where you see things get suboptimized where someone dominates a category and they have, oh, you gotta buy my version of this. Yeah. >>Joel, Joel. And that's Joel and I were talking, we're actually saying what what's really important about Databricks is that customers control the data. Right? You wanna comment on that? >>Yeah. I was say the, you know what you're pushing on there we think is extraordinarily, you know, the way the market is gonna go is that customers want a lot of control over how they build their data stack. And everyone's unique in what tools are the right ones for them. And so one of the, you know, philosophically I think really strong places, data, bricks, and AWS have lined up is we both take an approach that you should be able to have maximum flexibility on the platform. And as we think about the lake house, one thing we've always been extremely committed to as a company is building the data platform on an open foundation. And we do that primarily through Delta lake and making sure that to Jack's point with data bricks, the data is always in your control. And then it's always stored in a completely open format. And that is one of the things that's allowed data bricks to have the breadth of integrations that it has with all the other data tools out there, because you're not tied into any proprietary format, but instead are able to take advantage of all the innovation that's happening out there in the open source ecosystem. >>When you see other solutions out there that aren't as open as you guys, you guys are very open by the way, we love that too. We think that's a great strategy, but what's the, what am I foreclosing? If I go with something else that's not as open what what's the customer's downside as you think about what's around the corner in the industry. Cuz if you believe it's gonna be open, open source, which I think opens our software is the software industry and integration is a big deal, cuz software's gonna be plentiful. Let's face it. It's a good time to be in software business, but cloud's booming. So what's the downside from your data bricks perspective, you see a buyer clicking on data bricks versus that alternative what's potentially is should they be a nervous about down the road if they go with a more proprietary or locked in approach? Well, >>I think the challenge with proprietary ecosystems is you become beholden to the ability of that provider to both build relationships and convince other vendors that they should invest in that format. But you're also then beholden to the pace at which that provider is able to innovate. And I think we've seen lots of times over history where, you know, a proprietary format may run ahead for a while on a lot of innovation. But as that market control begins to solidify that desire to innovate begins to, to degrade, whereas in the open format. So >>Extract rents versus innovation. Exactly. >>Yeah, exactly. >>But >>I'll say it in the open world, you know, you have to continue to innovate. Yeah. 
>> When you see other solutions out there that aren't as open as you guys, and you guys are very open, by the way, we love that too, we think that's a great strategy, what am I foreclosing if I go with something else that's not as open? What's the customer's downside, as you think about what's around the corner in the industry? Because if you believe it's going to be open source, and I think open source is the software industry, and integration is a big deal, because software's going to be plentiful. Let's face it, it's a good time to be in the software business, and cloud's booming. So what's the downside, from your Databricks perspective? If you see a buyer clicking on Databricks versus that alternative, should they be nervous down the road if they go with a more proprietary or locked-in approach? >> Well, I think the challenge with proprietary ecosystems is that you become beholden to the ability of that provider to both build relationships and convince other vendors that they should invest in that format. But you're also then beholden to the pace at which that provider is able to innovate. And I think we've seen lots of times over history where a proprietary format may run ahead for a while on a lot of innovation, but as that market control begins to solidify, that desire to innovate begins to degrade, whereas in the open format... >> Extract rents versus innovation. >> Exactly. >> Yeah, exactly. >> And I'll say, in the open world, you have to continue to innovate, and the open source world is always innovating. If you look at the last 10 to 15 years, I challenge you to find an example where the innovation in the data and AI world is not coming from open source. And so by investing in open ecosystems, that means you are always going to be at the forefront of what is the latest. >> You know, again, not to date myself, but look back at the eighties and nineties: the protocol stacks were proprietary. SNA at IBM, DECnet at Digital, the rest is history. And then TCP/IP was part of the open systems interconnect revolution, a big part of that as well. But it didn't standardize the whole stack; it stopped at TCP and IP. But that helped things interoperate; that created a nice de facto standard. So this is a big part of this mid-game. I call it the chessboard: you've got the opening game and the mid-game, then you've got the end game, and we're not at the end game yet in the cloud. >> There's always some form of lock-in, right? Andy Jassy will address it: when making a decision, you want to reduce lock-in; you don't want to be limited. So I would advise a customer that there could be limitations with a proprietary architecture. And if you look at what every customer is trying to become right now, it's an AI-driven business, right? And so it has to do with, can you get that data out of silos? Can you organize it and secure it? And then can you work with data scientists to feed those models in a very consistent manner? The tools of tomorrow, to Joel's point, will be open, and we want interoperability with those tools. >> And choice matters too. And I would say that the argument for why I think Amazon is not as locked-in as maybe some other clouds is that they have to compete directly too. Redshift competes directly with a lot of other stuff. But they can't play the bundling game, because the customers are getting savvy to the fact that if you try to bundle an inferior product with something else, it may not work great at all, and they're going to be onto it. >> And to Amazon's credit, by having these solutions that may compete with native services in the marketplace, they are providing customers with choice. >> Low price and access to the core value. >> Exactly, which is the hardware, which is their platform. >> Okay, so I want to get your thoughts on something else I see emerging. This is, again, kind of a CUBE rumination moment. So on stage, Chris unpacked a lot of stuff. This marketplace is touching a lot of hot buttons here: pricing, compensation, workflows, services behind the curtain. And one of the things he mentioned was resellers, or channel partners, depending on how you talk about it. Dave and I believe, on theCUBE, that the entire indirect sales channel of the industry is going to be disrupted radically, because those players were selling hardware and software in the old days, and that game is going to change. You mentioned you guys have a program; I want to get your thoughts on this. We believe that once this gets set up, they can play in this game and bring their services in, which means that the old reseller channels are going to be rewritten. They're going to be refactored with these new kinds of access,
because you've got scale, you've got money, you've got product, and you've got customers coming into the marketplace. So if you're a reseller that sold computers to data centers, or software, a value-added reseller or VAR business... >> You've got to evolve. >> You've got to be here, yes. So how are you guys working with those partners? Because you say you have a part for them in your marketplace there. How do I make money if I'm a reseller, with Databricks and with Amazon? Take me through that use case. >> Well, I'll let Joel comment, but I think it's pretty straightforward, right? Customers need expertise; they need know-how. When we're seeing customers do mass migrations to the cloud, or Hadoop-specific migrations, or data transformation implementations, they need expertise from consulting and SI partners. If those consulting and SI partners happen to resell the solution as well, well, that's another aspect of their business, but I really think it is the expertise that the partners bring to help customers get outcomes. >> Joel, the channel: a big opportunity for Amazon to reimagine this. >> For sure, yeah. And to your comment about how resellers take advantage of that, I think what Jack was pushing on is spot on, which is that it's becoming more and more about the expertise you bring to the table, and not just transacting the software, but now actually helping customers make the right choices. And we're seeing SIs begin to be able to resell solutions and finding a lot of opportunity in that, and I think we're seeing traditional resellers begin to move into that SI model as well. And that's going to be the evolution this gets to. >> At the end of the day, it's about services. >> For sure, for sure. >> You've got a great service, you're going to have high gross profits. >> And I think the managed service provider business is alive and well, right? Because there are a number of customers that want that type of a service. >> I think that's going to be a really hot button for you guys. I think the way you guys are open, with this channel partner services model coming into the fold, really makes for that supercloud-like experience, where you guys now have an ecosystem. And that's my next question: you guys have an ecosystem going on within Databricks, for sure, on top of this ecosystem. How does that work? This is kind of like, it hasn't been written up in business school case studies yet; this is new. What is this? >> I think what it comes down to is that you're seeing ecosystems begin to evolve around the data platforms, and that's going to be one of the big new horizons for us. As we think about what drives ecosystems, it's going to be around, well, what's the data platform that I'm using, and then all the tools that have to encircle that to get my business done. And so I think there are absolutely ecosystems inside of the AWS business, on all of AWS's services across data, analytics, and AI. And then, to your point, you are seeing ecosystems now arise around Databricks and its lakehouse platform as well, as customers are looking at: well, if I'm standing these lakehouses up and I'm beginning to invest in this, then I need a whole set of tools that help me get that done as well.
>> I mean, if you think about ecosystem theory, we're living a whole other dream, and I'm not kidding; it hasn't yet been written up for business school case studies. We're now in a whole other connective-tissue ecology, where you have dependencies, value-proposition economics, connectedness. So you have relationships in these ecosystems. >> And I think one of the great things about relationships with these ecosystems is that there's a high degree of overlap. So you're seeing that, the way the cloud business is evolving, the ecosystem partners of Databricks are the same ecosystem partners as AWS's. And so as you build these platforms out into the cloud, you're able to really take advantage of best of breed, the broadest set of solutions out there for you. >> Joel, Jack, I love it, because you know what it means? The best ecosystem will win, if you keep it open. Sure, you can see everything. If you're going to do it in the dark, you don't know the outcome. I mean, that's really what we're talking about. >> And John, can I just add that when I was at Amazon, we had a theory that there are buyers and builders, right? There are very innovative companies that want to build things themselves. We're seeing now that those builders want to buy a platform, right? And so there's a platform decision being made, and an ecosystem is going to evolve around the platform. >> Yeah, and I totally agree. And the word innovation gets kicked around. That's why, when we had our Supercloud panel, it was called 'The Innovator's Dilemma,' with a slash through it, renamed 'The Integrator's Dilemma.' Innovation is the digital transformation, so absolutely, that becomes cliche in a way, but it really becomes more of, are you open? Are you integrating? If APIs are the connective tissue, what does automation look like, what does the services mesh look like? A whole other set of thinking goes on in these new ecosystems and these new products. >> And that thinking has been born out in Delta Sharing, right? So the idea that you can have a multi-cloud implementation of Databricks, and actually share data between those two different clouds: that is the next layer on top of the native cloud solution. >> Well, Databricks has done a good job of building on top of the goodness of, and the CapEx gift from, AWS, but you guys have done a great job taking that and building differentiation into the product. You guys have a great customer base, a great growing ecosystem, and again, I think you're a shining example of what every enterprise is going to do: build on top of something, get that operating model, and drive revenue. >> Yeah, whether you're Goldman Sachs, or Capital One, or XYZ Corporation... >> S&P Global, NASDAQ, right? The biggest verticals in the world are solving tough problems with Databricks. I think we'd be remiss, because if Ali were here, he would really want to thank Amazon for all of the investments across all of the different functions, whether it's the relationship we have with our engineering and service teams, our marketing teams, product development. And we're going to be at re:Invent, a big presence at re:Invent; we're looking forward to seeing you there again.
>> Yeah, we'll see you guys there. Again, good ecosystem. I love the ecosystem evolutions happening; this next-gen cloud is here. We're seeing this evolve: new economics, new value propositions, scaling up, producing more. So you guys are doing a great job. Thanks for coming on theCUBE and taking the time. >> Great to see you. Thanks for having us. >> Okay, theCUBE coverage here. The world's changing, as the APN combines with the marketplace in a new partner organization at Amazon Web Services, and theCUBE's got it covered. This should be a very big, growing ecosystem as this continues, with billions being sold through the marketplace. And of course, the buyers are happy as well. So we've got it all covered. I'm John Furrier, your host of theCUBE. Thanks for watching.
Breaking Analysis: Further defining Supercloud w/ tech leaders VMware, Snowflake, Databricks & others
>> From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante. >> At our inaugural Supercloud 22 event, we further refined the concept of a supercloud, iterating on the definition, the salient attributes, and some examples of what is and what is not a supercloud. Welcome to this week's wikibon theCUBE Insights, powered by ETR. You know, Snowflake has always been what we feel is one of the strongest examples of a supercloud, and in this Breaking Analysis from our studios in Palo Alto, we unpack our interview with Benoit Dageville, co-founder and president of products at Snowflake, and we test our supercloud definition on the company's data cloud platform. And we're really looking forward to your feedback. First, let's examine how we define supercloud. Very importantly, one of the goals of Supercloud 22 was to get the community's input on the definition and iterate on previous work. Supercloud is an emerging computing architecture that comprises a set of services which are abstracted from the underlying primitives of hyperscale clouds. We're talking about services such as compute, storage, networking, security, and other native tooling like machine learning and developer tools, to create a global system that spans more than one cloud. Supercloud, as shown on this slide, has five essential properties, X number of deployment models, and Y number of service models. We're looking for community input on X and Y, and on the first point as well, so please weigh in and contribute. Now, we've identified these five essential elements of a supercloud, so let's talk about them. First, the supercloud has to run its services on more than one cloud, leveraging the cloud-native tools offered by each of the cloud providers. The builder of the supercloud platform is responsible for optimizing the underlying primitives of each cloud, and optimizing for the specific needs, be it cost, or performance, or latency, or governance, data sharing, security, etc. But those primitives must be abstracted such that a common experience is delivered across the clouds for both users and developers. The supercloud has a metadata intelligence layer that can maximize efficiency for the specific purpose of the supercloud, i.e., the purpose that the supercloud is intended for, and it does so in a federated model. And it includes what we call a superPaaS. This is a prerequisite: a purpose-built component that enables ecosystem partners to customize and monetize incremental services, while at the same time ensuring that the common experiences exist across clouds. Now, in terms of deployment models, we'd really like to get more feedback on this piece, but here's where we are so far, based on the feedback we got at Supercloud 22. We see three deployment models. The first is one where a control plane may run on one cloud but supports data plane interactions with more than one other cloud. The second model instantiates the supercloud services on each individual cloud, and within regions, and can support interactions across more than one cloud, with a unified interface connecting those instances to create a common experience. And the third model superimposes its services as a layer, or, in the case of Snowflake, they call it a mesh, on top of the cloud providers' region or regions, with a single global instantiation of those services which spans multiple cloud providers. This is our understanding, from the conversation with Benoit Dageville, of how Snowflake approaches its solution. And for now, we're going to park the service models; we need more time to flesh that out, and we'll propose something shortly for you to comment on.
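As a quick aid for comparing the three options, here's a tiny Python sketch that simply encodes the deployment models described above as data; the names and one-line summaries are our shorthand, not an official ETR or vendor taxonomy:

```python
from dataclasses import dataclass

# A compact encoding of the three supercloud deployment models sketched
# above. The labels are our own shorthand, not a formal taxonomy.
@dataclass(frozen=True)
class DeploymentModel:
    name: str
    control_plane: str
    data_plane: str

MODELS = (
    DeploymentModel(
        name="single control plane",
        control_plane="runs on one cloud",
        data_plane="interacts with more than one other cloud",
    ),
    DeploymentModel(
        name="per-cloud instantiation",
        control_plane="unified interface connects the instances",
        data_plane="services instantiated on each cloud and region",
    ),
    DeploymentModel(
        name="global mesh (the Snowflake approach)",
        control_plane="one global instantiation spanning providers",
        data_plane="superimposed on top of the providers' regions",
    ),
)

for m in MODELS:
    print(f"{m.name}: control plane {m.control_plane}; data plane {m.data_plane}")
```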
Now, we peppered Benoit Dageville at Supercloud 22 to test how the Snowflake data cloud aligns to our concepts and our definition. Let me also say that Snowflake doesn't use the term supercloud. They really want to respect, and not denigrate, the importance of their hyperscale partners, and nor do we. But we do think the hyperscalers today, anyway, are not building what we call superclouds; rather, the people who are building superclouds are building on top of hyperscale clouds. That is a prerequisite. So here are the questions that we tested with Snowflake. First question: how does Snowflake architect its data cloud, and what is its deployment model? Listen to Dageville talk about how Snowflake has architected a single system. Play the clip. >> There are several ways to do this, you know, superclouds, as you name them. The way we picked is to create one single system, and that's very important. There are several ways, right? You can instantiate your solution in every region of a cloud, and potentially that region could be AWS, that region could be GCP, so you are indeed a multi-cloud solution. But Snowflake, we did it differently. We are really creating cloud regions which are superposed on top of the cloud provider's region infrastructure. So we are building our regions, but where it's very different is that each region of Snowflake is not one instantiation of our service. Our service is global by nature. We can move data from one region to the other. When you land in Snowflake, you land into one region, but you can grow from there, and you can exist in multiple clouds at the same time. And that's very important: it's not different instantiations of a system, it's one single instantiation which covers many cloud regions and many cloud providers. >> Snowflake chose the most advanced level of the three deployment models Dageville talked about, presumably so it could maintain maximum control and ensure that common experience, like the iPhone model. Next, we probed about the technical enablers of the data cloud. Listen to Dageville talk about Snowgrid. He uses the term mesh, and this can get confusing with Zhamak Dehghani's data mesh concept, but listen to Benoit's explanation. >> Well, as I said, first we start by building Snowflake regions. We have today many regions that span the world, so it's a worldwide system with many regions, but all these regions are connected together. They are meshed together with our technology; we name it Snowgrid. And that makes it work: an Azure region can talk to an AWS region or GCP regions. And as a user of our cloud, you don't really see these regional differences, that regions are in different, potentially, clouds. When you use Snowflake, your presence as an organization can be in several regions, several clouds if you want, both geographic and cloud provider. >> So I can share data irrespective of the cloud, and I'm in the Snowflake data cloud, is that correct? I can do that today? >> Exactly, and that's very critical. What we wanted is to remove data silos. When you instantiate a system in one single region, and that system is locked in that region, you cannot communicate with other parts of the world; you are locking the data in one region, right? And we didn't want to do that. We wanted data to be distributed the way the customer wants it to be distributed across the world, and potentially sharing data at world scale. >> Now, maybe there are many ways to skin the cat, meaning perhaps if a platform does instantiate in multiple places, there are ways to share data, but this is how Snowflake chose to approach the problem. Next question: how do you deal with latency in this big global system? This is really important to us, because while Snowflake has some really smart people working as engineers and the like, we don't think they've solved the speed-of-light problem, though they have the best people working on it, as we often joke. Listen to Benoit Dageville's comments on this topic. >> So, yes and no. The way we do it, it's very expensive, because generally, if you want to join data which are in different regions and different clouds, it's going to be very expensive, because you need to move data every time you join it. So the way we do it is that you replicate the subset of data that you want to access from one region in other regions. So you can create this data mesh, but data is replicated to make it very cheap and very performant too. >> And does Snowgrid have the metadata intelligence? >> Yes. >> Can you describe that a little bit? >> Yeah, Snowgrid is both a way to exchange metadata: each region of Snowflake knows about all the other regions of Snowflake. Every time we create a new region, the metadata is distributed over our data cloud. Not only does a region know all the regions, but it knows every organization that exists in our clouds, where this organization is, where data can be replicated by this organization. And then, of course, it's also used as a way to exchange data, so you can exchange data at scale, in terms of data size. I was just receiving an email from one of our customers who moved more than four petabytes of data cross-region, cross cloud providers, in a few days. It's a lot of data, so it takes some time to move, but they were able to do that completely online, and then switch over to the other region. Failover is very important also. >> So, 'yes and no' probably means, typically, no. It sounds like Snowflake is selectively pulling small amounts of data and replicating it where necessary. But you also heard him talk about the metadata layer, which is one of the essential aspects of supercloud.
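For a rough sense of what "replicate the subset of data you want to access" can look like in practice, here is a hedged sketch using the Python Snowflake connector and Snowflake's documented database-replication SQL. The org, account, credential, and database names are all made up; this is our illustration, not anything shown in the interview:

```python
import snowflake.connector

# A hedged sketch of replicating a subset of data (one database) across
# regions and clouds, following Snowflake's documented database-replication
# SQL. Org, account, credential, and object names are illustrative.

# On the source account (say, AWS US East): allow a replica elsewhere.
src = snowflake.connector.connect(
    account="myorg-aws_us_east", user="admin", password="***"
)
src.cursor().execute(
    "ALTER DATABASE sales ENABLE REPLICATION TO ACCOUNTS myorg.azure_west_europe"
)

# On the target account (say, Azure West Europe): create and refresh the replica.
dst = snowflake.connector.connect(
    account="myorg-azure_west_europe", user="admin", password="***"
)
cur = dst.cursor()
cur.execute("CREATE DATABASE sales_replica AS REPLICA OF myorg.aws_us_east.sales")
cur.execute("ALTER DATABASE sales_replica REFRESH")  # incremental, not a full copy
```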
Okay, next we dug into security. It's one of the most important issues, and we think one of the hardest parts related to deploying supercloud. We've talked about how the cloud has become the first line of defense for the CISO, but now, with multi-cloud, you have multiple first lines of defense, and that means multiple shared-responsibility models, multiple tool sets from different cloud providers, and an expanded threat surface. So listen to Benoit's explanation here; please play the clip. >> This is a great question. Security has always been the most important aspect of Snowflake since day one. This is the question that every customer of ours has: how can you guarantee the security of my data? And so we secure data really tightly in region. We have several layers of security. It starts by encrypting every data at rest, and that's very important. A lot of customers are not doing that. You hear about these attacks, for example, on cloud, where someone left their buckets open, and then you can access the data because it's non-encrypted. So we are encrypting everything at rest. We are encrypting everything in transit. So a region is very secure. Now, you never, from one region, access data from another region in Snowflake. That's why also we replicate data. Now, the replication of that data across regions, or the metadata for that matter, is really highly secure. Snowgrid ensures that everything is encrypted; we have multiple encryption keys, and they're stored in hardware security modules. So we built Snowgrid such that it's secure and it allows very secure movement of data.
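As a toy illustration of the "encrypt everything at rest" layer Dageville describes, here is a short sketch using the Python cryptography package. It is application-level and deliberately simplified; Snowflake's real design uses hierarchical keys held in hardware security modules, which this does not attempt to model:

```python
from cryptography.fernet import Fernet

# A toy, application-level illustration of encrypting data before it ever
# lands at rest. Real systems wrap data keys with master keys held in
# HSMs/KMS; this sketch does not model that hierarchy.
data_key = Fernet.generate_key()   # in practice: generated and wrapped by a KMS/HSM
f = Fernet(data_key)

record = b'{"customer": 42, "balance": 1337}'
ciphertext = f.encrypt(record)     # these are the bytes that would hit the bucket

# Even if the bucket were accidentally left open, the ciphertext is useless
# without the key.
assert f.decrypt(ciphertext) == record
```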
So when we heard this explanation, we immediately went to the lowest-common-denominator question, meaning: when you think about how AWS, for instance, deals with data in motion or data at rest, it might be different from how another cloud provider deals with it. So how does Snowflake deal with differences, for example, in the AWS maturity model for various cloud capabilities? Let's say they've got a faster Nitro or Graviton. Does Snowflake have to slow everything else down, like a caravan crossing the desert so every truck can keep up? Let's listen. >> It's a great question. Of course, our software is abstracting all the cloud providers' infrastructure, so that when you run in one region, let's say AWS or Azure, it doesn't make any difference as far as the applications are concerned. And this abstraction, of course, is a lot of work. I mean, really a lot of work, because it needs to be secure, it needs to be performant, on every cloud, and it has to expose APIs which are uniform. And the cloud providers, even though they have potentially the same concepts, let's say blob storage, the APIs are completely different. The way these systems are secured is completely different. The errors that you can get, and the retry mechanisms, are very different from one cloud to the other. Performance is also different. We discovered that when we were starting to port our software: we had to completely rethink how to leverage blob storage in that cloud versus that cloud, just because of performance too. So we had, for example, to stripe data. All this work is work that you don't need to do as an application, because our vision really is that applications which are running in our data cloud can be abstracted from all these differences. And we provide all the services, all the workloads that these applications need, whether it's transactional access to data, analytical access to data, managing logs, managing metrics. All of this is abstracted too, such that they are not tied to one particular service of one cloud, and distributing these applications across many regions, many clouds, is very seamless. >> So from that answer, we know that Snowflake takes care of everything. We really don't understand the performance implications in that specific case, but we feel pretty certain that the promises Snowflake makes around governance and security, within their data sharing construct, will be kept.
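Here is a self-contained sketch of the kind of abstraction Dageville is describing: one uniform blob-storage API over per-cloud adapters whose retry and error behavior differ. The adapters are in-memory stubs rather than real AWS or Azure clients, and the tuning values are hypothetical; the point is the shape of the layer, not Snowflake's actual implementation:

```python
import abc
import time

# One uniform blob API over per-cloud adapters with different retry and
# backoff characteristics. The adapters are in-memory stubs; the tuning
# values are hypothetical.
class BlobStore(abc.ABC):
    max_retries = 3
    backoff_s = 0.1

    @abc.abstractmethod
    def _put(self, key: str, data: bytes) -> None: ...
    @abc.abstractmethod
    def _get(self, key: str) -> bytes: ...

    # The uniform API the application sees; retries are handled once, here.
    def put(self, key: str, data: bytes) -> None:
        self._retrying(self._put, key, data)

    def get(self, key: str) -> bytes:
        return self._retrying(self._get, key)

    def _retrying(self, op, *args):
        for attempt in range(self.max_retries):
            try:
                return op(*args)
            except IOError:
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(self.backoff_s * 2 ** attempt)  # per-cloud tuning

class S3Store(BlobStore):
    max_retries = 5            # hypothetical: this cloud throttles differently
    def __init__(self):
        self._blobs = {}
    def _put(self, key, data):
        self._blobs[key] = data
    def _get(self, key):
        return self._blobs[key]

class AzureBlobStore(BlobStore):
    backoff_s = 0.25           # hypothetical: different error/latency profile
    def __init__(self):
        self._blobs = {}
    def _put(self, key, data):
        self._blobs[key] = data
    def _get(self, key):
        return self._blobs[key]

# Application code never sees which cloud it is running on:
for store in (S3Store(), AzureBlobStore()):
    store.put("table/part-0000", b"...")
    assert store.get("table/part-0000") == b"..."
```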
Now, another criterion that we've proposed for supercloud is a superPaaS layer, to create a common developer experience and an enabler for ecosystem partners to monetize. Please play the clip; let's listen. >> We built it as a custom build, because, as you said, what exists in one cloud might not exist in another cloud provider. So we have to build all these components that modern applications need, and that goes from machine learning, as I said, to transactional and analytical systems, the entire thing, such that they can run in isolation, basically. >> And the objective is that the developer experience will be identical across those clouds? >> Yes, the developer doesn't need to worry about the cloud provider. And actually, our system, we didn't talk about it, but the marketplace that we have, which actually allows you to deliver... >> We're getting there. >> Yeah, okay. >> Now, we're not going to go deep into ecosystem today; we've talked about Snowflake's strengths in this regard. But Snowflake pretty much ticked all the boxes on our supercloud attributes and definition. We asked Benoit Dageville to confirm that this is all shipping and available today, and he also gave us a glimpse of the future. Play the clip. >> And we are still developing it. The transactional Unistore, as we call it, was announced at last Summit, so they are still working on it. But that's the vision, and that's important, because we talk about the infrastructure, right? You mentioned a lot about storage and compute, but it's not only that. When you think about applications, they need to use a transactional database, they need to use an analytical system, they need to use machine learning. So you need to provide all these services too, consistent across all the cloud providers. >> So you can hear Dageville talking about expanding beyond taking advantage of the core infrastructure, storage and networking, et cetera, and bringing intelligence to the data through machine learning and AI. So of course, there's more to come, and there better be, at this company's valuation, despite the recent sharp pullback in a tightening Fed environment. Okay, so I know it's cliche, but everyone's comparing Snowflake and Databricks. Databricks has been pretty vocal about its open source posture compared to Snowflake's, and it just so happens that we had Ali Ghodsi on at Supercloud 22 as well. He wasn't in studio; he had to do it remote, because I guess he was presenting at an investor conference that week, so we had to bring him in remotely. Now, I didn't get to do this interview, John Furrier did, but I listened to it and captured this clip about how Databricks sees supercloud and the importance of open source. Take a listen to Ghodsi. >> Yeah, I mean, let me start by saying we're just big fans of open source. We think that open source is a force in software that's going to continue for decades, hundreds of years, and it's going to slowly replace all proprietary code in its way. We saw that it could do that with the most advanced technology. Windows, a proprietary operating system, very complicated, got replaced with Linux. So open source can pretty much do anything. And what we're seeing with the data lakehouse is that slowly, the open source community is building a replacement for the proprietary data warehouse, data lake, machine learning, real-time stack, in open source, and we're excited to be part of it. For us, Delta Lake is a very important project that really helps you standardize how you lay out your data in the cloud, and with it comes a really important protocol called Delta Sharing, that enables you, in an open way, actually for the first time ever, to share large data sets between organizations. But it uses an open protocol. So the great thing about that is, you don't need to be a Databricks customer. You don't even need to like Databricks. You just need to use this open source project, and you can now securely share data sets between organizations, across clouds. And it actually does so really efficiently, just one copy of the data, so you don't have to copy it if you're within the same cloud. >> So the implication of Ali Ghodsi's comments is that Databricks, with Delta Sharing, as John implied, is playing a long game.
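To ground the Delta Sharing point, here is a brief sketch of the consumer side using the open source delta-sharing Python client. The profile file and the share, schema, and table names are illustrative assumptions; in practice the data provider sends you the profile file, and no Databricks account is required:

```python
import delta_sharing

# Consumer-side sketch of the open Delta Sharing protocol. "orders.share"
# is a hypothetical credential profile the data provider would send you;
# the share/schema/table names are also made up.
profile = "orders.share"

client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())   # discover what has been shared with you

# Load one shared table straight into pandas -- one copy of the data,
# served over an open protocol, no warehouse of your own required.
df = delta_sharing.load_as_pandas(f"{profile}#sales_share.retail.orders")
print(df.head())
```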
Now, I don't know enough about the Databricks architecture to comment in detail; I've got to do more research there. So I reached out to my two analyst friends, Tony Baer and Sanjeev Mohan, to see what they thought, because they cover these companies pretty closely. Here's what Tony Baer said, quote: I've viewed the divergent lakehouse strategies of Databricks and Snowflake in the context of their roots. Prior to Delta Lake, Databricks' prime focus was the compute, not the storage layer, and more specifically, they were a compute engine, not a database. Snowflake approached from the opposite end of the pool, as they originally fit the mold of the classic database company, rather than a specific compute engine per se. The lakehouse pushes both companies outside of their original comfort zones: Databricks to storage, Snowflake to compute engine. So it makes perfect sense for Databricks to embrace the open source narrative at the storage layer, and for Snowflake to continue its walled-garden approach. But in the long run, their strategies are already overlapping. Databricks is not a 100% open source company; its practitioner experience has always been proprietary, and now so is its SQL query engine. Likewise, Snowflake has had to open up with the support of Iceberg as an open data lake format. The question really becomes how serious Snowflake will be in making Iceberg a first-class citizen in its environment, that is, not necessarily officially branding it a lakehouse, but effectively making it one. And likewise, can Databricks deliver the service levels associated with walled gardens through a more brute-force approach that relies heavily on the query engine? At the end of the day, those are the key requirements that will matter to Databricks and Snowflake customers. End quote. That was some deep thought by Tony; thank you for that. Sanjeev Mohan added the following, quote: Open source is a slippery slope. People buy mobile phones based on open source Android, but it's not fully open. Similarly, Databricks' Delta Lake was not originally fully open source, and even today its Photon execution engine is not. We are always going to live in a hybrid world. Snowflake and Databricks will support whatever model works best for them and their customers. The big question is, do customers care as deeply about which vendor has a higher degree of openness as we technology people do? I believe customers' evaluation criteria are far more nuanced than just deciphering each vendor's open source claims. End quote. Okay, so I had to ask Dageville about their so-called walled-garden approach and what their strategy is with Apache Iceberg. Here's what he said. >> Iceberg is very important. So, just to give some context, Iceberg is an open table format, which was first developed by Netflix, and Netflix put it open source in the Apache community. We embrace that open source standard because it's widely used by many companies. Also, many companies have really invested a lot of effort in building big data Hadoop solutions or data lake solutions, and they want to use Snowflake, and they couldn't really use Snowflake, because all their data were in open formats. So we are embracing Iceberg to help these companies move to the cloud. But why have we been reluctant about direct access to data? Direct access to data is a little bit of a problem for us, and the reason is: when you have direct access to data, now you have direct access to storage, and now you have to understand, for example, the specificity of one cloud versus the other. So as soon as you start to have direct access to data, you lose your cloud-agnostic layer. You don't access data with an API. When you have direct access to data, it's very hard to secure, because you need to grant direct access to tools which are not protected, and you see a lot of hacking of data because of that. So direct access to data was not serving our customers well, and that's why we have been reluctant to do it: it's not cloud-agnostic, and you need a lot of intelligence, whereas with API access, we want open APIs. That's, I guess, the way we embrace openness: by open APIs, versus direct access to data.
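For reference, here is a hedged sketch of what querying an Apache Iceberg table from Spark looks like. The catalog name, warehouse path, and table are assumptions, and it presumes the Iceberg Spark runtime package is on the classpath; Snowflake's own Iceberg support goes through its SQL surface rather than this route:

```python
from pyspark.sql import SparkSession

# A hedged sketch of querying an Apache Iceberg table from Spark. The
# catalog name, warehouse path, and table are assumptions, and the Iceberg
# Spark runtime package must be on the classpath.
spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Because Iceberg is an open table spec, the same files are readable by
# Spark, Trino, Flink, or Snowflake's Iceberg support, without conversion
# into any one vendor's format.
spark.sql("SELECT count(*) FROM lake.db.events").show()
```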
Here's my take: Snowflake is hedging its bets. Because enough people care about open source, they have to have some open data format options, and it's good optics. And you heard Benoit Dageville talk about the risks of directly accessing the data and the complexities it brings. Now, is that maybe a little FUD against Databricks? Maybe. But the same can be said for Ali's comments, maybe FUDing the proprietary-ness of Snowflake. As both analysts pointed out, though, open is a spectrum. Hey, I remember when Unix used to equal open systems. Okay, let's end with some ETR spending data, and why not compare Snowflake's and Databricks' spending profiles? This is an XY graph, with net score, or spending momentum, on the Y axis, and pervasiveness, or overlap in the data set, on the X axis. This is data from the January survey, when Snowflake was holding above an 80% net score, off the charts; Databricks was also very strong, in the upper 60s. Now let's fast-forward to this next chart and show you the July ETR survey data, and you can see Snowflake has come back down to earth. Now, remember, anything above a 40% net score is highly elevated, so both companies are doing well.
But Snowflake is well off its highs, and Databricks has come down somewhat as well. Databricks is inching to the right; Snowflake rocketed to the right post its IPO, and as we know, Databricks wasn't able to get to IPO during the COVID bubble. Ali Ghodsi is at the Morgan Stanley CEO conference this week. They've got plenty of cash to withstand a long-term recession, I'm told, and they've started to message that they're at a billion dollars in annualized revenue. I'm not sure exactly what that means. I've seen some numbers on their gross margins, and I've seen some numbers on their net revenue retention; again, I'll reserve judgment until we see an S-1. But it's clear both of these companies have momentum, and they're out competing in the market, which, as always, will be the ultimate arbiter. Different philosophies, perhaps. Is it like Democrats and Republicans? Well, it could be, but they're both going after solving the data problem. Both companies are trying to help customers get more value out of their data, and both companies are highly valued, so they have to perform for their investors. To paraphrase Ralph Nader, the similarities may be greater than the differences. Okay, that's it for today. Thanks to the team from Palo Alto for this awesome Supercloud studio build. Alex Myerson and Ken Shiffman are on production in the Palo Alto studios today. Kristin Martin and Cheryl Knight get the word out to our community. Rob Hoff is our editor-in-chief over at SiliconANGLE. Thanks to all. Please check out ETR.ai for all the survey data. Remember, these episodes are all available as podcasts; wherever you listen, just search Breaking Analysis Podcast. I publish each week on wikibon.com and siliconangle.com, and you can email me at david.vellante@siliconangle.com, DM me @dvellante, or comment on my LinkedIn posts. And please, as I say, ETR has some of the best survey data in the business; we track it every quarter, and we're really excited to be partners with them. This is Dave Vellante for theCUBE Insights, powered by ETR. Thanks for watching, and we'll see you next time on Breaking Analysis.
Ali Ghodsi, Databricks | Supercloud22
(light hearted music) >> Okay, welcome back to Supercloud '22. I'm John Furrier, host of theCUBE. We've got Ali Ghodsi here, co-founder and CEO of Databricks. Ali, great to see you. Thanks for spending your valuable time to come on and talk about Supercloud and the future of all the structural change that's happening in cloud computing. >> My pleasure, thanks for having me. >> Well, first of all, congratulations. We've been talking for many, many years, and I still go back to the video that we have in the archive of you talking about cloud, really at the beginning of the big reboot, what I called the post-Hadoop revitalization of data. Congratulations: you've been cloud-first, now on multiple clouds. Congratulations to you and your team for achieving what looks like a billion dollars in annualized revenue, as reported by the Wall Street Journal. So first, congratulations. >> Thank you so much, appreciate it. >> So I was talking to some young developers, and I asked, in a random poll, what do you think about Databricks? Oh, we love those guys; they're AI- and ML-native, and that's their advantage over the competition. So I pressed on why. I don't think they knew why, but that's an interesting perspective. This idea of cloud-native, AI/ML-native, MLOps: this has been a big trend, and it's continuing. This is a big part of how this structural change is happening. How do you react to that? And how do you see Databricks evolving into this new supercloud-like multi-cloud environment? >> Yeah, look, I think it's a continuum. It starts with having data, and you want to clean it and get insights out of it. But then, eventually, you'd like to start asking questions, doing reports: maybe asking, what was my revenue yesterday, last week? But soon you want to start using the crystal ball, predictive technology. Okay, but what will my revenue be next week? Next quarter? Who's going to churn? And if you can finally automate that completely, so that you can act on the predictions, right? So this credit card that got swiped, the AI thinks it's fraud, we're going to deny it. That's when you get real value. So we're trying to help all these organizations move through this data and AI maturity curve, all the way to that prescriptive, automated AI machine learning. That's when you get real competitive advantage. And we saw that with the FAANGs, right? I mean, Google wouldn't be here today if it wasn't for AI. You know, we'd be using AltaVista or something. We want to help all organizations be able to leverage data and AI the way the FAANGs did. >> One of the things we're looking at with supercloud, and why we call it supercloud versus other things like multi-cloud, is that today, a lot of the successful companies that started in the cloud have been successful, but have realized, and even enterprises who have gotten there by accident and maybe have done nothing deliberate with cloud, just have some cloud projects on multiple clouds. So people have multiple-cloud operational things going on, but it hasn't necessarily been a strategy per se.
It's been more of a default reaction to things. But the ones that are innovating have been successful in one native cloud, because the use cases that drove that got scale, got value, and then they're making that super by bringing it on premises, putting in a modern data stack for modern application development, and dealing with the things that you guys are in the middle of with Databricks. That is where the action is, and they don't want to lose the trajectory and all the economies of scale. So we're seeing another structural change, where the evolutionary nature of the cloud has solved a bunch of use cases, but now other use cases are emerging, on premises and at the edge, that have been driven by applications, because of the developer boom that's happening. You guys are in the middle of it. What is happening with this structural change? Are people looking for the modern data stack? Are they looking for more AI? What's your perspective on this supercloud kind of position? >> Look, it started with customers being on multiple clouds, right? So multi-cloud has been a thing. It became a thing: 70, 80% of our customers, when you ask them, are on more than one cloud. But then they soon start realizing that, hey, if I'm on multiple clouds, this data stuff is hard enough as it is. Do I want to redo it again and again, with different proprietary technologies, on each of the clouds? And that's when they start thinking about, let's standardize this, let's figure out a way which just works across them. That's where I think open source comes in and becomes really important. Hey, can we leverage open standards? Because then we can make it work in these different environments, as we said, so that we can actually go super, as you said. That's one. The second thing is, can we simplify it? I think today the data landscape is complicated. Conceptually it's simple: you have data, which is essentially customer data, maybe employee data, and you want to get some kind of insights from it. But how you do that is very complicated. You have to buy a data warehouse, hire data analysts. You have to store stuff in the data lake and get your data engineers. If you want a streaming, real-time thing, that's another completely different set of technologies you have to buy. And then you have to stitch all these together, and do it again and again on every cloud. So they just want simplification. That's why we're big believers in this Delta Lakehouse concept, which is an open standard for simplifying this data stack and helping people just get value out of their data in any environment, so they can do that in this sort of supercloud, as you call it. >> You know, we've been talking about that in previous interviews: do the heavy lifting, let them get the value. I have to ask you about how you see that going forward, because if I'm a customer, I have a lot of operational challenges. The developers are kicking butt right now, we see that clearly, and open source keeps growing and continues to be great. But the ops and security teams really care about this stuff, and most companies don't want to spin up multiple ops teams to deal with different stacks. This is one big problem, I think, that's leading into the multi-cloud viability. How do you guys deal with that? How do you talk to customers when they say, I want to have fewer complications in operations? >> Yeah, you're absolutely right.
You know, it's easy for a developer to adopt all these technologies, and new things are coming out all the time. The ops teams are the ones that have to make sure this works. Doing that in multiple different environments is super hard, especially when there's a proprietary stack in each environment that's different. So they just want standardization. They want open source; that's super important. We hear that all the time from them: they want open source technologies. They believe in the communities around them, and they know that the source code is open, so you can also see if there are issues with it, if there are security breaches, those kinds of things, and they can have a community around it and actually leverage that. So they're the ones that are really pushing this, and we're seeing it across the board. It starts first with the digital natives, but slowly it's also now percolating to the other organizations; we're hearing it across the board. >> Where are we, Ali, on the innovation strategies for customers? Where are they on the trajectory around how they're building out their teams? How are they looking at open source? How are they extending the value proposition of Databricks, and data at scale, as they start to build out their teams and operations? Because some have kind of a starting, crawl-walk-run vibe, and some are big companies dealing with data all the time. Where are they in their journey? What are the core issues that they're solving? What are some of the use cases that you see that are most pressing in customers? >> Yeah, what I've seen that's really exciting about this Delta Lakehouse concept is that we're now seeing a lot of use cases around real time. So real-time fraud detection, real-time stock ticker pricing: anyone that's doing trading wants that to work in real time. Lots of use cases around that. Lots of use cases around, how do we, in real time, drive more engagement on our web assets if we're a media company? We have all these assets; how do we get people to engage, stay on our sites, continue engaging with the material we have? Those are real-time use cases. And the interesting thing is, they're real time, so it's really important that you don't recommend to someone, hey, you should go check out this restaurant, if they just came from that restaurant half an hour ago. But also, it's all based on machine learning. A lot of this is trying to predict what you want to see, what you want to do, whether it's fraudulent. And that's also interesting, because basically more and more machine learning is coming in. So that's super exciting to see: the combination of real time and machine learning on the lakehouse. And finally, I would say the lakehouse is really important for this, because that's where the data is flowing in. If they have to take that data that's flowing into the lake and actually copy it into a separate warehouse, that delays the real-time use cases, and then they can't hit those real-time deadlines. So that's another catalyst for this lakehouse pattern.
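As a sketch of the real-time-plus-ML pattern Ghodsi describes, here is a minimal PySpark Structured Streaming job that scores events as they stream off a Delta table, instead of copying them into a warehouse first. The paths, columns, and the rule-based stand-in for a real ML model are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F

# Score card swipes as they stream off the lakehouse, rather than copying
# them into a separate warehouse first. Paths, columns, and the rule-based
# stand-in for a trained model are illustrative.
spark = SparkSession.builder.appName("realtime-scoring-sketch").getOrCreate()

swipes = (
    spark.readStream.format("delta")      # stream new rows straight off the lake
    .load("/lakehouse/card_swipes")
)

# Stand-in for a real model (e.g., one loaded from a model registry):
# flag swipes that are large and far from the cardholder's usual region.
scored = swipes.withColumn(
    "fraud_flag",
    (F.col("amount") > 5000) & (F.col("distance_km") > 500),
)

(
    scored.writeStream.format("delta")
    .option("checkpointLocation", "/lakehouse/_chk/fraud")
    .start("/lakehouse/scored_swipes")    # decisions land back in the lakehouse
)
```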
>> Would that be an example of how the metrics are changing? Because I've been looking at some people saying, well, you can tell if someone's doing well by whether there's a lot of data being transferred. And then I was saying, well, wait a minute: data transfer costs money, right? And time. So this is an interesting dynamic: in a way, you don't want to have a lot of movement, right? >> Yeah, movement actually decreases for a lot of these real-time use cases. Because what we saw in the past was that they would run batch processing over all the data. But actually, if you look at what has changed since the data you had yesterday, it's actually not that much. So if you can incrementally process it in real time, you can reduce the cost of transfers and storage and processing. So that's actually a great point. That's also one of the main things we're seeing with these use cases: the bill shrinks, the cost goes down, and they can process less.
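And here is a hedged sketch of that incremental alternative to batch reprocessing, using Delta Lake's change data feed. The table path and starting version are assumptions, and the table would need to have been created with the change data feed enabled:

```python
from pyspark.sql import SparkSession

# Read only what changed since the last run, via Delta Lake's change data
# feed, instead of reprocessing the whole table. Assumes the table was
# created with delta.enableChangeDataFeed = true; path and version are
# made up.
spark = SparkSession.builder.appName("incremental-sketch").getOrCreate()

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 42)   # hypothetical: last version processed
    .load("/lakehouse/card_swipes")
)

# Typically a small fraction of the table, so transfer, storage, and
# compute costs all shrink.
changes.groupBy("_change_type").count().show()
```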
>> I totally agree, by the way. I was just talking to somebody the other day, and someone made the comment, the software industry is open source. There's no more software industry, it's called open source. It's the integrations that become interesting. And I was thinking, integrations are now really where the action is. And we had a panel with the Clouderati, we called it, the people who have been around for a long time. And it was called the innovator's dilemma. And one of the comments was, it's the integrator's dilemma, not the innovator's dilemma. And this is a big part of this piece of supercloud. Can you share your thoughts on how cloud and integration need to be tightened up to really make it super? >> Actually, that's a great point. I think the beauty of this is, look, the ecosystem of data today is vast. There's this picture that someone puts together every year of all the different vendors and how they relate, and it gets bigger and bigger and messier and messier. So we see customers use all kinds of different aspects of what's existing in the ecosystem, and they want it to be integrated with whatever you're selling them. And that's where I think the power of open source comes in. With open source, you get integrations that people will do without you having to push them. So us, Databricks, as a vendor, we don't have to go tell people, please integrate with Databricks. The open source technology that we contribute to, automatically, people are integrating with it. Delta Lake has integrations with lots of different software out there, and Databricks as a company doesn't have to push that. So I think open source is also another thing that really helps with the ecosystem integrations. Many of these companies in this data space actually have employees that are full-time dedicated to making sure their software works well with Spark, making sure their software works well with Delta, and they contribute back to that community. And that's the way you get this sort of ecosystem to further flourish. >> Well, I really appreciate your time. My final question for you is, as we kind of unpack and shape and frame supercloud for the future, how would you see a roadmap, or architecture, or outcome for companies that are going to clearly be in the cloud, where open source is going to be dominating, integrations have got to be seamless and frictionless, and abstraction layers make things super easy and take away the complexity? What is supercloud to them? What does the outcome look like? How would you define a supercloud environment for an enterprise? >> Yeah, for me, it's the simplification that you get when you standardize on open source. You get your data in one place, in one format, in one standardized way, and then you can get your insights from it without having to buy lots of different idiosyncratic proprietary software from different vendors that's different in each environment. So it's this slow standardization that's happening, and I think it's going to happen faster than we think. And I think in a couple of years it's going to be a requirement: does your software work in all these different environments? Is it based on open source? Is it using this Delta Lakehouse pattern? And if it's not, I think they're going to demand it. >> Yeah, I feel like we're close to some sort of de facto standard coming, and you guys are a big part of it. Once that clicks in, it's going to highly accelerate in the open, and I think it's going to be super valuable. Ali, thank you so much for your time, and congratulations to you and your team. We've been following you guys since the beginning, remember the early days, and look how far it's come. And again, you guys are really making a big difference in making a super cool environment out there. Thanks for coming on and sharing. >> Thank you so much, John. >> Okay, this is Supercloud 22.
I'm John Furrier. Stay with us for more coverage and more commentary after this break. (lighthearted music)
Data Power Panel V3
(upbeat music) >> The stampede to cloud and massive VC investments have led to the emergence of a new generation of object store based data lakes. And with them, two important trends, actually three important trends. First, a new category that combines data lakes and data warehouses, aka the lakehouse, has emerged as a leading contender to be the data platform of the future. And this novelty touts the ability to address data engineering, data science, and data warehouse workloads on a single shared data platform. The other major trend we've seen is that query engines and broader data fabric virtualization platforms have embraced next-gen data lakes as platforms for SQL centric business intelligence workloads, reducing, or some even claim eliminating, the need for separate data warehouses. Pretty bold. However, cloud data warehouses have added complementary technologies to bridge the gaps with lakehouses. And the third is, many, if not most, customers that are embracing the so-called data fabric or data mesh architectures are looking at data lakes as a fundamental component of their strategies, and they're trying to evolve them to be more capable, hence the interest in lakehouse, but at the same time, they don't want to, or can't, abandon their data warehouse estate. As such, we see a battle royale brewing between cloud data warehouses and cloud lakehouses. Is it possible to do it all with one central cloud analytical data platform? Well, we're going to find out. My name is Dave Vellante, and welcome to the data platforms power panel on theCUBE, our next episode in a series where we gather some of the industry's top analysts to talk about one of our favorite topics, data. In today's session, we'll discuss trends, emerging options, and the trade-offs of various approaches, and we'll name names. Joining us today are Sanjeev Mohan, who's the principal at SanjMo, Tony Baer, principal at dbInsight, and Doug Henschen, who is the vice president and principal analyst at Constellation Research. Guys, welcome back to theCUBE. Great to see you again. >> Thank you. >> Thank you. >> So it's early June, and we're gearing up for two major conferences. There are several database conferences, but two in particular that we're very interested in: Snowflake Summit and Databricks Data and AI Summit. Doug, let's start off with you, and then Tony and Sanjeev, if you could kindly weigh in. Where did this all start, Doug, the notion of lakehouse? And let's talk about what exactly we mean by lakehouse. Go ahead. >> Yeah, well, you nailed it in your intro. One platform to address BI, data science, data engineering. Fewer platforms, less cost, less complexity, very compelling. You can credit Databricks for coining the term lakehouse back in 2020, but it's really a much older idea. You can go back to Cloudera introducing their Impala database in 2012. That was a database on top of Hadoop. And indeed, by the middle of that last decade, there were several SQL on Hadoop products and open standards like Apache Drill. And at the same time, the database vendors were trying to respond to this interest in machine learning and data science, so they were adding SQL extensions; the likes of Hudi and Vertica were adding SQL extensions to support data science. But then later in that decade, with the shift to cloud and object storage, you saw the vendors shift to this whole cloud and object storage idea. So you had, in the database camp, Snowflake introduce Snowpark to try to address the data science needs.
They introduced that in 2020, and last year they announced support for Python. You also had Oracle and SAP jump on this lakehouse idea last year, supporting both the lake and the warehouse from a single vendor, though not necessarily quite a single platform. Google very recently also jumped on the bandwagon. And then you also mentioned the SQL engine camp, the Dremios, the Ahanas, the Starbursts, really doing two things: a fabric for distributed access to many data sources, but also very firmly planting that idea that you can just have the lake and we'll help you do the BI workloads on that. And then of course, the data lake camp, with the Databricks and Clouderas providing warehouse-style deployments on top of their lake platforms. >> Okay, thanks, Doug. I'd be remiss, those of you who know me know that I typically write my own intros. This time my colleagues fed me a lot of that material, so thank you, you guys make it easy. But Tony, give us your thoughts on this intro. >> Right. Well, I very much agree with both of you, which may not make for the most exciting television, in that it has been an evolution, just like Doug said. I mean, for instance, just to give an example, when Teradata bought Aster Data, it was initially seen as a hardware platform play. In the end, it was basically all those Aster functions that made a lot of sort of big data analytics accessible to SQL. (clears throat) And so what I really see, just in a more simple or functional definition, is that the data lakehouse is really an attempt by the data lake folks to make the data lake friendlier territory to the SQL folks, and also to get into friendly territory with all the data stewards, who are basically concerned about the sprawl and the lack of control and governance in the data lake. So it's really kind of a continuation of an ongoing trend. That being said, there's no action without counteraction, and of course, at the other end of the spectrum, we also see a lot of the data warehouses starting to add things like in-database machine learning. So they're certainly not surrendering without a fight. Again, as Doug was mentioning, this has been part of a continual blending of platforms that we've seen over the years, that we first saw in the Hadoop years with SQL on Hadoop and data warehouses starting to reach out to cloud storage, or I should say HDFS, and then with the cloud, going cloud native and therefore trying to break the silos down even further. >> Tony, thank you. And Sanjeev, data lakes, when we first heard about them, they were such a compelling name, and then we realized all the problems associated with them. So pick it up from there. What would you add to Doug and Tony? >> I would say these are excellent points that Doug and Tony have brought to light. The concept of the lakehouse was going on, to your point, Dave, a long time ago, long before the term was invented. For example, Uber was trying to do a mix of Hadoop and Vertica, because what they really needed were transactional capabilities that Hadoop did not have. So they weren't calling it the lakehouse, they were using multiple technologies, but now they're able to collapse it into a single data store that we call the lakehouse. Data lakes are excellent at batch processing large volumes of data, but they don't have the real time capabilities, such as change data capture, or doing inserts and updates. So this is why the lakehouse has become so important: it gives us these transactional capabilities.
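As a concrete illustration of the transactional capability Sanjeev is pointing at, here is a minimal sketch of an upsert (MERGE) applied directly to files in the lake with Delta Lake's Python API. The paths, table, and column names are illustrative only.

```python
# Sketch: an ACID upsert (MERGE) on a Delta table sitting on ordinary
# object storage; plain data lakes historically couldn't do this.
# Paths and columns are hypothetical; a Delta-enabled Spark session is assumed.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "/lake/customers")
updates = spark.read.format("json").load("/lake/incoming/customer_changes")

# Update matching rows, insert new ones, all in one atomic transaction.
(target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```

Change data capture and time travel ride on the same transaction log that makes this MERGE atomic.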
>> Great. So I'm interested, the name is great, lakehouse. The concept is powerful, but I get concerned that there's a lot of marketing hype behind it. So I want to examine that a bit deeper. How mature is the concept of the lakehouse? Are there practical examples that really exist in the real world that are driving business results for practitioners? Tony, maybe you could kick that off. >> Well, put it this way. I think what's interesting is that both data lakes and data warehouses have each had to extend themselves. To believe the Databricks hype, this was just a natural extension of the data lake. In point of fact, Databricks had to go outside its core technology of Spark to make the lakehouse possible. And it's a very similar type of thing on the part of the data warehouse folks, in that they've had to go beyond SQL. In the case of Databricks, there have been a number of incremental improvements to Delta Lake, to basically make the table format more performant, for instance. But the other thing, and I think the most dramatic change in all that, is in their SQL engine: they had to essentially pretty much abandon Spark SQL, because in and of itself, Spark SQL is essentially a stopgap solution. And if they wanted to really address that crowd, they had to totally reinvent SQL, or at least their SQL engine. And so Databricks SQL is not Spark SQL. It is not Spark. It's basically SQL that's been adapted to run in a Spark environment, but the underlying engine is C++, it's not Scala or anything like that. So Databricks had to take a major detour outside of its core platform to do this. So to answer your question, this is not mature, because even though the idea of blending platforms has been going on for well over a decade, I would say that the current iteration is still fairly immature. And in the cloud, I could see a further evolution of this, because if you think through cloud native architecture, where you're essentially abstracting compute from data, there is no reason why, if, let's say, you're dealing with the same data targets, say cloud object storage, you might not apportion the tasks to different compute engines. And so therefore you could have, for instance, let's say you're Google: you could have BigQuery perform the types of SQL analytics that would be associated with the data warehouse, and you could have BigQuery ML do some in-database machine learning, but at the same time, for another part of the query, which might involve, let's say, some deep learning, just for example, you might go out to, let's say, the serverless Spark service, or Dataproc. And there's no reason why Google could not blend all those into a coherent offering that's basically all triggered through microservices. And I just gave Google as an example; you could generalize that to all the other clouds and all the other third-party vendors. So I think we're still very early in the game in terms of the maturity of data lakehouses. >> Thanks, Tony. So Sanjeev, is this all hype? What are your thoughts? >> It's not hype, but I completely agree, it's not mature yet. Lakehouses still have a lot of work to do, so what I'm now starting to see is that the world is dividing into two camps. On one hand, there are people who don't want to deal with the operational aspects of vast amounts of data.
They are the ones who are going for BigQuery, Redshift, Snowflake, Synapse, and so on, because they want the platform to handle all the data modeling, access control, and performance enhancements. But there are trade-offs: if you go with these platforms, then you are giving up on vendor neutrality. On the other side are those who have engineering skills. They want independence. In other words, they don't want vendor lock-in. They want to transform their data into any number of use cases, especially data science and machine learning use cases. What they want is agility, via open file formats, using any compute engine. So why do I say lakehouses are not mature? Well, cloud data warehouses provide you an excellent user experience. That is the main reason why Snowflake took off. If you have thousands of files, it takes minutes to get them uploaded into your warehouse and start experimenting. Table formats are resonating far more with the community than file formats. But once the cost of the cloud data warehouse goes up, then organizations start exploring lakehouses. But the problem is, lakehouses still need to do a lot of work on metadata. Apache Hive was a fantastic first attempt at it; even today, Apache Hive is still very strong, but it's all technical metadata, and it has so many different restrictions. That's why we see Databricks investing in something called Unity Catalog. Hopefully we'll hear more about Unity Catalog at the end of the month. But there's a second problem I just want to mention, and that is lack of standards. All these open source vendors, they're running what I call ego projects. You see them on LinkedIn, constantly battling with each other, but the end user doesn't care. The end user wants a problem to be solved. They want to use Trino, Dremio, Spark from EMR, Databricks, Ahana, DaaS, Flink, Athena. But the problem is that we don't have common standards. >> Right, thanks. So Doug, I worry sometimes. I mean, I look at the space; we've debated for years best of breed versus the full suite. You see AWS with whatever, 12-plus different data stores and different APIs and primitives. You've got Oracle putting everything into its database; it's actually done some interesting things with MySQL HeatWave, so maybe there are proof points there. But Snowflake, really good at data warehouse, simplifying the data warehouse; Databricks, really good at making lakehouses actually more functional. Can one platform do it all? >> Well, in a word, no. You can't be best of breed at all things. I think, to the upshot of the cogent analysis from Sanjeev there, the vendors coming out of the database tradition excel at SQL. They're extending it into data science, but when it comes to unstructured data, data science, ML, AI, it's often a compromise. The data lake crowd, the Databricks and such, they've struggled to completely displace the data warehouse when it really gets to the tough SLAs; they acknowledge that there's still a role for the warehouse. Maybe you can size down the warehouse and offload some of the BI workloads, and maybe use some of these SQL engines, good for ad hoc queries, to minimize data movement. But really, when you get to the deep service level requirements, the high concurrency, the high query workloads, you end up creating something that's warehouse-like. >> Where do you guys think this market is headed? What's going to take hold? Which projects are going to fade away? You've got some things in Apache projects, like Hudi and Iceberg. Where do they fit, Sanjeev?
Do you have any thoughts on that? >> So thank you, Dave. I feel that table formats are starting to mature. There is a lot of work that's being done. We will not have a single product or single platform; we'll have a mixture. So I see a lot of Apache Iceberg in the news. Apache Iceberg is really innovating; their focus is on a table format. But then Delta and Apache Hudi are doing a lot of deep engineering work. For example, how do you handle high concurrency when there are multiple writes going on? Do you version your Parquet files, and how do you do your upserts, basically? So, different focuses. At the end of the day, the end user will decide what is the right platform, but we are going to have multiple formats living with us for a long time. >> Doug, is Iceberg, in your view, something that's going to address some of those gaps in standards that Sanjeev was talking about earlier? >> Yeah, Delta Lake, Hudi, Iceberg, they all address this need for consistency and scalability. Delta Lake is open technically, open for access, but I don't hear about Delta Lake anywhere but Databricks; I'm hearing a lot of buzz about Apache Iceberg. End users want an open performance standard. And most recently, Google embraced Iceberg for its recent BigLake, their stab at supporting both lakes and warehouses on one conjoined platform. >> And Tony, of course, you remember the early days of the sort of big data movement: you had MapR, the most closed; you had Hortonworks, the most open; you had Cloudera in between. There was always this kind of contest as to who's the most open. Does that matter? Are we going to see a repeat of that here? >> I think it's spheres of influence, and Doug very much was kind of referring to this. I would call it kind of like the MongoDB syndrome, which is that you have, and I'm talking about MongoDB before they changed their license, an open source project, but very much associated with MongoDB, which basically controlled most of the contributions and made the decisions. And I think Databricks has the same ironclad hold on Delta Lake; the market still pretty much associates Delta Lake as the Databricks open source project. I mean, Iceberg is probably further advanced than Hudi in terms of mindshare. And so what I see that breaking down to is essentially the Databricks open source versus the everything else open source, the community open source. So I see a very similar type of breakdown repeating itself here. >> So by the way, Mongo has a conference next week, another data platform, which is kind of not totally relevant to this discussion. But in a sense it is, because there's been a lot of discussion on earnings calls these last couple of weeks about consumption and who's exposed. Obviously people are concerned about Snowflake's consumption model; Mongo is maybe less exposed because Atlas is prominent in the portfolio, blah, blah, blah. But I wanted to bring up the little bit of controversy that we saw come out of the Snowflake earnings call, where the Evercore analyst asked Frank Slootman about discretionary spend. And Frank basically said, look, we're not discretionary, we are deeply operationalized. Whereas he kind of poo-pooed the lakehouse, or the data lake, et cetera, saying, oh yeah, data scientists will pull files out and play with them, that's really not our business. Do any of you have comments on that? Help us sort through that controversy. Who wants to take that one? >> Let's put it this way.
The SQL folks are from Venus and the data scientists are from Mars. So it really comes down to that type of perception. The fact is that, traditionally with analytics, it was very SQL oriented, and basically the quants were kind of off in their corner, where they were using SAS or where they were using Teradata. It's really a great leveler today, which is that, I mean, basically Python has become arguably one of the most popular programming languages, depending on what month you're looking at the TIOBE index. And of course, obviously SQL, as I tell the MongoDB folks, SQL is not going away. You have a large skills base out there. And so basically I see this breaking down to, essentially, you're going to have each group with its own natural preferences for its home turf. And the fact that, let's say, the Python and Scala folks are using Databricks does not make them any less operational or mission critical than the SQL folks. >> Anybody else want to chime in on that one? >> Yeah, I totally agree with that. Python support in Snowflake is very nascent, with all of Snowpark; for all of the things outside of SQL, they're very much relying on partners to make things possible, to make data science possible. And it's very early days. I think the bottom line, what we're going to see, is each of these camps is going to keep working on doing better at the thing that they don't do today, or they're new to, but they're not going to nail it. They're not going to be best of breed on both sides. So the SQL centric companies and shops are going to do more data science on their database centric platforms. The data science driven companies might be doing more BI on their lakes with those vendors. And the companies that have highly distributed data are going to add fabrics, and maybe offload more of their BI onto those engines, like Dremio and Starburst. >> So I've asked you this before, but I'll ask you, Sanjeev, 'cause Snowflake and Databricks are such great examples: you have the data engineering crowd trying to go into data warehousing, and you have the data warehousing guys trying to go into the lake territory. Snowflake has $5 billion on the balance sheet, and I've asked you before, and I'll ask you again: doesn't there have to be a semantic layer between these two worlds? Does Snowflake go out and do M&A and maybe buy an AtScale or a Datameer? Or is that just sort of a bandaid? What are your thoughts on that, Sanjeev? >> I think the semantic layer is the metadata. The business metadata is extremely important. At the end of the day, the business folks would rather go to the business metadata than have to figure out, for example, let's say I want to update somebody's email address, and we have a lot of overhead with data residency laws and all that. I want my platform to give me the business metadata so I can write my business logic without having to worry about which database, which location. So having that semantic layer is extremely important. In fact, now we are taking it to the next level. Now we are saying it's not just a semantic layer, it's all my KPIs, all my calculations. So how can I make those calculations independent of the compute engine, independent of the BI tool, and make them fungible? So, more disaggregation of the stack, but it gives us more best of breed products that the customers have to worry about. >> So I want to ask you about the stack, the modern data stack, if you will.
And we always talk about injecting machine intelligence, AI, into applications, making them more data driven. But when you look at the application development stack, it's separate; the database tends to be separate from the data and analytics stack. Do those two worlds have to come together in the modern data world? And what does that look like organizationally? >> So organizationally, and even technically, I think it is starting to happen. Microservices architecture was a first attempt to bring the application and the data world together, but they are fundamentally different things. For example, if an application crashes, that's horrible, but Kubernetes will self-heal and it'll bring the application back up. But if a database crashes and corrupts your data, we have a huge problem. So that's why they have traditionally been two different stacks. They are starting to come together, especially with data ops, for instance, versioning of the way we write business logic. It used to be that business logic was highly embedded in our database of choice, but now we are disaggregating that using GitHub, CI/CD, the whole DevOps toolchain. So data is catching up to the way applications are built. >> We also have these translytical databases; that's a little bit of what the story is with MongoDB next week, with adding more analytical capabilities. But I think companies that talk about that are always careful to couch it as operational analytics, not the warehouse level workloads. So we're making progress, but I think there's always going to be, or there will long be, a separate analytical data platform. >> Until data mesh takes over. (all laughing) Not opening a can of worms. >> Well, but wait, I know it's out of scope here, but wouldn't data mesh say, hey, take your best of breed, to Doug's earlier point? You can't be best of breed at everything. Wouldn't data mesh advocate, data lakes, do your data lake thing; data warehouse, do your data warehouse thing; then you're just a node on the mesh? (Tony laughs) Now you need separate data stores and you need separate teams. >> To my point.
I think, I mean, put it this way. (laughs) Data mesh itself is a logical view of the world. The data mesh is not necessarily on the lake or on the warehouse. I think, for me, the fear there is more in terms of the silos of governance that could happen, and the siloed views of the world, and how we redefine that. And that's why, and I want to go back to something Sanjeev said, it's going to be raising the importance of the semantic layer. Now, that opens a couple of Pandora's boxes here, which is, one, does Snowflake dare go into that space, or do they risk basically alienating their partner ecosystem, which is a key part of their whole appeal, which is best of breed? They're kind of in the same situation Informatica was in in the early 2000s, when Informatica briefly flirted with analytic applications and realized that was not a good idea, and needed to double down on their core, which was data integration. The other thing, though, that this raises the importance of, and this is where best of breed comes in, is the data fabric. My contention is that, whether you employ the data mesh practice or not, if you do employ data mesh, you need data fabric; if you deploy data fabric, you don't necessarily need to practice data mesh. But data fabric at its core, and admittedly it's a category that's still very poorly defined and evolving, but at its core, we're talking about a common metadata backplane, something that we used to talk about with master data management. This would be something that would be more, what I would say, mutable, more evolving, basically using, let's say, machine learning, so that we don't have to predefine rules or predefine what the world looks like. So I think in the long run, what this really means is that whichever way we implement, on whichever physical platform we implement, we all need to be speaking the same metadata language. And I think at the end of the day, regardless of whether it's a lake, a warehouse, or a lakehouse, we need common metadata. >> Doug, can I come back to something you pointed out? Those talking about bringing analytical and transactional databases together, you had talked about operationalizing those and the caution there. Educate me on MySQL HeatWave. I was surprised when Oracle put so much effort into that, and you may or may not be familiar with it, but a lot of folks have talked about that. Now it's got nowhere in the market, no market share, but we've seen these benchmarks from Oracle. How real is that bringing together of those two worlds and eliminating ETL? >> Yeah, I have to defer on that one. That's my colleague, Holger Mueller. He wrote the report on that. He's way deep on it, and I'm not going to try to mimic him. >> I wonder how real that is, or if it's just Oracle marketing. Anybody have any thoughts on that? >> I'm pretty familiar with HeatWave. It's essentially Oracle doing, I mean, there's kind of a parallel with what Google's doing with AlloyDB. It's an operational database that will have some embedded analytics. And it's also something which I expect to start seeing with MongoDB. And I think, basically, Doug and Sanjeev were kind of referring to this before, it's basically the operational analytics that are embedded within an operational database. The idea here is that the last thing you want to do with an operational database is slow it down. So you're not going to be doing very complex deep learning or anything like that, but you might be doing things like classification, you might be doing some predictives. In other words, we've just concluded a transaction with this customer, but was it less than what we were expecting? What does that mean in terms of, is this customer likely to churn? I think we're going to be seeing a lot of that. And I think that's what a lot of MySQL HeatWave is all about. Whether Oracle has any presence in the market, now, it's still a pretty new announcement. But the other thing that kind of goes against Oracle, (laughs) that they have to battle against, is that even though they own MySQL and run the open source project, in terms of the actual commercial implementations, it's associated with everybody else. And the popular perception has been that MySQL has been basically kind of like a sidelight for Oracle. And so it's on Oracle's shoulders to prove that they're damn serious about it. >> There's no coincidence that MariaDB was launched the day that Oracle acquired Sun. Sanjeev, I wonder if we could come back to a topic that we discussed earlier, which is this notion of consumption. Obviously Wall Street's very concerned about it; Snowflake dropped prices last week.
I've always felt like, hey, the consumption model is the right model. I can dial it down when I need to; of course, the street freaks out. What are your thoughts on just pricing, the consumption model? What's the right model for companies, for customers? >> The consumption model is here to stay. What I would like to see, and I think it's an ideal situation, and it actually plays into the lakehouse concept, is that I have my data in some open format, maybe it's Parquet or CSV or JSON or Avro, and I can bring whatever engine is the best engine for my workloads, bring it on, pay for consumption, and then shut it down. And by the way, that could be Cloudera. We don't talk about Cloudera very much, but it could be that one business unit wants to use Athena, another business unit wants to use something else, Trino, let's say, or Dremio. So every business unit is working on the same data set, see, that's critical, but that data set is maybe in their VPC, and they bring any compute engine, pay for the use, shut it down. Then you're getting value and you're only paying for consumption. It's not like, oops, I left a cluster running by mistake, so there have to be guardrails. The reason FinOps is so big is because it's very easy for me to run a Cartesian join in the cloud and get a $10,000 bill.
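One concrete guardrail of the kind Sanjeev mentions is an auto-termination window on every cluster, so an idle cluster can't quietly run up a bill. A rough sketch against the Databricks Clusters REST API; the workspace host, token, runtime label, and node type are placeholders, and real values differ by workspace and cloud.

```python
# Sketch: create a cluster with an auto-termination window as a FinOps
# guardrail. Host, token, runtime, and node type are placeholders.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

spec = {
    "cluster_name": "adhoc-analytics",
    "spark_version": "11.3.x-scala2.12",   # illustrative runtime label
    "node_type_id": "i3.xlarge",           # illustrative node type
    "num_workers": 2,
    "autotermination_minutes": 30,         # idle clusters shut themselves down
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=spec,
)
resp.raise_for_status()
print("created cluster:", resp.json()["cluster_id"])
```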
>> This looks like it's been sort of a victim of its own success, in some ways. They made it so easy to spin up single node instances, multi node instances. And back in the day, when compute was scarce and costly, those database engines optimized every last bit so they could get as much workload as possible out of every instance. Today, it's really easy to spin up a new node, a new multi node cluster. So that freedom has meant many more nodes that aren't necessarily getting that utilization. So Snowflake has been doing a lot to add reporting, monitoring, and dashboards around the utilization of all the nodes and multi node instances that have been spun up. And meanwhile, we're seeing some of the traditional on-prem databases that are moving into the cloud trying to offer that freedom. And I think they're going to have that same discovery, that the cost surprises are going to follow as they make it easy to spin up new instances. >> Yeah, a lot of money went into this market over the last decade, separating compute from storage, moving to the cloud. I'm glad you mentioned Cloudera, Sanjeev, 'cause they got it all started, the kind of big data movement. We don't talk about them that much. Sometimes I wonder if it's because when they merged Hortonworks and Cloudera, they dead-ended both platforms, but then they did invest in a more modern platform. But what's the future of Cloudera? What are you seeing out there? >> Cloudera has a good product. I have to say, the problem in our space is that there are way too many companies, there's way too much noise. We are expecting the end users to parse it out, or we're expecting analyst firms to boil it down. So I think marketing becomes a big problem. As far as technology is concerned, I think Cloudera did turn themselves around, and Tony, I know you talk to them quite frequently. I think they have quite a comprehensive offering, and have for a long time, actually. They created Kudu, so they've got operational, they have Hadoop, they have an operational data warehouse, they've migrated to the cloud. They are in hybrid multi-cloud environments. A lot of cloud data warehouses are not hybrid; they're only in the cloud. >> Right. I think where Cloudera has been most successful has been in the transition to the cloud, and the fact that they're giving their customers more on-ramps to it, more hybrid on-ramps. So I give them a lot of credit there. They also have been trying to position themselves as being the most price friendly, in terms of, we will put more guardrails and governors on it. I mean, part of that could be spin, but on the other hand, they don't have the same vested interest in compute cycles as, say, AWS would have with EMR. That being said, yes, Cloudera does it. I think its most powerful appeal is, and it almost sounds in a way like I don't want to cast them as a legacy system, but the fact is they do have a huge landed legacy on-prem and still significant potential to land and expand that to the cloud. That being said, even though Cloudera is multifunction, I think it certainly has its strengths and weaknesses. And the fact is that, yes, Cloudera has an operational database, or an operational data store, kind of like the outgrowth of HBase, but Cloudera is still primarily known for the deep analytics; nobody's going to buy Cloudera, or Cloudera Data Platform, strictly for the operational database. They may use it as an add-on, in the same way that a lot of customers have used, let's say, Teradata to do some machine learning, or, let's say, Snowflake to parse through JSON. Again, it's not an indictment or anything like that, but the fact is, obviously they have their strengths and their weaknesses. I think their greatest opportunity is with their existing base, because that base has a lot invested, and vested. And the fact is, they do have a hybrid path that a lot of the others lack. >> And of course, being on the quarterly shot clock was not a good place to be under the microscope for Cloudera, and now they can at least refactor the business accordingly. I'm glad you mentioned hybrid too. We saw Snowflake last month do a deal with Dell, whereby Snowflake could access non-native data in on-prem object stores from Dell. They announced a similar thing with Pure Storage. What do you guys make of that? How significant will that be? Will customers actually do that? I think they're using either materialized views or external tables. >> There are data-related and residency requirements. There are desires to have these platforms in your own data center. And finally, they capitulated. I mean, Frank Slootman is famous for being very focused, and earlier, not many months ago, they called going on-prem a distraction. But clearly there's enough demand, and certainly for government contracts, or any company that has data residency requirements, it's a real need. So they finally addressed it. >> Yeah, I'll bet dollars to donuts there was an EBC session, and some big customer said, if you don't do this, we ain't doing business with you. And that was like, okay, we'll do it. >> So Dave, I have to say, earlier on you had brought up this point, how Frank Slootman was poo-pooing data science workloads. On your show, about a year or so ago, he said, we are never going to go on-prem. He burnt that bridge. (Tony laughs) That was on your show. >> I remember exactly the statement, because it was interesting. He said, we're never going to do the halfway house. And I think what he meant is, we're not going to bring the Snowflake architecture to run on-prem, because it defeats the elasticity of the cloud. So this was kind of a capitulation, in a way.
But I think it still preserves his original intent, sort of. I don't know. >> The point here is that every vendor will poo-poo whatever they don't have until they do have it. >> Yes. >> And then it'll be like, oh, we're all in, we've always been doing this. We have always supported this, and now we are doing it better than the others. >> Look, it was the same type of shock wave that we felt when AWS, at the last moment at one of their re:Invents, said, oh, by the way, we're going to introduce Outposts. The analyst group is typically pre-briefed about a week or two ahead under NDA, and that was not part of it. And they just casually dropped that in the analyst session. It's like, you could have heard the sound of lots of analysts changing their diapers at that point. >> (laughs) I remember that. And props to Andy Jassy, who many times actually told us, never say never when it comes to AWS. So guys, I know we've got to run, we've got some hard stops. Maybe you could each give us your final thoughts. Doug, start us off, and then-- >> Sure. Well, we've got the Snowflake Summit coming up. I'll be looking for customers that are really doing data science, that are really employing Python through Snowflake, through Snowpark. And then a couple weeks later, we've got Databricks with their Data and AI Summit in San Francisco. I'll be looking for customers that are really doing considerable BI workloads. Last year I did a market overview of this analytical data platform space, 14 vendors; eight of them claim to support the lakehouse, both sides of the camp. Databricks' top customer that they could cite was unnamed; it had 32 concurrent users doing 15,000 queries per hour. That's good, but it's not up to the most demanding BI SQL workloads, and they acknowledged that and said they need to keep working on it. Snowflake, asked for their biggest data science customer, cited Kabura: 400 terabytes, 8,500 users, 400,000 data engineering jobs per day. I took the data engineering jobs to be probably SQL centric, ETL style transformation work. So I want to see the real use of Python, how much Snowpark has grown as a way to support data science. >> Great. Tony. >> Actually, of all things, and certainly I'll also be looking for similar things to what Doug is saying, but I think, sort of out of left field, I'm interested to see what MongoDB is going to start to say about operational analytics, 'cause I mean, they're into this conquer the world strategy: we can be all things to all people. Okay, if that's the case, what's going to be the case with basically putting in some inline analytics? What are you going to be doing with your query engine? So that's actually kind of an interesting thing we're looking for next week. >> Great. Sanjeev. >> So I'll be at MongoDB World, Snowflake, and Databricks, and very interested in seeing, since Tony brought up MongoDB, I see that even the databases are shifting tremendously. They are addressing both the HTAP use cases, online transactional and analytical. I'm also seeing that these databases started, let's say in the case of MySQL HeatWave, as relational, or in MongoDB's case as document, but now they've added graph, they've added time series, they've added geospatial, and they just keep adding more and more data structures, really making these databases multifunctional. So very interesting. >> It gets back to our discussion of best of breed versus all in one.
And it's likely Mongo's path, or part of their strategy of course, is through developers. They're very developer focused, so we'll be looking for that. And guys, I'll be there as well. I'm hoping that we maybe have some extra time on theCUBE, so please stop by and we can maybe chat a little bit. Guys, as always, fantastic. Thank you so much, Doug, Tony, Sanjeev, and let's do this again. >> It's been a pleasure. >> All right, and thank you for watching. This is Dave Vellante for theCUBE and the excellent analysts. We'll see you next time. (upbeat music)
Greg Rokita, Edmunds.com & Joel Minnick, Databricks | AWS re:Invent 2021
>> Welcome back to theCUBE's coverage of AWS re:Invent 2021, the industry's most important hybrid event. Very few hybrid events, of course, in the last two years, and theCUBE is excited to be here. This is our ninth year covering AWS re:Invent, this the 10th re:Invent. We're here with Joel Minnick, who is the vice president of product and partner marketing at smoking hot company Databricks, and Greg Rokita, who is executive director of technology at Edmunds. If you're buying a car or leasing a car, you've got to go to Edmunds. We're going to talk about busting data silos, guys. Great to see you again. >> Welcome. Welcome. Glad to be here. >> All right. So Joel, what the heck is a lakehouse? This is all over the place. Everybody's talking about lakehouse. What is it? >> Well, in a nutshell, a lakehouse is the ability to have one unified platform to handle all of your traditional analytics workloads. So your BI and reporting, traditionally the workloads you would have for your data warehouse, on the same platform as the workloads that you would have for data science and machine learning. And so if you think about kind of the way that most organizations have built their infrastructure in the cloud today, what we have is, generally, customers will land all their data in a data lake, and a data lake is fantastic because it's low cost, it's open, it's able to handle lots of different kinds of data. But the challenge that data lakes have is that they don't necessarily scale very well: it's very hard to govern data in a data lake, it's very hard to manage that data in a data lake. And so what happens is that customers then move the data out of the data lake into downstream systems, and what they tend to move it into are data warehouses, to handle those traditional reporting kinds of workloads that they have. And they do that because data warehouses are really great at being able to have really great scale and really great performance. The challenge, though, is that data warehouses really only work for structured data, and regardless of what kind of data warehouse you adopt, all data warehouse platforms today are built on some kind of proprietary format. So once you've put that data into the data warehouse, that is what you're locked into. The promise of the data lakehouse was to say, look, what if we could strip away all of that complexity of having to move data back and forth between all these different systems, and keep the data exactly where it is today, and where it is today is in the data lake. And then we apply a transaction layer on top of that. In the Databricks case, we do that through an open source technology called Delta Lake. And what Delta Lake allows us to do is, when you need it, apply the performance, the reliability, the quality, the scale that you would expect out of a data warehouse directly on your data lake. And if I can do that, then what I'm able to do now is operate from one single source of truth that handles all of my analytics workloads, both my traditional analytics workloads and my data science and machine learning workloads. And being able to have all of those workloads on one common platform means that not only does my infrastructure get much, much more simple, and therefore I'm able to operate at much lower cost, I'm also able to get things to production much, much faster.
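To ground what Joel is describing, here is a minimal sketch of that transaction layer in action, using the open source Delta Lake APIs. The paths are illustrative, a Spark session with Delta Lake configured is assumed, and this is a toy example rather than the managed service itself.

```python
# Sketch: Delta Lake lays a transaction log over plain Parquet files in
# the lake, which is what buys ACID writes, consistent reads, and time
# travel. Paths are hypothetical; a Delta-enabled Spark session is assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "bi"), (2, "ml")], ["id", "workload"])

# Each write is an atomic commit recorded in the table's _delta_log.
df.write.format("delta").mode("overwrite").save("/lake/demo")

# Readers always see a consistent snapshot of the table...
spark.read.format("delta").load("/lake/demo").show()

# ...and can query an earlier version of the same files (time travel).
spark.read.format("delta").option("versionAsOf", 0).load("/lake/demo").show()
```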
But I'm also now able to leverage open source in a much bigger way, being that the lakehouse is inherently built on an open platform, so I'm no longer locked into any kind of data format. And finally, probably one of the most lasting benefits of a lakehouse is that all the roles that have to touch my data, from my data engineers to my data analysts to my data scientists, are all working on the same data, which means that the collaboration that has to happen to go answer really hard problems with data is now much, much easier, because those silos that traditionally exist inside of my environment no longer have to be there. And so the lakehouse is that promise: one single source of truth, one unified platform for all of my data. >> Okay, great. Thank you for that very cogent description of what a lakehouse is. Now I want to hear from the customer to see, okay, is what he just said true? So actually, let me ask you this, Greg, because the other problem that you didn't mention about the data lake is that with no schema on write, it gets messy. And Databricks, I think, correct me if I'm wrong, has begun to solve that problem, right? Through a series of tooling and AI. That's what Delta Lake does. It's a managed service. Everybody thought you were going to be like the Cloudera of Spark, but it was a brilliant move to create a managed service, and it's worked great. Now everybody has a managed service. So can you paint a picture at Edmunds as to what you're doing? Maybe take us through your journey: the early days of Hadoop, a data lake, oh, that sounds good, throw it in there. Paint a picture as to how you guys are using data, and then tie it into what y'all just said. >> As Joel said, it simplifies the architecture quite a bit. In a modern enterprise, you have to deal with a variety of different data sources: structured, semi-structured, and unstructured, in the form of images and videos. And with Delta Lake and the lakehouse, you can have one system that handles all those data sources. What that does is basically remove the issue of multiple systems that you have to administer. It lowers the cost, and it provides consistency. If you have multiple systems that deal with data, there always arises the issue of which data has to be loaded into which system, and then you have issues with consistency. Once you have issues with consistency, business users and analysts will stop trusting your data. So it was very critical for us to unify the handling of data in one place. Additionally, you have massive scalability. I went to the talk from Apple saying that, you know, they can process two years' worth of data instead of just two days. At Edmunds, we have this use case of backfilling the data: often we change the logic, and then we need to reprocess massive amounts of data. With the lakehouse, we can reprocess months' worth of data in a matter of minutes or hours. Additionally, the data lakehouse is based on open standards, like Parquet, and that allowed us to basically hook open source and third-party tools on top of the Delta lakehouse. For example, Amundsen, we use Amundsen for data discovery. And finally, the lakehouse approach allows for different skill sets of people to work on the same source data.
We have analysts, we have data engineers, we have statisticians and data scientists, using their own programming languages, but working on the same core data sets, without worrying about duplicating data and consistency issues between the teams. >> So what are the primary use cases where you're using the lakehouse, Delta? >> So we have several use cases; one of the more interesting and important ones is vehicle pricing. You have used Edmunds, so you know: you go to our website and you use it to research vehicles, but it turns out that pricing, and knowing whether you're getting a good or bad deal, is critical for our business. So with the lakehouse, we were able to develop a data pipeline that ingests the transactions, curates the transactions, cleans them, and then feeds that curated feed into the machine learning model that is also deployed on the lakehouse. So you have one system that handles this huge complexity. And as you know, it's very hard to find unicorns that know all those technologies, but because we have the flexibility of using Scala, Java, Python, and SQL, we have different people working on different parts of that pipeline, on the same system and on the same data. So having the lakehouse really enabled us to be very agile, and allowed us to deploy new sources easily when they arrived, and to fine-tune the model to decrease the error rates for the price prediction. So that process is ongoing, and it's a very agile process that takes advantage of the different skill sets of different people on one system. >> Because, you know, you guys democratized car buying, well, at least the data around car buying, because as a consumer now, I know what they're paying and I can go in, of course, but they changed their algorithms as well. I mean, the dealers got really smart, and then they got kickbacks from the manufacturer. So you had to get smarter. So it's a moving target, I guess. >> Great. The pricing is actually very complex. I don't have time to explain it to you, but knowing, especially in this crazy inflationary market where used car prices are like 38% higher year over year, and new car prices are like 10% higher, and they're changing rapidly, having a very responsive pricing model is extremely critical. I don't know if you're familiar with Zillow; I mean, they almost went out of business because they mispriced their houses. So if you own their stock, you're probably underwater on it, but, you know. >> No, but it's true, because my lease came up in the middle of the pandemic, and I went to Edmunds to ask, what's this car worth? It was worth like $7,000 more than the buyout cost, the residual value. I said, I'm taking it, can't pass up that deal. And so you have to be flexible. You're saying the premise, though, is that open source technology and Delta Lake and the lakehouse enabled that flexibility. >> Yes, we are able to ingest new transactions daily, recalculate our model within less than an hour, and deploy the new model with new pricing, you know, almost in real time. So in this environment, it's very critical that you keep up to date and ingest the latest transactions as prices change, and recalculate your model that predicts the future prices.
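A rough sketch of the shape of the pipeline Greg describes: curate the raw transaction feed into a trusted Delta table, then train the pricing model on that same table, with no copy into a separate warehouse. Every path, column, and feature here is hypothetical, not Edmunds' actual model.

```python
# Sketch: curate raw vehicle transactions, then train a pricing model on
# the same curated Delta table. Paths, columns, and features are invented.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor

spark = SparkSession.builder.getOrCreate()

# Curate: dedupe and filter the raw feed into a trusted table.
raw = spark.read.format("delta").load("/lake/raw/vehicle_transactions")
curated = (raw.dropDuplicates(["vin", "sale_date"])
              .filter(F.col("sale_price") > 0))
curated.write.format("delta").mode("overwrite").save("/lake/curated/transactions")

# Train: the curated table feeds the model directly on the same platform.
features = VectorAssembler(
    inputCols=["mileage", "model_year", "days_on_lot"],
    outputCol="features",
).transform(curated)
model = GBTRegressor(featuresCol="features", labelCol="sale_price").fit(features)
```

Because both steps read and write the same Delta tables, re-running the whole thing hourly, as Greg describes, is just a matter of scheduling.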
>> Because the business lines inside of Edmunds interact with the data teams, you mentioned data engineers, data scientists, analysts, how do the business people get access to their data? >> Originally, we only had a core team that was using the lakehouse, but because the usage was so powerful and easy, we were able to democratize it across our units. So other teams within software engineering picked it up, and then analysts picked it up, and then even business users started using the dashboarding, seeing how prices have changed over time and seeing other metrics within the business. >> What did that do for data quality? Because I feel like if I'm a business person, I might have context on the data that an analyst might not have, if they're part of a team that's servicing all these lines of business. Did you find that the collaboration affected data quality? >> The biggest thing for us was the fact that we don't have multiple systems now. Whenever you have to load the data from one system to another, there is always a lag, there's always a delay, there is always a problematic job that didn't do the copy correctly, and the quality is uncertain. You don't know which system tells you the truth. Now we just have one layer of data. Whether you do reports, whether you do data processing, or whether you do modeling, they all read the same data. And the second thing is that with the dashboarding capabilities, people that were not very technical, who before could only use Tableau, and Tableau is not the easiest thing to use if you're not technical, now they can use it. So anyone can see how our pricing data looks, whether you're an executive, an analyst, or a casual business user. >> Hey, so many questions; you guys are going to have to come back, I'm going to run out of time. But you now allow a consumer to sell a car directly, right? So that's a new service that you launched. I presume that required new data. >> We give consumers offers, yes. >> Like the offer to buy out my lease. >> Exactly. And that offer leverages the pricing that we developed on top of the lakehouse. So the most important thing is accurately giving you a very good offer price. If we give you a price that's not so good, you're going to go somewhere else. If we give you a price that's too high, we're going to go bankrupt like Zillow did, right? >> So to enable that, you're working off the same data set? >> Yes. >> Did you have to inject new data? Was there a new data source you had to work in? >> Once we curate the data sources and clean them, we feed them directly to the model, and all of those components are running on the lakehouse, whether you're curating the data, cleaning it, or running the model. The nice thing about the lakehouse is that machine learning is a first-class citizen. If you use something like Snowflake, and I'm not going to slam Snowflake here... >> You have two different use cases. >> You have to load it into a different system later. So, like, good luck doing machine learning on Snowflake, right? >> Whereas with Databricks, that's kind of your raison d'être. >> I feel like I should be a salesman or something, but I'm not saying it because I was told to; I'm saying it because that's our use case. >> Your use case.
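(Greg's point that machine learning is a first-class citizen, with no loading into a second system, can be sketched the same way. Continuing the hypothetical paths and columns from the pipeline sketch above, the trained model is loaded and scored right where the curated data lives; again, this is illustrative, not Edmunds' actual code.)

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressionModel

# Assumes the same Delta-enabled Spark configuration as the
# pipeline sketch above.
spark = SparkSession.builder.appName("offer-pricing").getOrCreate()

# Load the most recently trained pricing model straight from the
# lakehouse -- no export to a separate ML system.
model = GBTRegressionModel.load("/lake/models/vehicle_pricing/latest")

# Score pending consumer offer requests against the same curated data.
pending = spark.read.format("delta").load("/lake/curated/offer_requests")
assembler = VectorAssembler(
    inputCols=["vehicle_age", "mileage", "model_year"],
    outputCol="features")
offers = (model.transform(assembler.transform(pending))
          .withColumnRenamed("prediction", "offer_price")
          .select("vin", "offer_price"))

# Offer prices land back in the lakehouse, where the same dashboards
# the business users already query can read them immediately.
offers.write.format("delta").mode("overwrite").save("/lake/curated/offers")
```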
So a question for each of you: what business results did you see pre-lakehouse versus post-lakehouse? Are there any metrics you can share? And then, Joel, I wonder if you could share more broadly what you're seeing across your customer base. But Greg, what can you tell us? >> Before the lakehouse, we had two different systems. We had one for processing, which was still Databricks, and a second one for serving, where we iterated over Netezza and Redshift. But we figured out that maintaining two different systems and loading data from one to the other was a huge overhead: administration, security, costs, and the fact that you had consistency issues. So the fact that you can have one system with centralized data solves all those issues. You have one security mechanism, one administrative mechanism, and you don't have to load the data from one system to the other. You don't have to make compromises. >> And scale is not a problem, because of the cloud? >> Because you can spin up clusters at will for different use cases, your clusters are independent. You have processing clusters that are not affecting your serving clusters. In the past, if you were serving, say, on Netezza or Redshift and you were doing heavy processing, your reports would be affected. But now all those clusters are separated. >> So the data consumer can take that data from the producer independently. >> Using its own cluster. >> Okay. I'll give you the final word, Joel. Like I said, you guys have got to come back. What have you seen broadly? >> Yeah, well, I think Greg's point about scale is an interesting one. If you look across the entire Databricks platform, the platform is launching 9 million VMs every day, and in total we're processing over nine exabytes a month. So in terms of just how much data the platform is able to flow through it, while still maintaining extremely high performance, it is bar none out there. And then, if you look at the macro environment of what's happening out there, I think what's been most exciting to watch is what customers are experiencing on traditional data warehouse kinds of workloads, because I think that's where the promise of the lakehouse really comes into its own: saying, yes, I can run these traditional data warehousing workloads that require high concurrency, high scale, and high performance directly on my data lake. And probably the two most salient data points to raise there: just last month, Databricks announced it set the world record for the TPC-DS 100-terabyte benchmark. That benchmark is built to measure data warehouse performance, so that is a place where Databricks, with the lakehouse architecture, beat data warehouses at their own game in terms of overall performance. And then, what that means from a price-performance standpoint: customers on Databricks right now are able to enjoy that level of performance at 12x better price-performance than what cloud data warehouses provide. So not only are we hitting this extremely high scale and performance, we're able to do it much, much more efficiently. >> We're going to need a whole other segment to talk about benchmarking, guys. Thanks so much, really interesting session, thank you both, and best of luck. >> Thank you for having us. >> You're very welcome. Okay.
Keep it right there, everybody. You're watching theCUBE, the leader in high-tech coverage, at AWS re:Invent 2021.
Did HPE GreenLake Just Set a New Bar in the On-Prem Cloud Services Market?
>> Welcome back to theCUBE's coverage of HPE's GreenLake announcements. My name is Dave Vellante and you're watching theCUBE. I'm here with Holger Mueller, who is an analyst at Constellation Research. And Matt Maccaux is the global field CTO of Ezmeral software at HPE. We're going to talk data. Gents, great to see you. >> Holger: Great to be here. >> So, Holger, what do you see happening in the data market? Obviously data's hot, you know, digital, I call it the forced march to digital. Everybody realizes wow, digital business, that's a data business. We've got to get our data act together. What do you see in the market as the big trends, the big waves? >> We are all young enough or old enough to remember when people were saying data is the new oil, right? Nothing has changed, right? Data is the key ingredient which matters to the enterprise, which they have to store, which they have to enrich, which they have to use for their decision-making. It's the foundation of everything, if you want to go into machine learning or (indistinct). It's growing very fast, right? We have the capability now to look at all the data in the enterprise, which we weren't able to do 10 years ago. So data is central to everything. >> Yeah, it's even more valuable than oil, I think, right? 'Cause oil, you can only use once. Data is kind of polyglot; I can go in different directions, and it's amazing, right? >> It's the beauty of digital products, right? They don't get consumed, right? They don't get used up, right? And no carbon footprint, right? "Oh wait, wait, we have to think about carbon footprint." Different story, right? So to get to the data, you have to spend some energy. >> So it's that simple, right? I mean, it really is. Data is fundamental. It's got to be at the core. And so Matt, what are you guys announcing today, and how does that play into what Holger just said? >> What we're announcing today is that organizations no longer need to make a difficult choice. Prior to today, organizations were thinking, if I'm going to do advanced machine learning and really exploit my data, I have to go to the cloud. But all my data's still on premises because of privacy rules, industry rules. And so what we're announcing today, through GreenLake Services, is a cloud services way to deliver that same cloud-based analytical capability: machine learning, data engineering, through hybrid analytics. It's a unified platform to tie together everything from data engineering to advanced data science. And we're also announcing the world's first Kubernetes-native object store that is hybrid cloud enabled, which means you can keep your data connected across clouds in a data fabric or, Dave, as you say, mesh. >> Okay, can we dig into that a little bit? So, you're essentially saying that you're going to have data in both places, right? Public cloud, edge, on-prem, and you're saying HPE is announcing a capability to connect them, I think you used the term fabric. I'm cool, by the way, with the term fabric; we'll parse that out another time. >> I'd love for you to discuss textiles. Fabrics vs. mesh. For me, every fabric breaks down to mesh if you put it under a microscope. It's the same thing. >> Oh wow, now that's really, that's too detailed for my brain right this moment. But you're saying you can connect all those different estates, because data by its very nature is everywhere. You're going to unify that, and what, you can manage that through sort of a single view? >> That's right.
So, the management is centralized. We need to be able to know where our data is being provisioned. But again, we don't want organizations to feel like they have to make the trade-off. If they want to use cloud service A in Azure and cloud service B in GCP, why not connect them together? Why not allow the data to remain in sync, or not, through a distributed fabric? Because we use that term fabric over and over again. But the idea is, let the data be where it most naturally makes sense, and exploit it. Monetization is an old term, but exploit it in a way that works best for your users and applications. >> In sync or not, that's interesting. So it's my choice? >> That's right. Because the back of an automobile could be a teeny tiny, small edge location. It's not always going to be in sync until it connects back up with a training facility. But we still need to be able to manage that. And maybe that data gets persisted to a core data center. Maybe it gets pushed to the cloud. But we still need to know where that data is, where it came from, its lineage, what quality it has, what security we're going to wrap around it. That all should be part of this fabric. >> Okay. So, you've got essentially a governance model, at least maybe you're working toward that, and maybe it's not all baked today, but that's the north star. Is this fabric connected, single management view, governed in a federated fashion? >> Right. And it's available through the most common APIs that these applications are already written in. So, everybody today is talking S3: I've got to get all of my data, I need to put it into an object store, it needs to be S3 compatible. So, we are extending this capability to be S3 native, but optimized for performance. Today, when you put data in an object store, it's kind of one size fits all. Well, we know that for those streaming analytical capabilities, those high-performance workloads, it needs to be tuned for that. So, how about I give you a very small object on the very fastest disk in your data center, and maybe a cheaper location somewhere else. And so we're giving you that balance as part of the overall management estate. >> Holger, what's your take on this? I mean, Frank Slootman says we'll never, we're not going halfway house, we're never going to do on-prem, we're only in the cloud. So that basically says, okay, he's ignoring a pretty large market by choice. You're not. Matt, you must love those words. But what do you see as the public cloud players' kind of moves on-prem, particularly in this realm? >> Well, we've seen lots of cloud players who were only cloud coming back towards on-premise, right? We call it the next generation compute platform, where I can move data and workloads between on-premise and, ideally, multiple clouds, right? Because I don't want to be locked into public cloud vendors. And we see two trends, right? One trend is that the traditional hardware suppliers of on-premise have not scaled to cloud technology in terms of big data analytics. They just missed the boat for that in the past; this is changing. You guys are a traditional player changing this, so congratulations. The other thing is there's been no innovation for the on-premise tech stack, right? The only technology stack to run modern applications has for a long time been invested in the cloud. So that's what we've seen over the last two, three years, right? With the first one being Google with Kubernetes, which came out with GKE On-Prem and then Anthos, right?
Bringing their tech stack, with compromises, to on-premises, right? Acknowledging exactly what we're talking about: the data is everywhere, data is important, data gravity is there, right? It's just the networks' fault, where the networks are too slow, right? If you could just move everything anywhere we want, like juggling two balls, then we'd be in a different place. But there hasn't been enough investment in the traditional IT players' stack, with the modern stack being in the cloud. And now every public cloud player has an on-premise offering, with different flavors, different capabilities. >> I want to give you guys Dave's story of kind of history, and you can kind of course correct and tell me how this, Matt, maybe fits into what's happened with customers. So, you know, before Hadoop, obviously you had to buy a big Oracle database and, you know, you were running Unix, and you'd buy some big storage subsystem, and if you had any money left over, you know, maybe you'd do some actual analytics. But then Hadoop comes in, lowers the cost, and then S3 kneecaps the entire Hadoop market, right? >> I wouldn't say that, I wouldn't agree. Sorry to jump on your history. Because the fascinating thing that Hadoop brought to the enterprise for the first time, you're absolutely right, was affordability. But it's not only about affordability, because S3 has the affordability. The big thing is you can store information without knowing how to analyze it, right? So, you mentioned Snowflake, right? Before, it was like an Oracle database; it was star schema for the data warehouse, and so on. You had to make decisions on how to store that data, because compute capabilities and storage capabilities were too limited, right? That's what Hadoop blew away. >> I agree, no schema on write, right. But then that created data lakes, which created data swamps, and that whole mess, and then Spark comes in and helps clean it out, okay, fine. So, we're cool with that. But in the early days of Hadoop, companies would have a Hadoop monolith; they probably had their data catalog in Excel or Google Sheets, right? And so now, my question to you, Matt, is there are a lot of customers that are still in that world. What do they do? They've got an option to go to the cloud. I'm hearing that you're giving them another option? >> That's right. So we know that data is going to move to the cloud, as I mentioned. So let's keep that data in sync, and governed, and secured, like you expect. But for the data that can't move, let's bring those cloud-native services to your data center. And so a big part of this announcement is this unified analytics, so that you can continue to run the tools that you want to today while bringing in those next generation tools: based on Apache Spark, using libraries like Delta Lake, so you can go from anything from Tableau through Presto SQL to advanced machine learning in your Jupyter notebooks, on-premises, where you know your data is secured. And if it happens to sit in an existing Hadoop data lake, that's fine too. We don't want our customers to have to make that trade-off as they go from one to the other. Let's give you the best of both worlds or, as they say, you can eat your cake and have it too. >> Okay, so now let's talk about sort of developers on-prem, right? They've been kind of... If they really wanted to go cloud native, they had to go to the cloud. Do you feel like this changes the game? Do on-prem developers want that capability? Will they lean into that capability?
Or will they say, no, no, the cloud is cool. What's your take? >> I love developers, right? But it's about who makes the decision, who pays the developers, right? The CXOs in the enterprises, they need exactly this. This is why we call it the next-gen computing platform: you can move your code assets. It's very hard to build software, so it's very valuable to an enterprise. I don't want to be limited to one single location or certain computing infrastructure, right? Luckily, we have Kubernetes to be able to move that, but I want to be able to deploy it on-premise if I have to. I want to be able to deploy it in the multiple clouds which are available. And that's the key part. And that makes developers happy too, because the code you write has got to run in multiple places. So you can build more code, better code, instead of building the same thing in multiple places because of a little compiler change here, a little compiler change there. Nobody wants to do portability testing and rewriting, recertifying for certain platforms. >> The head of application development or application architecture and the business are ultimately going to dictate that, number one. Number two, you're saying that developers shouldn't care, because they can write once, run anywhere. >> That is the promise, and that's the interesting thing which is available now, thanks to Kubernetes as a container platform and the abstraction which containers provide, and that makes everybody's life easier. But it goes much higher than the Head of Apps, right? This is the digital transformation strategy: the next generation of applications the company has to build as a response to a pandemic, as a pivot, as a digital transformation, as a digital disruption capability. >> I mean, I see a lot of organizations basically modernizing by building some kind of abstraction to their backend systems, modernizing it through cloud native, and then saying, hey, as you were saying, Holger, run it anywhere you want, or connect to those cloud apps, or connect across clouds, connect to other on-prem apps, and eventually out to the edge. Is that what you see? >> It's so much easier said than done, though. Organizations have struggled so much with this, especially as we start talking about those data-intensive apps and workloads. Kubernetes and Hadoop? Up until now, organizations haven't been able to deploy those services. So, what we're offering as part of these GreenLake unified analytics services is a Kubernetes runtime. It's not ours; it's top-of-branch open source, and open source operators like Apache Spark, bringing in Delta Lake libraries, so that if your developer does want to use cloud-native tools to build those next generation advanced analytics applications, but prod is still on-premises, they should just be able to pick that code up. And because we are deploying 100% open-source frameworks, the code should run as is. >> So, it seems like the strategy is to basically build, now that's what GreenLake is, right? It's a cloud. It's like, hey, here's your options, use whatever you want. >> Well, and it's your cloud. That's what's so important about GreenLake: it's your cloud, in your data center or co-lo, with your data, your tools, and your code.
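(A short sketch of the write-once, run-anywhere point Holger and Matt are making: the same PySpark job can target a Kubernetes cluster in a data center, a public cloud, or an edge site, reading from whatever S3-compatible object store is local, which is how Matt described the GreenLake object store being exposed. All hostnames, image names, and bucket paths below are hypothetical.)

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("portable-analytics")
         # Swap the master URL per environment: a data-center cluster,
         # a managed cloud cluster, or an edge-site cluster.
         .master("k8s://https://k8s.example.internal:6443")
         .config("spark.kubernetes.container.image",
                 "registry.example.com/spark:3.1.2")
         .config("spark.executor.instances", "4")
         # Point s3a at whatever S3-compatible endpoint is local,
         # instead of hard-coding a public cloud region.
         .config("spark.hadoop.fs.s3a.endpoint",
                 "https://objects.example.internal")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

# The job body never changes across environments.
events = spark.read.parquet("s3a://telemetry/raw/")
(events.groupBy("device_id").count()
       .write.mode("overwrite").parquet("s3a://telemetry/daily-counts/"))
```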
And again, we know that organizations are going to go to a multi or hybrid cloud location, and through our management capabilities we can reach out. If you don't want us to control those, not necessarily, that's okay, but we should at least be able to monitor and audit the data that sits in those other locations and the applications that are running. Maybe I register your GKE cluster. I don't manage it, but at least through a central pane of glass I can tell the Head of Applications what that team's utilization is across these environments. >> You know, you said something, Matt, that struck me, resonated with me, which is, this is not trivial. I mean, it's not as simple to do. I mean, you see a lot of customers or companies, vendors, what they're doing is they'll wrap their stack in Kubernetes, shove it in the cloud; it's essentially a hosted stack, right? And you're kind of taking a different approach. You're saying, hey, we're essentially building a cloud that's going to connect all these estates. And the key is you're going to have to keep innovating, and you are, I think that's probably part of the reason why we're here, announcing stuff very quickly. A lot of innovation has to come out to satisfy that demand that you're essentially talking about. >> Because we've oversimplified things with containers, right? Because containers don't have what matters for data, and what matters for the enterprise, which is persistence, right? I have to be able to turn my systems down, or I don't know when I'm going to use that data, but it has to stay there. And that's not solved in the container world by itself. And that's what's coming now: the heavy lifting is done by people like HPE, to provide that persistence of the data across the different deployment platforms. And then, there's just a need to modernize my on-premise platforms, right? I can't run on a server which is two, three years old, right? It's no longer safe, it doesn't have trusted identity, all the good stuff that you need these days, right? It cannot be operated remotely, or whatever happens there. Two, three years is long enough for a server to have run its course, right? >> Well, you're a software guy, you hate hardware anyway, so just abstract that hardware complexity away from you. >> Hardware is the necessary evil, right? It's like TSA. I want to go somewhere, but I have to go through TSA. >> But that's a key point: let me buy a service. If I need compute, give it to me, and if I don't, I don't want to hear about it, right? And that's kind of the direction that you're headed. >> That's right. >> Holger: That's what you're offering. >> That's right, and specifically the services. So GreenLake's been offering infrastructure, virtual machines, IaaS, as a service. And we want to stop talking about that underlying capability, because it's a dial tone now. What organizations and these developers want is the service. Give me a service or a function, like I get in the cloud, but I need to get going today. I need it within my security parameters, with access to my data, my tools, so I can get going as quickly as possible. And then beyond that, we're going to give you those cloud billing practices. Because just because you're deploying a cloud-native service, if you're still being deployed via CapEx, you're not solving a lot of problems. So we also need to have that cloud billing model. >> Great. Well Holger, we'll give you the last word, bring us home.
>> It's very interesting to have the cloud qualities of subscription-based pricing maintained by HPE as the cloud vendor from somewhere else. And that gives you that flexibility. And that's very important, because data is essential to enterprise processes. And there are three reasons why data doesn't go to the cloud, right? We know that. It's privacy and residency requirements, where there is no cloud infrastructure in the country. It's performance, because network latency plays a role, right? Especially for critical applications. And then there's not invented here, right? Remember Charles Phillips saying, tell me how old the CIO is and I'll tell you if they're going to go to the cloud or not, right? So, it's not invented here. These are the things which keep data on-premise. You know that well, and HPE is coming out with a very interesting offering. >> It's physics, it's laws, it's politics, and sometimes it's cost, right? Sometimes it's too expensive to move and migrate. Guys, thanks so much. Great to see you both. >> Matt: Dave, it's always a pleasure. >> All right, and thank you for watching theCUBE's continuous coverage of HPE's big GreenLake announcements. Keep it right there for more great content. (calm music begins)
Next Gen Analytics & Data Services for the Cloud that Comes to You | An HPE GreenLake Announcement
(upbeat music) >> Welcome back to theCUBE's coverage of HPE GreenLake announcements. We're seeing the transition of Hewlett Packard Enterprise as a company; yes, they're going all in for as-a-service, but we're also seeing a transition from a hardware company to what I look at increasingly as a data management company. We're going to talk today to Vishal Lall, who leads GreenLake cloud services solutions at HPE, and Matt Maccaux, who's a global field CTO, Ezmeral Software at HPE. Gents, welcome back to theCUBE. Good to see you again. >> Thank you for having us here. >> Thanks Dave. >> So Vishal, let's start with you. What are the big mega trends that you're seeing in data? When you talk to customers, when you talk to partners, what are they telling you? What do your optics say? >> Yeah, I mean, I would say the first thing is data is getting even more important. It's not that data hasn't been important for enterprises, but over the last, I would say, 24 to 36 months, it's become really important, right? And it's become important because customers look at data and they're trying to stitch data together across different sources, whether it's marketing data, supply chain data, financial data. And they're looking at that as a source of competitive advantage. So, enterprises that are able to make sense out of that data really do have a competitive advantage, right? And they actually get better business outcomes. So that's really important, right? If you start looking at where we are from an analytics perspective, I would argue we are in maybe the third generation of data analytics. The first was in the 80's and 90's, with data warehousing, kind of EDW. A lot of companies still have that; think of Teradata, right? The second generation, more in the 2000s, was around data lakes, right? And that was all about Hadoop and others, and really the difference between the first and the second generation was that the first generation was more around structured data, right? The second became more about unstructured data, but you really couldn't run transactions on that data. And I would say now we are entering this third generation, which is about data lakehouses, right? What customers, or enterprises, really want is structured data and unstructured data altogether. They want to run transactions on it, right? They want to mine the data for machine learning purposes, and use it for SQL as well as non-SQL, right? And that's kind of where we are today. So, those are the top trends we are hearing from our customers, and that's how we are thinking about our strategy in the context of those trends. >> So, lakehouse, you used that term. It's an increasingly popular term. It connotes, okay, I've got the best of the data warehouse and I've got the best of the data lake; I'm going to try to simplify the data warehouse, and I'm going to try to clean up the data swamp, if you will. Matt, so, talk a little bit more about what you guys are doing specifically and what that means for your customers. >> Well, what we think is important is that there has to be a hybrid solution. Organizations are going to build their analytics and deploy algorithms where the data either is being produced or where it's going to be stored. And that could be anywhere. That could be in the trunk of a vehicle.
It could be in a public cloud or, in many cases, it's on-premises in the data center. And where organizations struggle is they feel like they have to make a choice and a trade-off going from one to the other. And so what HPE is offering is a way to unify the experiences of these different applications, workloads, and algorithms, while connecting them together through a fabric, so that the experience is tied together with consistent security policies, without having to refactor your applications, and deploying tools like Delta Lake to ensure that the organization that needs to build a data product in one cloud, or deploy another data product in the trunk of an automobile, can do so. >> So, Vishal, I wonder if we could talk about some of the patterns that you're seeing with customers as you go to deploy solutions. Are there industry patterns? Are there any sort of things you can share that you're discerning? >> Yeah, absolutely. As we hear back from our customers across industries, the problem sets are very similar, right? Whether you look at healthcare customers, telco customers, consumer goods, financial services, they're all quite similar. I mean, what are they looking for? They're looking to make sense of, and get business value from, the data, breaking down the silos that I think Matt spoke about just now, right? How do I stitch intelligence across my data silos to get more business intelligence out of them? They're looking for openness. I think the problem that's happened is that, over time, people have realized that they are locked in with certain vendors or certain technologies. So they're looking for openness and choice. That's an important one that we've heard back from our customers. The other one is just being able to run machine learning algorithms on the data. I think that's another important one for them as well. And the last one I would say is TCO. Customers over the last few years have realized that public cloud is starting to become quite expensive for running really large workloads, especially as they want to egress data. So cost-performance trade-offs are starting to become really important and starting to enter into the conversation now. So, I would say those are some of the key things and themes that we are hearing from customers, cutting across industries. >> And you talked about that, Matt, basically being able to essentially leave the data where it belongs, bring the compute to the data. We talk about that all the time. And so that has to include on-prem; it's got to include the cloud. And I'm kind of curious about the edge, where you see that, 'cause that's... Is that an eventual piece? Is that something that's actually moving in parallel? There's a lot of fuzziness, as an observer, in the edge. >> I think the edge is driving the most interesting use cases. The challenge up until recently has been, well, I think it's always been, connectivity, right? Whether we have a poor connection, little connection, or no connection, being able to asynchronously deploy machine learning jobs into some sort of remote location. Whether it's a very tiny edge or a very large edge, like a factory floor, the challenge, as Vishal mentioned, is that if we're going to deploy machine learning, we need some sort of consistency of runtime to be able to execute those machine learning models. Yes, we need consistent access to data, but consistency in terms of runtime is so important.
And I think Hadoop got us started down this path: the ability to very efficiently and cost-effectively run large data jobs against large data sets. And it attempted to work within the open source ecosystem, but because of the monolithic deployment, the tight coupling of the compute and the data, it never achieved that cloud-native vision. And so what Ezmeral at HPE, through GreenLake services, is delivering, with open source-based Kubernetes, open source Apache Spark, and open source Delta Lake libraries, are those same cloud-native services that you can develop on your workstation, deploy in your data center, and, in the same way, deploy through automation out at the edge. And I think that is what's so critical about what we're going to see over the next couple of years. The edge is driving these use cases, but it's the consistency to build and deploy those machine learning models, and connect them consistently with data, that's what's going to drive organizations to success. >> So you're saying you're able to decouple the compute from the storage. >> Absolutely. You wouldn't have a cloud if you didn't decouple compute from storage. And I think this was sort of the demise of Hadoop: it was forcing that coupling. We have high-speed networks now. Whether I'm in a cloud or in my data center, even at the edge, I have high-performance networks. I can now do distributed computing and separate compute from storage. And so if I want to, I can have high-performance compute for my really data-intensive applications, and I can have cost-effective storage where I need to. And by separating that off, I can now innovate at the pace of those individual tools in that open source ecosystem. >> So, can I stay on this for a second? 'Cause you certainly saw Snowflake popularize that; they were kind of early on. I don't know if they were the first, but they're certainly one of the most successful. And you saw Amazon Redshift copy it, and Redshift was kind of a bolt-on. What they essentially did is they tiered it off; you could never turn off the compute, you still had to pay for a little bit of compute, that's kind of interesting. Snowflake has the t-shirt sizes, so there are trade-offs there. There's a lot of ways to skin the cat. How did you guys skin the cat? >> What we believe we're doing is taking the best of those worlds: through GreenLake cloud services, the ability to pay for and provision on demand the computational services you need. So, if someone needs to spin up a Delta Lake job to execute a machine learning model, you spin that up. We're of course spinning that up behind the scenes. The job executes, it spins down, and you only pay for what you need. And we've got reserve capacity there, of course, just like you would in the public cloud. But more importantly, being able to then extend that through a fabric across clouds and edge locations, so that if a customer wants to deploy in some public cloud service, like we know they're going to, again, we're giving that consistency across that, and exposing it through an S3 API. >> So, Vishal, at the end of the day, I mean, I love to talk about the plumbing and the tech, but the customer doesn't care, right? They want the lowest cost. They want the fastest outcome. They want the greatest value. My question is, how are you seeing data organizations evolve to sort of accommodate this third era, this next generation? >> Yeah, the way I look at it at least, from a customer perspective, what they're trying to do, first of all, I think Matt addressed somewhat.
They're looking at a consistent experience across the different groups of people within the company that do something with data, right? It could be SQL users, people who are just writing SQL code. It could be people who are writing machine learning models and running them. It could be people who are writing code in Spark. Right now, the experience is completely disjointed across them, across those three types of users or more. And so that's one thing they're trying to do: just get that consistency. We spoke about performance. I mean, the disaggregation of compute and storage does provide agility, because customers are looking for elasticity. How can I have an elastic environment? So, that's kind of the other thing they're looking at. And performance and TCO, I think, are a big deal now. So, as enterprises look at their data journey, those are at least the attributes that they are trying to hit as they organize themselves to make the most out of the data. >> Matt, you and I have talked about this sort of trend to the decentralized future. We're sort of hitting on that. And whether it's in a first-gen data warehouse, second-gen data lake, data hub, bucket, whatever, that essentially should ideally stay where it is, wherever it should be from a performance standpoint, from a governance standpoint, and a cost perspective, and just be a node on this, I like the term data mesh, but be a node on that, and essentially allow the business owners, those with domain context, you've mentioned data products before, to actually build data products, maybe air quotes, but a data product is something that can be monetized. Maybe it cuts costs, maybe it adds value in other ways. How do you see HPE fitting into that long-term vision, which we know is going to take some time to play out? >> I think what's important for organizations to realize is that they don't have to go to the public cloud to get that experience they're looking for. Many organizations are still reluctant to push all of their data, their critical data, the data that is going to be the next way to monetize the business, into the public cloud. And so what HPE is doing is bringing the cloud to them: bringing that cloud from the infrastructure, the virtualization, the containerization, and most importantly, those cloud-native services. So they can do that development rapidly, test it, using those open source tools and frameworks we spoke about. And if that model ends up being deployed on a factory floor, on some common x86 infrastructure, that's okay, because the lingua franca is Kubernetes, and, as Vishal mentioned, Apache Spark; these are the common tools and frameworks. And so I want organizations to think about this unified analytics experience, where they don't have to trade off security for cost, or efficiency for reliability. HPE, through GreenLake cloud services, is delivering all of that where they need to do it. >> And what about the speed-to-quality trade-off? Have you seen that pop up in customer conversations, and how are organizations dealing with that? >> Like I said, it depends on what you mean by speed. Do you mean computational speed? >> No, accelerating the time to insights, if you will. We've got to go faster, faster, be agile with the data. And it's like, whoa, move fast, break things; whoa, whoa, what about data quality and governance, right? They seem to be at odds. >> Yeah, well, because the processes are fundamentally broken.
You've got a developer who maybe is able to spin up an instance in the public cloud to do their development, but then, to actually do model training, they bring it back on-premises, but they're waiting for a data engineer to make the data available, and then for the tools to be provisioned, which is some esoteric stack, and then runtime is somewhere else. The entire process is broken. So again, by using consistent frameworks and tools, bringing that computation to where the data is, and sort of blowing this construct of pipelines out of the water, I think, is what is going to drive that success in the future. A lot of organizations are not there yet, but that's, I think, aspirationally where they want to be. >> Yeah, I think you're right. I think that is potentially an answer as to how you, not incrementally, but revolutionarily, change sort of the data business. Last question: talking about GreenLake, how this all fits in. Why GreenLake? Why do you guys feel as though it's differentiated in the marketplace? >> So, I mean, something that you asked earlier as well: time to value, right? I think that's a very important attribute and kind of a design factor as we look at GreenLake. If you look at GreenLake overall, what does it stand for? It stands for experience. How do we make sure that we have the right experience for the users, right? We spoke about it in the context of data: how do we have a similar experience for different users of data, but also broadly across an enterprise? So, it's all about experience. How do you automate it, right? How do you automate the workloads? How do you provision fast? How do you give folks a cloud experience, the experience they have been used to in the public cloud, or using an Apple iPhone? So it's all about experience; I think that's number one. Number two is about choice and openness. As we look at it, GreenLake is not a proprietary platform. We are very, very clear that one of the important design principles is choice and openness. And that's the reason you hear us talk about Kubernetes, about Apache Spark, about Delta Lake, et cetera, right? We're using those open source models where customers have a choice. If they don't want to be on GreenLake, they can go to the public cloud tomorrow. Or they can run in our colos, if they want to do it that way, or in their own colos. So they should have the choice. Third is about performance. What we've done, it's not just about the software; we as a company know how to configure infrastructure for the workload. And that's an important part of it. If you think about machine learning workloads, we have the right Nvidia chips that accelerate those transactions. So that's the third one. And the last one, as I spoke about earlier, is cost. We are very focused on TCO, and from a customer perspective, we want to make sure that we are giving a value proposition which is not just about experience, performance, and openness, but also about cost. So if you think about GreenLake, that's the value proposition that we bring to our customers, across those four dimensions. >> Guys, great conversation. Thanks so much, really appreciate your time and insights. >> Matt: Thanks for having us here, David. >> All right, you're welcome. And thank you for watching, everybody. Keep it right there for more great content from HPE GreenLake announcements. You're watching theCUBE. (upbeat music)
Victor Chang, ThoughtSpot | AWS Startup Showcase
(bright music) >> Hello everyone, welcome to today's session of the "AWS Startup Showcase" presented by theCUBE, featuring ThoughtSpot for this track on data and analytics. I'm John Furrier, your host. Today we're joined by Victor Chang, VP of ThoughtSpot Everywhere and Corporate Development for ThoughtSpot. Victor, thanks for coming on and thanks for presenting, talking about building interactive data apps through ThoughtSpot Everywhere. Thanks for coming on. >> Thank you, it's my pleasure to be here. >> So digital transformation is a reality. We're seeing it at large scale. More and more, reports are being turned around fast. People are moving with modern application development, and if you don't have AI, you don't have automation, you don't have the analytics, you're going to get slowed down by other forces, even inside companies. So data is driving everything; data is everywhere. What's the pitch to customers that you guys are making as everyone realizes, "I got to go faster, I got to be more secure," (laughs) "and I don't want to get slowed down"? What's the- >> Yeah, thank you John. No, it's true. I think with digital transformation, what we're seeing basically is everything is done in the cloud, everything gets done in applications, and everything has a lot of data. So basically what we're seeing is, if you look at companies today, whether you are an emerging growth SaaS startup or a traditional company, the way you engage with your customers, that first impression, is usually through some kind of an application, right? And the application collects a lot of data from the users, and the users have to engage with that. So for most of the companies out there, one of the key things they really have to do is find a way to make sense of, and get value for the users out of, their data, and create a delightful and engaging experience. And usually that's pretty difficult these days. You know, if you are an application company, it doesn't really matter what you do, whether you're hotel management or a productivity application, analytics is not typically your strong suit. And where ThoughtSpot Everywhere comes in is, instead of you having to build your own analytics and interactivity experience with your data, ThoughtSpot Everywhere helps deliver a really self-service, interactive experience and transform your application into a data application. And with digital transformation these days, all applications have to engage, all applications have to delight, and all applications have to be self-service. And with analytics, ThoughtSpot Everywhere brings that to your customers and your users. >> So a lot of the mainstream enterprises, and even businesses from SMBs, small businesses, that are in the cloud are scaling up; they're seeing the benefits. What's the problem that you guys are targeting? What's the use case? When does a potential customer know that ThoughtSpot is needed, to be called in and to work with? Is it that they want low code, no code? Is it more democratization? What's the problem statement, and how do you guys turn that problem being solved into an opportunity and benefit? >> I think the key problem we're trying to solve is that for most applications today, when they try to deliver analytics, what they're really delivering is usually a static representation of some data, some answers, and some insights that are created by someone else.
So usually the company would present, you know, if you think about it, if you go to your banking application, they usually show some pretty charts for you, and then it sparks your curiosity about your credit card transactions or your banking transactions over the last month. Naturally, for me, I would then want to click in and ask the next question: which transactions fall into this category, at what time, you know, change the categories a bit. Usually, you're stuck. So what happens with most applications? The challenge is, because someone else is asking the questions and the user is just consuming static insights, you whet their appetite and you don't satisfy it. So application users typically get stunted; they're not satisfied, and then they leave the application. Where ThoughtSpot comes in, ThoughtSpot's true differentiation, is our ability to create an interactive curiosity journey with the user. So ThoughtSpot in general, if you buy it standalone, that's the experience that we really stand by. Now you can deliver it in your application, where any user, a business user, untrained, without the help of an analyst, can ask their own questions. So going back to my example, if it's in your banking app and you see some kind of visualization around expense transactions, you can dig in. What about last month? What about last week? Which transactions? Which merchant? You know, all those things. You can continue your curiosity journey, so that the business user and the app user ask their own questions, instead of an analyst who's sitting in the company behind a desk asking your questions for you. >> And that's the outcome that everyone wants. I totally see that, and everyone kind of acknowledges that, but I've got to then ask you, okay, how do you make that happen? Because you've got the developers who have to essentially make that happen, and the cloud is essentially SaaS, right? So you've got a SaaS kind of marketplace here. The apps can be deployed very quickly, but in order to do that, you kind of need self-service, and you've got to have good analytics, right? So self-service, you guys have that. Now on the analytics side, most people have to build their own or use an existing tool, and tools become specialized, you know what I'm saying? So you're in this kind of weird cycle of, "Okay, I've got to deploy and spend resources to build my own, which could be long and tiresome, and/or rely on other tools that could be good, but then I have too many tools, and that creates specialism, kind of silos." These seem to be the trends. Do you agree with that? And if customers have this situation, you guys come in, can you help there? >> Absolutely, absolutely. So, you know, if you think about the two options that you just laid out: you could either roll your own, kind of build your own, and that's really hard. If you think about the analytics industry, it's a $20 to $30 billion industry with a lot of companies that specialize in building analytics, so it's a really tough thing to do. It doesn't really matter how big of a company you are; even if you're a Microsoft or an Amazon, it's really hard to actually build analytics internally. So for a company that tries to do it on their own, hire the talent, and also come up with that interactive experience, most companies fail. So what ends up happening is you go over budget, the time to market ends up taking much longer, and then the experience isn't engaging for the users, and they still end up leaving your app having a bad impression.
Now you can also buy something. There are competitors of ours who offer embedded analytics options as well, but the mainstream paradigm today with embedded analytics is delivering, as we talked about earlier, static visualizations of insights that are created by someone else. So that certainly is an option. You know, where ThoughtSpot Everywhere really stands out above everything else is that our technology is fundamentally built for search and an interactive, cloud-scale data experience that static visualizations today can't really deliver. So you could deliver a static dashboard purchased from one of our competitors, or, if you really want to engage your users, today it's all about self-service, it's all about interactivity, and only ThoughtSpot's architecture can deliver that embedded in a data app for you. >> You know, one of the things I'm really impressed with you guys at ThoughtSpot about is that you see data, as I do, as strategic advantage for companies. People say that as kind of a cliche, or a punchline, some sort of business statement. But when you start getting into new kinds of workflows, that's the intellectual property. If you can enable people, with very little low-code, no-code, or just rolling their own, to get analysis and insights from a platform, you're then creating intellectual property for the company. So this is kind of a new paradigm. And a lot of CIOs that I talk to, or even CISOs on the security side, they kind of want this but maybe can't get there overnight. So if I'm a CIO, Victor, who on my team do I point to to engage with you guys? Like, okay, you sold me on it, I love the vision. This is definitely where we want to go. Who do I bring into the meeting? >> I think in any application, in any company actually, there are usually product leaders and developers that create applications. So, you know, if you're a SaaS company, obviously your core product team would be the right team we want to talk to. If you're a traditional enterprise, you'd be surprised actually how many traditional enterprises that have been around for 50, 100 years, that you might think of as selling a different product, actually have a lot of digital applications and product teams within their company as well. For example, you know, we have a customer that's a big tractor company. You can probably imagine who they might be. They actually have digital applications, built with ThoughtSpot, that they offer to their dealers so the dealers can look at their businesses with the tractors. We also have a big telecom company, for example. You would think about telecom as a whole service, but they have a billing application that they offer to their merchants to track their billing. So what I'm saying is, whether you're a software company where that's your core product, or you're a traditional enterprise that has digital applications underneath to support your core product, there are usually product teams, product leaders, and developers. Those are the ones we want to talk to, and we can help them realize a better vision for the product that they're responsible for. >> I mean, the reality is all applications need analytics, right, at some level. >> Yes. >> Full instrumentation, at a minimum log everything, and then the ability to roll that up. That's where I see people always telling me the challenge seems to be. Okay, I can log everything, but now how do I have a...
And then after the fact, they say, "Give me a report, what's happening?" >> That's right. >> They get stuck. >> They get stuck, 'cause you get that report and, you know, someone else asked that question for you, and you're probably a curious person. I'm a curious person. You always have that next question, and usually, if you're in a company, let's just say you're a CIO, you're probably used to having a team of analysts at your fingertips. So if you have a question or you don't like the report, you can find two people, five people who'll respond to your request. But if you're a business application user, you're sitting there, and I don't know about you, but I don't remember the last time I actually went through and filed a support ticket in my application, or really read detailed documentation describing features in an application. Users like to be self-taught, self-service, and they like to explore on their own. And there's no analyst there, there's no IT guy they can lean on, so if they get a static report of the data, they'll naturally always want to ask more questions, and then they're stuck. So it's that kind of unsatisfying, "I have some curiosity, you sparked my questions, but I can't answer them." That's what I think a lot of companies struggle with. That's why a lot of applications are data intensive but don't deliver any insights. >> It's interesting, and I like this "everywhere" idea, because if you think about what you guys do, applications always start small, right? I mean, applications have got to be built. So your solution really fits for small startups and businesses all the way up to large enterprises, and a large enterprise could have hundreds or thousands of applications, which look like small startups. >> Absolutely, absolutely. You know, that's the great thing about ThoughtSpot Everywhere: it takes the engine of ThoughtSpot that we built over the last eight or nine years and can deliver it in any kind of context. 'Cause nowadays, as opposed to 10, 15, 20 years ago, everything does run in applications these days. We talked about digital transformation at the beginning of the call. That's really what it means: today, the workflows of business are conducted in applications, no matter who you're interacting with. And so we have all these applications. A lot of times, yes, if you have big analytical problems, you can take the data and put it into a different context, like ThoughtSpot's own UI, and do a lot of analytics. But we also understand that a lot of times customers and users like to analyze in the context of the workflow of the application they're actually working in. And in that situation, having the analytics embedded right next to their workflow is something that business users who are less trained, especially, would like to do, right in the context of their business productivity workflow. And so that's where ThoughtSpot Everywhere, I know the terminology is a little self-serving, but ThoughtSpot Everywhere, we think ThoughtSpot could actually be everywhere in your business workflow. >> That's a great value proposition. I'm going to put my skeptic hat on and challenge you and say, okay, I don't want to... Prove it to me, what's in it for me? And how much is it going to cost me, how do I engage? So, you know- >> Yeah. >> What's in it for me as the buyer?
If people want to buy this and use it, how do I get engaged with ThoughtSpot, how much does it cost, and what does the engagement look like? >> So, what's in it for you is easy. If you have data in the cloud and you have an application, you should use ThoughtSpot Everywhere to deliver a much more valuable, interactive experience for your users' data. So that's clear. How do you engage? We have very flexible pricing models. If your data's in the cloud, you can purchase with us, we'll land small, and then grow with your consumption. You know, that's always the kind of thing: "Hey, allow us to prove it to you, right?" We start, and then as users start to consume, you don't really have to pay a big bill until we see the consumption increase. So we have consumption- and data-capacity-based types of pricing models. And you know, one of the real advantages that we have for cloud applications is on the developer side. Honestly, even in the past for ThoughtSpot, we hadn't always made that development experience very easy; you had to embed a relatively heavy product. But the beauty for ThoughtSpot is that, from the beginning, we were designed with a modern API-based architecture. Now, a lot of our BI competitors were designed and developed in the desktop-server era, where everything you embed is very monolithic. But because we have an API-driven architecture, we've invested a lot of time now to wrap it in a seamless developer SDK, plus very easy-to-use REST APIs, plus an interactive portal, to make that development experience really simple. So if you're a developer, you really can get from zero to ThoughtSpot embedded in your data app, often in less than 60 minutes. >> John: Yeah. >> So that's also a great proposition for modern builders: your data's in the cloud, you've got developers with an SDK, and it can get you into an app very quickly. >> All right, so bottom line: if you're in the cloud, you've got to get the data embedded in the apps, data everywhere with ThoughtSpot. >> Yes. >> All right, so let's unpack it a little bit, because I think you just highlighted what I think is the critical factor for companies as they evaluate the plethora of tools they have and figure out how to streamline and be cloud native at scale. You mentioned static, old BI competitors versus the cloud. They also have teams of analysts that can make the executives feel like all of the reports are dynamic, but they're not, they're just static. But look, I know you guys have a relationship with Snowflake, and not to bring them into this but to highlight this: Snowflake disrupted the data warehouse. >> Yes. >> Because they're in the cloud, and they refactored, leveraging cloud scale, to provide really easy, fast value with their product, and the rest is history. They're public, they're worth a lot of money. That's kind of an example of what's coming for every category of companies. There's going to be that. In fact, Jerry Chen, who just gave the keynote here at the event, had a big talk called "Castles In The Cloud": you can build a moat in the cloud with your application if you have the right architecture. >> Absolutely. >> So this is a new thing, and it's almost like beachfront property; whoever gets there first wins the category. >> Exactly, exactly. And we think the timing is right now.
You know, Snowflake, and even earlier, obviously, Redshift, really started the whole cloud data warehouse wave, and now you're seeing Databricks with their Delta Lake trying to get into that swim lane as well. Right now, all of a sudden, all these things that have been brewing in the background of the data architecture are becoming mainstream. We're now seeing even large financial institutions starting to test and think about moving their data into cloud data warehouses. And once you're in the cloud data warehouse, all the benefits of its elasticity and performance can really get realized at the analytics layer. What ThoughtSpot really brings to the table is that we're a search-based paradigm, and when you think about search, it doesn't really matter what kind of search you're doing, it's about digging really deep into a lot of data and delivering interactive performance. Those things, no matter what data architecture we sit on, have always been fundamental to how we build our product. And that translates extremely well when you have your data in a Snowflake or Redshift with billions of rows in the cloud. We're the only company, we think, that can deliver interactive performance on all the data you have in a cloud data warehouse. >> Well, I want to congratulate you guys. I'm really a big fan of the company. I think a lot of companies are misunderstood until they become big, and there's this, "Why didn't everyone else do that search? Wait, I thought they were a search engine?" Being search centric is an architectural philosophy. I know it's a North Star for your company, and that creates value, right? So if you look at, say, Snowflake, Redshift, and Databricks, you mentioned a few of those, you have a couple of things going on. You have multiple personas living well together, the developers and the data people. Normally, they hated each other, right? (giggles) Or maybe they didn't hate each other, but there's conflict, there's always cultural tension between the data people and the developers. Now, you have developers who are becoming data native, if you will, just by embedding that in. So what Snowflake, these guys, are doing is interesting. You can be a developer and program and get great results and have great performance. The developers love Snowflake, they love Databricks, they love Redshift. >> Absolutely. >> And it's not that hard, and the results are powerful. This is a new dynamic. What's your reaction to that? >> Yeah, no, I absolutely believe that. I think part of the beauty of the cloud is, I like your analogy, bringing people together. So being in the cloud, first of all, the data is accessible by everyone, everywhere. You just need a browser and the right permissions, and you can get your data, and different kinds of roles all come together. Best-of-breed tools get blended together through APIs. Everything just becomes a lot more accessible and collaborative, and I know that sounds a little kumbaya, but the great thing about the cloud is it does blur the lines between roles. Everyone can do a little bit of everything, and everyone can access a little bit more of their data and get more value out of it. >> Yeah. >> So all of that, I think, if you talk about digital transformation, you know, that's really at the crux of it.
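To make the developer flow Victor described a moment ago more concrete, here is a minimal sketch of embedding an interactive search experience with ThoughtSpot's Visual Embed SDK. This is an illustrative sketch, not official sample code: the host URL and worksheet GUID are placeholders, and option names may differ by SDK version.

```typescript
import { init, AuthType, SearchEmbed } from '@thoughtspot/visual-embed-sdk';

// Point the SDK at a ThoughtSpot instance (placeholder host). A real app
// would use SSO or trusted authentication rather than AuthType.None.
init({
  thoughtSpotHost: 'https://mycompany.thoughtspot.cloud',
  authType: AuthType.None,
});

// Render an interactive search experience inside an existing <div id="ts-search">,
// scoped to one data source so end users can ask their own questions against it.
const embed = new SearchEmbed('#ts-search', {
  frameParams: { width: '100%', height: '600px' },
  dataSources: ['<worksheet-guid>'], // placeholder: GUID of the worksheet to search
});

embed.render();
```

The point of the sketch is the shape of the integration: a couple of imports, an init call, and a render into a div, which is roughly what a "zero to embedded in under an hour" claim implies.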
>> Yeah, and I think at the end of the day, speed and high-quality applications are the result, and in the speed game, automation being built in on top of data plays a big role in that. It's super valuable, and otherwise people get slowed down. People get kind of angry. Like, I want to go faster, because automation and AI are going to make things go faster on the dev side; certainly with DevOps, cloud's proven that. But if you're an old-school IT department (giggles) or data department, you're talking weeks, not minutes, for results. >> Yes. >> I mean, that's the powerful scale we're talking about here. >> Absolutely. And you know, if you think about it, if it's days to minutes, it sounds like a lot, but think about each question too, 'cause usually questions come in minutes. Every minute you have a new question, and if each one then adds days to your journey, over time that's just amplified; it's just not sustainable. >> Okay- >> So now, in the cloud world, you need to have things delivered on demand as you think about them. >> Yeah, and of course you need the data from a security standpoint as well, and to build that in, especially as people shift left. I've got to ask you, if I'm a customer, I want to just run this by you. You mentioned you have an SDK, and obviously you're talking to developers. So I'm working with ThoughtSpot, I'm the leader of the organization, and I'm like, "Okay, what's the headroom? What's going to happen as the bridge to the future gets built, so I'm going to ride with ThoughtSpot." You mentioned the SDK; how much more can I do to build and wrap around ThoughtSpot? Because obviously, this kind of value proposition is enabling value. >> Yes. >> So I want to build around it. How do I get started, and where does it go? >> Yeah, well, you can get started as easily as starting with our free trial and just playing around with it. And you know, the beauty of the SDK, and what I mean when I talk about how ThoughtSpot is built with an API-driven architecture, is, hey, there's a lot of magic and features built into ThoughtSpot's core product. You could embed all of that into an application if you'd like, or you could use our SDK and our APIs to say, "I just want to embed a couple of visualizations," start with that, and let the users dig into it. You could also embed the whole search feature and allow users to ask their own questions, or you can have different role-based kinds of experiences. So all of that is very flexible and very dynamic. And with the SDK, it's low-code in the sense that it creates a JavaScript portal for you, and even for me, who hasn't coded in a long time, I can just copy and paste some JavaScript code and see my application reflecting the changes in real time. So it's really a modern experience that developers in today's world appreciate, and because all the data's in the cloud, and in the cloud applications are built as services connected through APIs, we really think this is the modern way for developers to get started. And analysts, even analysts who don't have strong developer training, can get started with our developer portal. So really, it's a very easy experience, and you can customize it in whichever way suits your application's needs. >> Yeah, I think you don't have to be a developer to understand the basic value of reuse and discovery of services. I think that's one of the things we hear from developers all the time: "I had no idea that Victor did that code. Why do I have to rewrite that?"
So you see reuse come up a lot around automation, where code is built with code, right? So you have this new vibe, and you need data to discover; that's the search paradigm mindset. How prevalent is that on the minds of customers? Are they just trying to hold on and survive through the pandemic? (giggles) >> Well, customers are definitely thinking about it. You know, the challenge is change is always hard. So it takes time for people to see the possibilities and then work through it, especially in larger organizations, but even in smaller organizations. People think about, "Well, how do I change my workflow?" and then, "How do I change my data pipeline?" You know, those are the kinds of things that take time, and that's why Redshift has been around since 2012, I believe, but it took years before enterprises were really saying, "The benefits are so profound that we really have to change the workflows, change the data pipelines to make it work, because we can't hold on to the old ways." So it takes time, but when the benefits are so clear, it's really kind of a snowball effect, you know? Once you change a data warehouse, you've got to think about, "Do I need to change my application architecture?" Then, "Do I need to change the analytics layer?" And then, "Do I need to change the workflow?" And then you start seeing new possibilities, because it's all more flexible, and you can add more features to your application. It's kind of a virtuous cycle, but it starts with taking that first step, to your point, of considering migrating your data into the cloud, and we're seeing that across all kinds of industries now. I think nobody's holding back anymore. It just takes time; some are slower and some are faster. >> Well, all apps are data apps, and it's interesting, I wrote a blog post in 2017 called "Data Is The New Developer Kit," meaning it was a vision statement that data will be part of how apps get built; like software, it'll be data as code. And you guys are doing that. You're allowing data to be a key ingredient for interactivity with analytics. This is really important. Can you just give us a use case example of how someone builds an interactive data app with ThoughtSpot Everywhere? >> Yeah, absolutely. So there are certain applications that naturally relate to data, you know, I talked about banking and those kinds of things. When you use them, you just kind of inherently know, "Hey, there's tons of data here, can I get some?" But a lot of times we're seeing, for example, one of our customers is a very small company that provides software for personal trainers and small fitness studios. You would think, "Oh well, these are small businesses. They don't have a ton of data. A lot of them would probably just run on QuickBooks or Excel and all of that." But they could see the value: once a personal trainer conducts his business on cloud software, he'll realize, "Oh, I don't need to download any more data. I don't need to run Excel anymore; the data is already there in the software." And hey, on top of that, wouldn't it be great to have an analytics layer that can analyze how your clients paid you, where your appointments are, and so forth? And that's, again like I said, no disrespect to personal trainers, but even for one or two personal trainers, hey, they can have analytics and be analysts on their own business data. >> Yeah, why not?
Everyone's got Fitbits and watches, and they could have that built into their studio APIs for the trainers. They can get collaboration. >> That's right. So there's no application you can think of that's too simple, or that you might think is too traditional, for analytics. Every application now can become a very engaging data application. >> Well, Victor, it's great to have you on. Obviously, a great conversation around ThoughtSpot Everywhere. And as someone who runs corp dev for ThoughtSpot, for the folks watching that aren't ThoughtSpot customers yet, what should they know about you guys as a company that they might not know, and what are people saying about ThoughtSpot? >> So a couple of things. One is there's a lot of FUD out there. Search is generally a very broad term, but I go back to what I was saying earlier: what really differentiates ThoughtSpot is not just that we have a search bar put on some kind of analytics UI. Really, it's the fundamental technical architecture underlying that, built from the ground up for searching large data and for granular, detailed exploration of your data. That makes us truly unique, and nobody else can really do search without that technical foundation. The second thing is, we're very much a cloud-first company now, and over the past few years, because of the growth of these high-performing data warehouses like Snowflake and Redshift, we've been able to really focus on what we do best, which is the search and query processing performance on the front end, and we're fully engaged with the cloud platforms now. So if you have data in the cloud, we are the best analytics front end for that. >> Awesome, well, thanks for coming on. Great to feature you guys here in the "Startup Showcase." Great conversation. ThoughtSpot, leading company, hot startup; we did their event with theCUBE a couple of months ago. Congratulations on all your success. Victor Chang, VP of ThoughtSpot Everywhere and Corporate Development, here on theCUBE and the "AWS Startup Showcase." Go to awsstartups.com and be part of the community; we're doing these quarterly, featuring the hottest startups in the cloud. I'm John Furrier, thanks for watching. >> Victor: Thank you so much. (bright music)
Isha Sharma, Dremio | CUBE Conversation | March 2021
>> Well, welcome to this special CUBE Conversation. I'm John Furrier with theCUBE, your host. We're here with Dremio and Isha Sharma, Director of Product Management for Dremio. We're going to talk about data, data lakes, the future of data, and how it works with cloud and the new applications. Isha, thanks for joining me. >> Thank you for having me, John. >> You guys are a cutting-edge startup. You've got a lot of good action going on. You're kind of the new guard, as Andy Jassy at AWS always talks about it: the old guard incumbents versus the new breed. You guys are doing the new stuff around data lakes and also making data accessible for customers. Uh, what is that all about? Take us through: what is Dremio? >> So Dremio is the data lake service that essentially allows you to very simply run SQL queries directly on your data lake storage, without having to make any of those copies that everybody's going on about all the time. So you're really able to get that fast time to value without having the long process of, "Let me put in a request to my data team, let's make all of those copies," and then finally getting this very reduced scope of your data, and still having to go back to your data team every time you need a change to it. So Dremio is bringing you that fast time to value with a no-copy data strategy, and really providing you the flexibility to keep your data in your data lake storage as the single source of truth. >> You know, over the past 10 years we've watched, with theCUBE's coverage since we've been doing this program, and in the community, following from the early days of Hadoop to now. We've seen the trials and tribulations of ETL and data warehousing. We've seen the starts and stops, and we've seen that the most successful formula has been store everything. Um, and then, you know, ease of use became a challenge. I don't want to have to hire really high-powered engineers to manage certain kinds of clusters. Now cloud comes into the mix, and I've got on-premise storage, but the notion of a data lake became hugely popular because it became a phrase that meant store everything, and it meant different things to different people. And since then, teams of people have been hired to be the data teams. So it's kind of new. So I've got to ask you: what is the challenge of these data teams? What do they look like? What's the psychology going on with some of the people on these teams? What problems are they solving, what's going on? Because, you know, they're becoming data-full. Take us through what's going on with data teams. >> To your point, the volumes and the variety of data are growing exponentially every day; there's really no end to it, right? And companies are looking to get their hands on as much data as they possibly can. So that puts data teams in a position of: how do I provide access to as many users as easily as possible, that self-service experience for data? Um, and data democratization, as much of a great concept as it is in theory, comes with its own challenges, in terms of all of those copies that end up being created to provide the quote-unquote self-service experience. And then with all of these copies comes the cost to store all of them. You've just added a tremendous amount of complexity and delayed your time to value significantly. >> You mentioned self-service; it's one of those things that seems like a moving train.
Everyone I talk to is like, "Oh, self-service is the holy grail, we've got to get to self-service." And then you get to some self-service, and then you've got to rethink it, 'cause more stuff's changing. So I have to ask, in that capacity: you've got data architects, and you've got analysts, the customers of the data. What's the relationship between those two? Who gives and who gets? Who drives it? Who leans in? Does the analyst feed the requirements to the architect, who sets up the boundaries? How is that relationship? Can you take us through how you guys view the relationship between the data architect and the data analyst? >> Sure. So you have the data architect, the data team, that's actually responsible for providing data access at the end of the day, right? They're the people that have the data democratization requirement on them. And so they've created these copies, a tremendous amount of copies. A lot of the time the data lake storage is that source of truth, but you're copying your data into a data warehouse. And then, because your end users, your analysts, all want different types of data, different views of this data, there's a tremendous amount of personalized copies that the architects end up creating. And then on top of it, there's performance. We need to get everything back in a timely manner, otherwise what's the point, right? Real-time analytics. So there are all these performance-related copies, whether that be aggregation tables or, you know, BI extracts and cubes, all of that fun stuff. And so the architect is the one responsible for creating all of those. That's what they have to do to provide access to the analyst. And then, like I'm saying, when we need an update to a data set, when I discover that I have a new data set that I need to join with an existing one, the analyst goes to the data architect and says, "Hey, by the way, I need this new data set. Can you make this usable for me? Or can you provide me access?" And then the data architect has to process that request. And so again, coming back to all these copies that have been created, um, the data architect goes through a tremendous amount of work and has to do this over and over again to actually make the data available to the analyst. It's a cycle that goes on between the two. >> Yeah, it's an interesting dynamic. It's a power dynamic, but it's also trying to get to the innovation. I've got to ask you: some people are saying that data copies are the major obstacle for democratization. How do you respond to that? What's your view? >> They absolutely are. Data copies are the complete opposite of data democratization. There's no aspect of self-service there, which is exactly what you're looking for with data democratization. Um, because of those copies, how do you manage them? How do you govern them? And, uh, like I was saying, when somebody needs a new data set or an update to one, they have to go back to that data team, and there goes the self-service. Actually, data copies create a bottleneck, because it all comes back to that data team that has to keep working through those requests coming in from their analysts. So, uh, data copies and data democratization are completely at odds. >> You know, I remember talking to Dave Vellante at a CUBE event two years ago; he said infrastructure as code was the big DevOps movement, and we felt that DataOps would be something similar, data as code, where you don't have to think about it. So you're kind of getting to this idea that copies are bad because they hold back that self-service. This modern era is looking for more programmability with data. Kind of what you're teasing out here is that that's the modern architecture. Is that how you see it? How do you see, uh, a modern data architecture? >> Yeah, so the data architecture has evolved significantly in the last several years, right? We started with traditional data warehouses and the traditional data lake with Hadoop, where storage and compute were tightly coupled. Then we moved on to cloud data warehouses, where there was a separation of compute and storage, and that provided a little more flexibility. But now, with the modern data architecture of cloud data lakes, you have this aspect of separating not only storage and compute, but also compute and data. So that creates a separate tier for data altogether. What does that look like? You have your data in your data lake storage, S3, ADLS, whatever it may be, and of course it's in an open format. And on top of that, thanks to technologies like Apache Iceberg and Delta Lake, there's this ability to give your files, your data, a table structure. And so that starts to bring the capabilities that a data warehouse was providing to the data lake. Thanks to these, you have the ability to do transactions, record-level mutations, versioning, things that were missing completely from a data lake architecture before. And so introducing that data tier, having that separation of compute and data, really accelerates the ability to get that time to value, because you're keeping your data in the data lake storage at the end of the day.
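As a rough illustration of what running SQL directly on lake storage looks like from an application, here is a sketch that submits a query to Dremio's REST API and polls the resulting job. This is an assumption-heavy sketch: the host, token, table path, and the exact v3 endpoint and response shapes are placeholders that vary by Dremio edition and version.

```typescript
// Sketch: submit SQL to Dremio over REST and poll the job for results.
// Assumes a v3-style API and bearer-token auth; both vary by Dremio version.
const DREMIO = 'https://dremio.example.com';  // placeholder host
const TOKEN = process.env.DREMIO_TOKEN ?? ''; // placeholder access token
const headers = { Authorization: `Bearer ${TOKEN}`, 'Content-Type': 'application/json' };

async function runQuery(sql: string): Promise<unknown> {
  // Dremio returns a job id rather than blocking on the query.
  const submit = await fetch(`${DREMIO}/api/v3/sql`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ sql }),
  });
  const { id } = (await submit.json()) as { id: string };

  // Poll until the job finishes.
  for (;;) {
    const res = await fetch(`${DREMIO}/api/v3/job/${id}`, { headers });
    const { jobState } = (await res.json()) as { jobState: string };
    if (jobState === 'COMPLETED') break;
    if (jobState === 'FAILED' || jobState === 'CANCELED') throw new Error(jobState);
    await new Promise((r) => setTimeout(r, 500));
  }

  const results = await fetch(`${DREMIO}/api/v3/job/${id}/results`, { headers });
  return results.json();
}

// Query an Iceberg- or Delta-backed table in lake storage as if it were a
// warehouse table (hypothetical table path).
runQuery('SELECT region, SUM(amount) FROM s3.sales.orders GROUP BY region').then(console.log);
```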
And we felt that data ops would be something similar where data as code, where you didn't have to think about it. So you're kind of getting to this idea of, you know, copies are bad because it doesn't, it holds back that self-service this modern error is looking for more of programmability with data. Kind of what you're teasing out here is that's the modern architecture. Is that how you see it? How do, how do you see, uh, a, uh, a modern data architecture? >>Yeah, so the modern data or the data architecture has evolved significantly in the last several years, right? We started with traditional data warehouses and the traditional data Lake with Duke where the storage and compute were totally tightly coupled. And then we moved on to cloud data warehouses, where there was a separation of compute and storage, and that provided a little more flexibility there. But then with the modern data architecture now with cloud data lakes, you have this aspect of separating, not only storage and compute, but also compute data. So that creates a separate tier for data altogether. What does that look like? So you have your data and your feeling storage as three ATLs, whatever it may be. And on top of that. So of course it's an open format, right? And so on top of that, thanks to technology. It's like Apache iceberg and Delta Lake. There's this ability to give your files, your data, a table structure. And so that starts to bring the capabilities that a data warehouse was providing the data. Thanks to these. You have the ability to do transactions, record level mutations, burgeoning things that were missing completely from a data Lake architecture before. And so, um, introducing that, that data to your, having that separation of compute and data really, really accelerate the ability to get that time to value because you're keeping your data in the data Lake storage at the end of the day. >>And it's interesting, you see all the hot companies tend to be, have that kind of mindset and architecture, and it's creating new opportunities as a ton of white space. So I have to kind of ask you guys, how does Dremio fit into this because you guys are playing in this kind of the new wave here with data it's growing extremely, it's moving fast. You got, again, edge is developing more. Data's coming in at the edge. You've got hybrid testing multi-cloud environments on the horizon. I mean this ultimate multicloud, but I mean, data in real time across multiple clouds is the next kind of area people are focused on. What does, what's the role of GMU and all this to take, take us through that. >>Yeah. So Dremio provides, again, like I said, this data Lake service, and we're all referring to just storage or Hadoop. When we say data Lake, we're talking about an entire solution. Um, so you keep your data, you keep your data in your data, Lake orange. And then on top of that, with the integrations that Dremio has with Apache iceberg and Delta, like we do provide that data here that I was talking about. And so you've given your data, this table structure, and now you can operate on it like you would in a data warehouse. So there's really no need to move your data from a data Lake data warehouse, again, keeping that data Lake as that source of truth. 
And then on top of that, um, when we talk about copies, personalized copies, performance related copies, you, you really, like I was saying, you've created so much complexity with Jeremy of you don't do that when it comes to personalized copies, we've got the semantic layer and that's a very key aspect of Dremio where you can provide as many views of, of data that you want without having to make any copies. So it really accelerates that, that data democratization story, and then when it, >>So it's the no cop, my strategy trim, you guys are on it, but you're about no copy keeps semantic layer, have that be horizontal across whatever environment and just applications have, can applications tap into this, or how do you guys integrate into apps if I'm an app developer, for instance, how does that work? >>Of course. So that's, that's one of the most important use cases in the sense that when there's an application or even when it's a, you know, a BI client or some other tool that's tapping into the data in S3 or ATLs, a lot of people see performance degradation. Typically with the Dremio, that's not the case we've got, Aeroflight integrated into Tremino, it's a key component as well. And that puts so much, uh, it, so put so much ease in terms of running dashboards off of that, running your analytics apps off of that, because that replay can deliver 20 times the performance that PIO DBC could. So coming back to the no data strategy or note copy data strategy, there's no those local copies anymore that you needed to make. >>So one of the things I got to ask you is, cause this comes up all the time. So she had less pass re-invent. I notice again, Amazon was, I was banging on this hard Azure as well on their side too. Their whole thing is we want to take the AI environment and make it so that people can normal people can use it and deploy machine learning. The same thing kind of comes down into this layer where you're talking about is this democratization is a huge trend because you don't have to be super peaked, you know, math, PhD, data scientist, or ETL, or data Wrangler. You just want to actually code the data or play party with the data in any way you want to do with it. So, so the question I have is is that that's certainly a great trend and no one debates that, but the reality is people are storing data, like almost hoarding it, just throw it in a data Lake and we'll deal with them later. How does you guys solve that problem? Because once that starts happening, do you have to hire someone super smart to dig that out or rearchitected or because that seems to be kind of the pattern, right? You know, throw everything into data Lake, uh, and we'll deal with it later >>Called the data swamp. And it's like, no one knows what's going on. >>Of course though, you don't actually want to throw everything into a data Lake. There still needs to be a certain amount of structure that all of this lands in. You want it to live in one place, but have still a little bit of structure so that, um, Dremio and other are, are much more enabled to query that with fantastic performance. So there's, there's still some amount of structure that needs to happen at a data Lake level, but from, uh, that semantic layer that we have with during the, you you're, you're creating structure for your end user, >>How would you advise, how would you advise someone who wants to hedge their future and not take on too much technical debt, but says, Hey, you know, I do have the store. 
Is there a best practice on kind of some guard rails around getting going, how do you, how do you advise your customers who want to get it going? >>So how we advise our customers is again, plugin put your, put your data in that data Lake. A lot of them already have three TLS in place. And getting started with Bermeo is really easy. I would say I did it for the first time and it took a matter of minutes if not less. And so what you're doing with Dremio is connecting data directly to that data source and then creating a semantic layer on top. So you bring together a bunch of data. That's sitting in your data Lake, you know, if that sales data and Sophia, and we give you a really streamlined way to say together, the, you know, last, however, we go back in time, create a view on top of all of that. If you have that structured in folders as great, we will provide you a way to create one view on top of all of that, as opposed to having a view for every day or whatnot. And so again, that semantic layer really comes in handy when you're trying to, as the architect provide access to this data Lake. And then as the user who just, just interacts with the data as, as the views are provided to them, there's really, uh, there's a whole lot of transparency there, and it's really easy to get up and running with drumming. >>I'm looking forward to it. I got to finally ask the question is how do I get started? How do people engage with you guys? Is it, is it a freemium? Is it a cloud service? What's the requirements? What are some of the ways that people can engage and work with you guys? >>Yeah, so we get started, uh, on our website at dot com. And speaking of self-service, we've got a virtual lab at dremio.com/labs that you can get started with that gives you a product tour and even gives you a getting started, walk through the tissue through your first query so that you can see how well it works. And in addition to that, we've got a free trial of Dremio available on AWS marketplace. >>Awesome. Net marketplace is a good place to download stuff. So can I ask you a personal question, Isha? Um, you're the director of product management. You get to see inside the kitchen where everyone's making the, making the product. You also got the customer relationships out there looking at product market fit, as it evolves, customer's requirements evolve. What's some of the cool things that you've seen in this space. That's just interesting to you that either you kind of expected or maybe some surprises, what's the coolest thing you've seen come out of this new data environment we're living in. >>I think just the ability to the way things have evolved, right? It used to be data Lake or data warehouse, and you pick one, you probably have both, but you're not like reaching either to their highest potential. Now you've got, this is coming together of both of them. I think it's been fantastic to see how you've got technology is like a iceberg and Delta Lake and bringing those two things together. And you know, you're in your data Lake and it's great in terms of cost and storage and all of that. But now you're able to have so much flexibility in terms of some of those data warehouse capabilities. And on top of that with technologies like Dremio, and just in general, this open format concept, you're, you're never locked in with a particular vendor with a particular format. You're not locking yourself out of a technology that you don't even know exists yet. And thinking in the past, you were always going to end up there. 
You always ended up putting your data in something where it was going to be difficult to change it, to get it out. But now you have so much flexibility with the open architecture that's coming. What's the DNA like of the >>Culture at Treme. And obviously you've got a cutting edge. We're in a big, hot wave data. You're enabling a lot of value. Uh, what's the, what's it like there at Jemena? What do you guys strive for? What's the purpose? What's the, what's the DNA of the culture. >>There's a lot of excitement in terms of getting customers to this flexibility, to get them out of things they're locked into really in providing them with accessibility to their data, right? This data access data democratization concept to make that actually happen so that, you know, time to value is a key thing. You want to derive insights out of your, out of your data. And everybody, I drove you in super excited and charging towards that, >>Unlocking that value. That's awesome. Aisha, thank you for coming on the cube conversation. Great to see you. Thanks for coming on. Appreciate it. He's just Sharma director of product management. Dremio here inside the cube. I'm John for your host. Thanks for watching.
Breaking Analysis: How Snowflake Plans to Change a Flawed Data Warehouse Model
>> From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante. >> Snowflake is not going to grow into its valuation by stealing the croissant from the breakfast table of the on-prem data warehouse vendors. Look, even if Snowflake got 100% of the data warehouse business, it wouldn't come close to justifying its market cap. Rather, Snowflake has to create an entirely new market based on completely changing the way organizations think about monetizing data. Every organization I talk to says it wants to be, or many say they already are, data-driven. Why wouldn't you aspire to that goal? There's probably nothing more strategic than leveraging data to power your digital business and create competitive advantage. But many businesses are failing, or I predict will fail, to create a true data-driven culture, because they're relying on a flawed architectural model formed by decades of building centralized data platforms. Welcome everyone to this week's Wikibon Cube Insights powered by ETR. In this Breaking Analysis, I want to share some new thoughts and fresh ETR data on how organizations can transform their businesses through data by reinventing their data architectures. And I want to share our thoughts on why we think Snowflake is currently in a very strong position to lead this effort. Now, on November 17th, theCUBE is hosting the Snowflake Data Cloud Summit. Snowflake's ascendancy and its blockbuster IPO have been widely covered by us and many others. Now, since Snowflake went public, we've been inundated with outreach from investors, customers, and competitors that wanted to either better understand the opportunities or explain why their approach is better or different. And in this segment, ahead of Snowflake's big event, we want to share some of what we learned and how we see it. Now, theCUBE is getting paid to host this event, so I need you to know that, and you draw your own conclusions from my remarks. But neither Snowflake nor any other sponsor of theCUBE or client of SiliconANGLE Media has editorial influence over Breaking Analysis. The opinions here are mine, and I would encourage you to read my ethics statement in this regard. I want to talk about the failed data model. The problem is complex, I'm not debating that. Organizations have to integrate data and platforms with existing operational systems, many of which were developed decades ago. And there's a culture and a set of processes that have been built around these systems, and they've been hardened over the years. This chart here tries to depict the progression of the monolithic data source, which, for me, began in the 1980s, when Decision Support Systems, or DSS, promised to solve our data problems. The data warehouse became very popular, and data marts sprung up all over the place. This created more proprietary stovepipes with data locked inside. The Enron collapse led to Sarbanes-Oxley; this tightened up reporting, and the requirements associated with it breathed new life into the data warehouse model. But it remained expensive and cumbersome, I've talked about that a lot, like a snake swallowing a basketball. The 2010s ushered in the big data movement, and data lakes emerged. With Hadoop, we saw the idea of no schema on write, where you put structured and unstructured data into a repository and figure it all out on the read.
What emerged was a fairly complex data pipeline that involved ingesting, cleaning, processing, analyzing, preparing, and ultimately serving data to the lines of business. And this is where we are today, with very hyper-specialized roles around data engineering, data quality, and data science. There's lots of batch processing going on, and Spark has emerged to improve the complexity associated with MapReduce; it definitely helped improve the situation. We're also seeing attempts to blend in real-time stream processing with the emergence of tools like Kafka and others. But I'll argue that, in a strange way, these innovations actually compound the problem. And I want to discuss that, because what they do is heighten the need for more specialization, more fragmentation, and more stovepipes within the data life cycle. Now, in reality, and it pains me to say this, the outcome of the big data movement, as we sit here in 2020, is that we've created thousands of complicated science projects that have once again failed to live up to the promise of rapid, cost-effective time to insights. So, what will the 2020s bring? What's the next silver bullet? You hear terms like the lakehouse, which Databricks is trying to popularize, and I'm going to talk today about data mesh. These are efforts that look to modernize data lakes and sometimes merge the best of the data warehouse and second-generation systems into a new paradigm that might unify batch and stream frameworks. And this definitely addresses some of the gaps, but in our view, it still suffers from some of the underlying problems of previous-generation data architectures. In other words, if the next-gen data architecture is incremental, centralized, rigid, and primarily focused on making the technology to get data in and out of the pipeline work, we predict it's going to fail to live up to expectations again. Rather, what we're envisioning is an architecture based on the principles of distributed data, where domain knowledge is the primary target citizen, and data is not seen as a by-product, i.e., the exhaust of an operational system, but rather as a service that can be delivered in multiple forms and use cases across an ecosystem. This is why we often say the data is not the new oil. We don't like that phrase. A specific gallon of oil can either fuel my home or lubricate my car engine, but it can't do both. Data does not follow the same laws of scarcity as natural resources. Again, what we're envisioning is a rethinking of the data pipeline and the associated cultures, to put the data needs of the domain owner at the core and provide automated, governed, and secure access to data as a service at scale. Now, how is this different? Let's take a look and unpack the data pipeline today and look deeper into the situation. You all know this picture that I'm showing. There's nothing really new here. The data comes from inside and outside the enterprise. It gets processed, cleansed, or augmented so that it can be trusted and made useful. Nobody wants to use data that they can't trust. And then we can add machine intelligence and do more analysis, and finally deliver the data so that domain-specific consumers can essentially build data products and services, or reports and dashboards, or content services, for instance, an insurance policy, a financial product, a loan. These are packaged and made available for someone to make decisions on or to make a purchase. And all the metadata associated with this data is packaged along with the dataset.
Now, we've broken down these steps into atomic components over time so we can optimize on each and make them as efficient as possible. And down below, you have these happy stick figures. Sometimes they're happy. But they're highly specialized individuals, and they each do their job, and they do it well, to make sure that the data gets in, gets processed, and gets delivered in a timely manner. Now, while these individual pieces seemingly are autonomous and can be optimized and scaled, they're all encompassed within the centralized big data platform. And it's generally accepted that this platform is domain agnostic. Meaning the platform is the data owner, not the domain-specific experts. Now, there are a number of problems with this model. The first: while it's fine for organizations with a smaller number of domains, organizations with a large number of data sources and complex domain structures struggle to create a common data parlance, for example, in a data culture. Another problem is that, as the number of data sources grows, organizing and harmonizing them in a centralized platform becomes increasingly difficult, because the context of the domain and the line of business gets lost. Moreover, as ecosystems grow and you add more data, the processes associated with the centralized platform tend to get further genericized. They again lose that domain-specific context. Wait (chuckling), there are more problems. Now, while in theory organizations are optimizing on the piece parts of the pipeline, the reality is, as the domain requires a change, for example, a new data source, or an ecosystem partnership requires a change in access or processes that can benefit a domain consumer, the change is subservient to the dependencies and the need to synchronize across these discrete parts of the pipeline, or actually, orthogonal to each of those parts. In other words, in actuality, the monolithic data platform itself remains the most granular part of the system. Now, when I complain about this faulty structure, some folks tell me this problem has been solved. That there are services that allow new data sources to really easily be added. A good example of this is Databricks Ingest, an auto loader that simplifies ingestion into the company's Delta Lake offering. And rather than centralizing in a data warehouse, which struggles to efficiently allow things like machine learning frameworks to be incorporated, this feature allows you to put all the data into a centralized data lake. The problem that I see with this, the argument goes, is that while the approach definitely minimizes the complexities of adding new data sources, it still relies on this linear, end-to-end process that slows down the introduction of data sources from the domain consumer's side of the pipeline. In other words, the domain expert still has to elbow her way to the front of the line, or the pipeline in this case, to get stuff done. And finally, the way we're organizing teams is a point of contention, and I believe it's going to continue to cause problems down the road. Specifically, we've again optimized on technology expertise, where, for example, data engineers, while really good at what they do, are often removed from the operations of the business. Essentially, we've created more silos and organized around technical expertise versus domain knowledge.
Back to the organizational point: a data team has to work with data that is delivered with very little domain specificity, and yet serves a variety of highly specialized consumption use cases. All right. I want to step back for a minute and talk about some of the problems that people bring up with Snowflake, and then I'll relate it back to the basic premise here. As I said earlier, we've been hammered by dozens and dozens of data points, opinions, and criticisms of Snowflake. I'll share a few here, and I'll post a deeper technical analysis from a software engineer that I found to be fairly balanced. There are five Snowflake criticisms that I'll highlight. There are many more, but here are some that I want to call out. Price transparency. I've had more than a few customers tell me they chose an alternative database because of the unpredictable nature of Snowflake's pricing model. Snowflake, as you probably know, prices based on consumption, just like AWS and other cloud providers. So just like AWS, for example, the bill at the end of the month is sometimes unpredictable. Is this a problem? Yes. But like AWS, I would say, "Kill me with that problem." Look, if users are creating value by using Snowflake, then that's good for the business. But clearly this is a sore point for some users, especially for procurement and finance, which don't like unpredictability, and Snowflake needs to do a better job communicating and managing this issue with tooling that can predict and help better manage costs. Next, workload management, or the lack thereof. Look, if you want to isolate higher-performance workloads with Snowflake, you just spin up a separate virtual warehouse (see the sketch below). It's kind of a brute-force approach. It generally works, but it will add expense. I'm kind of reminded of Pure Storage and its approach to storage management. The engineers at Pure always design for simplicity, and this is the approach that Snowflake is taking. The difference between Pure and Snowflake, as I'll discuss in a moment, is that Pure's ascendancy was based largely on stealing share from legacy EMC systems. Snowflake, in my view, has a much, much larger incremental market opportunity. Next is the caching architecture. You hear this a lot. At the end of the day, Snowflake is based on a caching architecture, and a caching architecture has to be working for some time to optimize performance. Caches work well when the size of the working set is small. Caches generally don't work well when the working set is very, very large. In general, transactional databases have pretty small datasets, and in general, analytics datasets are potentially much larger. Is Snowflake in the analytics business? Yes. But the good thing Snowflake has done is enable data sharing, and its caching architecture serves its customers well because it allows domain experts, you're going to hear this a lot from me today, to isolate and analyze problems or go after opportunities based on tactical needs. That said, very big queries across whole datasets, or badly written queries that scan the entire database, are not the sweet spot for Snowflake. Another good example would be if you're doing a large audit and you need to analyze a huge, huge dataset. Snowflake's probably not the best solution. Complex joins, you hear this a lot. The working sets of complex joins are, by definition, larger. So, see my previous explanation. Read-only. Snowflake is pretty much optimized for read-only data. Maybe stateless data is a better way of thinking about this.
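On the workload management point above, the separate-warehouse pattern looks roughly like this. It's a hedged sketch using the snowflake-connector-python package, where the account, credentials, warehouse name, and table are all hypothetical.

```python
# Sketch of Snowflake's brute-force workload isolation: dedicate a separate
# virtual warehouse to a heavy workload. All names and credentials are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account", user="example_user", password="...")
cur = conn.cursor()

# A dedicated warehouse isolates this workload from everyone else's,
# at the cost of paying for extra compute while it runs.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS reporting_wh
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
""")
cur.execute("USE WAREHOUSE reporting_wh")
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
```

It works, but every isolated workload is another warehouse on the bill, which is exactly the expense trade-off noted above.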
Heavily write-intensive workloads are not the wheelhouse of Snowflake. So where this is maybe an issue is real-time decision-making and AI inferencing. As I've said a number of times, Snowflake might be able to develop products or acquire technology to address this opportunity. Now, I want to explain. These issues would be problematic if Snowflake were just a data warehouse vendor. If that were the case, this company, in my opinion, would hit a wall, just like the MPP vendors that preceded them hit a wall after building a better mousetrap for certain use cases. Rather, my premise in this episode is that the future of data architectures will really be to move away from large centralized warehouse or data lake models to a highly distributed data-sharing system that puts power in the hands of domain experts at the line of business. Snowflake is less computationally efficient and less optimized for classic data warehouse work, but it's designed to serve the domain user much more effectively, in our view. We believe that Snowflake is optimizing for business effectiveness, essentially. And as I said before, the company can probably do a better job of keeping passionate end users from breaking the bank. But as long as these end users are making money for their companies, I don't think this is going to be a problem. Let's look at the attributes of what we're proposing around this new architecture. We believe we'll see the emergence of a total flip of the centralized and monolithic big data systems that we've known for decades. In this architecture, data is owned by domain-specific business leaders, not technologists. Today, it's not much different in most organizations than it was 20 years ago. If I want to create something of value that requires data, I need to cajole, beg, or bribe the technology and data teams to accommodate me. The data consumers are subservient to the data pipeline. Whereas in the future, we see the pipeline as a second-class citizen, while the domain expert is elevated. In other words, getting the technology and the components of the pipeline to be more efficient is not the key outcome. Rather, the time it takes to envision, create, and monetize a data service is the primary measure. The data teams are cross-functional and live inside the domain, versus today's structure where the data team is largely disconnected from the domain consumer. Data in this model, as I said, is not the exhaust coming out of an operational system or an external source that is treated as generic and stuffed into a big data platform. Rather, it's a key ingredient of a service that is domain-driven and monetizable. And the target system is not a warehouse or a lake. It's a collection of connected, domain-specific datasets that live in a global mesh. What is a distributed global data mesh? A data mesh is a decentralized architecture that is domain aware. The datasets in the system are purposely designed to support a data service or data product, if you prefer. The ownership of the data resides with the domain experts, because they have the most detailed knowledge of the data requirement and its end use. Data in this global mesh is governed and secured, and every user in the mesh can have access to any dataset as long as it's governed according to the edicts of the organization (see the sketch below for what that kind of governed sharing can look like). Now, in this model, the domain expert has access to a self-service and abstracted infrastructure layer that is supported by a cross-functional technology team.
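Here's a hedged sketch of what that governed sharing can look like in practice, using Snowflake's secure data sharing through snowflake-connector-python; the share, database, schema, table, and account names are all hypothetical.

```python
# Sketch of governed, domain-owned data sharing via Snowflake secure shares.
# All object and account names below are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account", user="domain_owner", password="...")
cur = conn.cursor()

# The domain owner publishes a purpose-built dataset as a share...
cur.execute("CREATE SHARE IF NOT EXISTS claims_share")
cur.execute("GRANT USAGE ON DATABASE claims_db TO SHARE claims_share")
cur.execute("GRANT USAGE ON SCHEMA claims_db.curated TO SHARE claims_share")
cur.execute("GRANT SELECT ON TABLE claims_db.curated.policies TO SHARE claims_share")

# ...and governance decides which accounts (other domains, partners) may consume it.
cur.execute("ALTER SHARE claims_share ADD ACCOUNTS = partner_account")
```

The point is that access is granted to a governed, domain-shaped dataset, not to the underlying pipeline.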
Again, the primary measure of success is the time it takes to conceive and deliver a data service that can be monetized. Now, by monetize, we mean a data product or data service that either cuts cost, drives revenue, or saves lives, whatever the mission of the organization is. The power of this model is that it accelerates the creation of value by putting authority in the hands of those individuals who are closest to the customer and have the most intimate knowledge of how to monetize data. It reduces the diseconomies of scale of having a centralized or monolithic data architecture. And it scales much better than legacy approaches, because the atomic unit is a data domain, not a monolithic warehouse or lake. Zhamak Dehghani is a software engineer who is attempting to popularize the concept of a global mesh. Her work is outstanding, and it's strengthened our belief that practitioners see this the same way that we do. And to paraphrase her view, a domain-centric system must be secure and governed with standard policies across domains. It has to be trusted. As I said, nobody's going to use data they don't trust. It's got to be discoverable via a data catalog with rich metadata. The datasets have to be self-describing and designed for self-service. Accessibility for all users is crucial, as is interoperability, without which distributed systems, as we know, fail. So what does this all have to do with Snowflake? As I said, Snowflake is not just a data warehouse. In our view, it's always had the potential to be more. Our assessment is that attacking the data warehouse use case gave Snowflake a straightforward, easy-to-understand narrative that allowed it to get a foothold in the market. Data warehouses are notoriously expensive, cumbersome, and resource intensive, but they're a critical aspect of reporting and analytics. So it was logical for Snowflake to target on-premises legacy data warehouses and their smaller cousins, the data lakes, as early use cases. By putting forth and demonstrating a simple data warehouse alternative that can be spun up quickly, Snowflake was able to gain traction, demonstrate repeatability, and attract the capital necessary to scale to its vision. This chart shows the three layers of Snowflake's architecture that have been well documented: the separation of compute and storage, and the outer layer of cloud services. But I want to call your attention to the bottom part of the chart, the so-called cloud-agnostic layer that Snowflake introduced in 2018. This layer is somewhat misunderstood. Not only did Snowflake make its cloud-native database able to run on AWS, then Azure, and in 2020 GCP; what Snowflake has done is abstract away the cloud infrastructure complexity and create what it calls the data cloud. What's the data cloud? We don't believe the data cloud is just a marketing term without substance. Just as SaaS simplified application software and IaaS made it possible to eliminate the value drain associated with provisioning infrastructure, a data cloud, in concept, can simplify data access, break down fragmentation, and enable shared data across the globe. Snowflake has a first-mover advantage in this space, and we see a number of fundamental aspects that comprise a data cloud. First, massive scale, with virtually unlimited compute and storage resources that are enabled by the public cloud. We talk about this a lot. Second is a data or database architecture that's built to take advantage of native public cloud services.
This is why Frank Slootman says, "We've burned the boats. We're not ever doing on-prem. We're all in on cloud and cloud native." Third is an abstraction layer that hides the complexity of the infrastructure. And fourth is a governed and secured shared-access system, where any user in the system, if allowed, can get access to any data in the cloud. So a key enabler of the data cloud is this thing called the global data mesh. Now, earlier this year, Snowflake introduced its global data mesh. Over the course of its recent history, Snowflake has been building out its data cloud by creating data regions, strategically tapping key AWS regions and then adding Azure and GCP. The complexity of the underlying cloud infrastructure has been stripped away to enable self-service, and any Snowflake user becomes part of this global mesh, independent of the cloud that they're on. Okay. So now, let's go back to what we were talking about earlier. Users in this mesh will be our domain owners. They're building monetizable services and products around data. They're most likely dealing with relatively small, read-only datasets. They can ingest data from any source very easily and quickly set up security and governance to enable data sharing across different parts of an organization or, very importantly, an ecosystem. Access control and governance is automated. The datasets are addressable. The data owners have clearly defined missions, and they own the data through the life cycle, data that is specific and purposely shaped for their missions. Now, you're probably asking, "What happens to the technical team and the underlying infrastructure and clusters? How do I get the compute close to the data? And what about data sovereignty and the physical storage layer, and the costs?" These are all good questions, and I'm not saying they're trivial. But the answer is that these are implementation details that are pushed to a self-service layer managed by a group of engineers that serves the data owners. And as long as the domain expert/data owner is driving monetization, this piece of the puzzle becomes self-funding. As I said before, Snowflake has to help these users optimize their spend with predictive tooling that aligns spend with value and shows ROI. While there may not be a strong motivation for Snowflake to do this, my belief is that they'd better get good at it, or someone else will do it for them and steal their ideas. All right. Let me end with some ETR data to show you just how Snowflake is getting a foothold in the market. Followers of this program know that ETR uses a consistent methodology to go to its practitioner base, its buyer base, each quarter and ask them a series of questions. They focus on the areas that the technology buyer is most familiar with, and they ask a series of questions to determine the spending momentum around a company within a specific domain. This chart shows one of my favorite examples. It shows data from the October ETR survey of 1,438 respondents, and it isolates the data warehouse and database sector. I know I just got through telling you that the world is going to change and Snowflake's not a data warehouse vendor, but there's no construct today in the ETR dataset to cut on a data cloud or a globally distributed data mesh, so you're going to have to deal with this. What this chart shows is net score on the y-axis. That's a measure of spending velocity, and it's calculated by asking customers, "Are you spending more or less on a particular platform?" and then subtracting the lesses from the mores. It's more granular than that, but that's the basic concept.
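To make that calculation concrete, here's a small illustrative sketch. The bucket percentages are invented for the example, not actual ETR results, and the formula mirrors the wheel-chart breakdown described next.

```python
# Illustrative net score arithmetic; the survey percentages are invented.
def net_score(adopting, spending_more, flat, declining, retiring):
    """Each argument is the percent of respondents in that bucket (sums to 100)."""
    assert adopting + spending_more + flat + declining + retiring == 100
    return (adopting + spending_more) - (declining + retiring)

# e.g. 30% new adoptions, 50% spending more, 15% flat, 4% declining, 1% retiring
print(net_score(30, 50, 15, 4, 1))  # -> 75, an elevated net score
```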
Now, on the x-axis is market share, which is ETR's measure of pervasiveness in the survey. You can see superimposed in the upper right-hand corner a table that shows the net score and the shared N for each company. Now, shared N is the number of mentions in the dataset within, in this case, the data warehousing sector. Snowflake, once again, leads all players with a 75% net score. This is a very elevated number and is higher than that of all other players, including the big cloud companies. Now, we've been tracking this for a while, and Snowflake is holding firm on both dimensions. When Snowflake first hit the dataset, it was in the single digits along the horizontal axis, and it continues to creep to the right as it adds more customers. Now, here's another chart. I call it the wheel chart. It breaks down the components of Snowflake's net score, or spending momentum. The lime green is new adoption, the forest green is customers spending more than 5%, the gray is flat spend, the pink is declining by more than 5%, and the bright red is retiring the platform. So you can see the trend. It's all momentum for this company. Now, what Snowflake has done is grab hold of the market by simplifying the data warehouse. But the strategic aspect of that is that it enables the data cloud, leveraging the global mesh concept. And the company has introduced a data marketplace to facilitate data sharing across ecosystems. This is all about network effects. In the mid-to-late 1990s, as the internet was being built out, I worked at IDG with Bob Metcalfe, who was the publisher of InfoWorld. During that time, we'd go on speaking tours all over the world, and I would listen very carefully as he applied Metcalfe's law to the internet. Metcalfe's law states that the value of a network is proportional to the square of the number of connected nodes, or users, on that system. Said another way, while the cost of adding new nodes to a network scales linearly, the consequent value scales exponentially. Now, apply that to the data cloud. The marginal cost of adding a user is negligible, practically zero, but the value of being able to access any dataset in the cloud... Well, let me just say this. There's no limitation to the magnitude of the market. My prediction is that this idea of a global mesh will completely change the way leading companies structure their businesses and, particularly, their data architectures. It will be the technologists that serve domain specialists, as it should be. Okay. Well, what do you think? DM me @dvellante, or email me at david.vellante@siliconangle.com, or comment on my LinkedIn. Remember, these episodes are all available as podcasts, so please subscribe wherever you listen. I publish weekly on wikibon.com and siliconangle.com, and don't forget to check out etr.plus for all the survey analysis. This is Dave Vellante for theCUBE Insights powered by ETR. Thanks for watching. Be well, and we'll see you next time. (upbeat music)
Ali Ghodsi, Databricks | Informatica World 2019
>> Live from Las Vegas, it's theCUBE, covering Informatica World 2019. Brought to you by Informatica. >> Welcome back everyone to theCUBE's live coverage of Informatica World 2019. I'm your host Rebecca Knight, along with my co-host John Furrier. We're joined by Ali Ghodsi, he is the CEO of Databricks, thank you so much for coming on, for returning to theCUBE. You're a CUBE veteran. >> Yes, thank you for having me. >> So I want to pick up on something that you said up on the main stage, and that is that every enterprise on the planet wants to add AI capabilities, but the hardest part of AI is not AI, it's the data. >> Yeah. >> Can you riff on that a little bit for our viewers? Elaborate? >> Yeah, actually, the interesting part is that, if you look at the companies that succeeded with AI, the actual AI algorithms they're using are actually algorithms from the 70s, you know, they were actually developed in the 70s, that's 50 years ago. So then how come they're succeeding now, when actually the same algorithms weren't working in the 70s, so people gave up on them? Like, these things called neural nets, right? Now they're en vogue and they're, you know, super successful. The reason is you have to apply orders of magnitude more data. If you feed those algorithms that we thought were broken orders of magnitude more data, you actually get great results, but that's actually hard. You know, dealing with petabyte-scale data and cleaning it, making sure that it's actually the right data for the task at hand, is not easy. So that's the part that people are struggling with. >> I saw you up on stage, I'm like ah, Ali's here, Databricks is here, that's awesome. Psyched that you stopped by theCUBE. Been a while. I wanted to get a quick update, 'cause you guys have been on a tear, doing some great work at Cal, we were just told before we came on camera. But what are you doing here? What's the, are there any announcements or news with Informatica? What's the story? >> Yeah, we're doing a partnership around Delta Lake, which is our next-generation engine that we built, so we're super excited about that. It integrates with all of the Informatica platform. So their ingestion tools, their transformation tools, and the catalog that they also have. So we think together, this can actually really help enterprises make that transition into the AI era. >> So you know, we've been followers, our 10th year, so remember when we were in the Cloudera office of Mike Olson and Amr Awadallah when we first started, and the Hadoop movement started, and then the cloud came along. Right when you guys started your company, the cloud growth took off. You guys were instrumental in changing the equation in dealing with data, data lakes, whatever they're calling it back then. So now, data, holistically, is a systems architecture. On premise it's a huge challenge; cloud native, well, no real challenge, people love that. Data feeds AI, lot of risk taking, lot of reward. We're seeing the SaaS business explode, Zoom Communications, the list goes on and on. You know, an enterprise that's trying to be SaaS is hard. You can't just take data from an enterprise and make it SaaS-ified. You really got to think differently. What are you guys doing? How have you guys evolved and vectored into that challenge, because this is where your core value proposition initially started to change. Take us through that Databricks story and how you're solving that problem today. >> Yeah, it's a great question.
Really what happened is that people started collecting a lot of data about a decade ago. And the promise was, you can do great things with this. There are all these aspirational use cases around machine learning, real time, it's going to be amazing. Right? So people started collecting it. They started storing one petabyte, two petabytes, and they kept going back to their boss and saying, this project is really successful, I now have five petabytes in it. But at some point the business said, okay, that's great, but what can you do with it? What business problems are you actually addressing? What are you solving? And so, in the last couple of years there's been a push towards, let's prove the value of these data lakes. And actually, many of these projects are falling short. Many are failing. And the reason is, people have just been dumping this data into data lakes without thinking about the structure, the quality, how it's going to be used. The use cases have been an afterthought. So the number one thing top of mind for everyone right now is, how do we make these data lakes that we have successful, so we can prove some business value to our management? This is the main problem that we're focusing on. Towards this, we built something called Delta Lake. It's something you situate on top of your data lake. And what it does is it increases the quality, the reliability, the performance, and the scale of your data lake. >> (John) So it's like a filter. >> Yeah. >> The cream rises to the top. >> (Ali) Exactly. >> Lets the sludge, the data swamp, stay below the clean water, if you will. >> Exactly, actually, you nailed it. So basically, we look at the data as it comes in, filter, as you said, and then, if there are any quality issues, we put it back in the data lake. It's fine, it can stay there. We'll figure out how to get value out of it later. But if it makes it into the Delta Lake, it will have high quality. Right? So that's great. And since we're anyway already looking at all the data as it's coming in, we might as well also store a lot of indices and a lot of things that let us performance-optimize it later on. So that, later, when people are actually trying to use that data, they get really high performance, they get really good quality. And we also added ACID transactions to it, so that now you're also getting all those transactional use cases working on your existing data lake. >> I saw, at my daughter's graduation at Cal Berkeley this weekend and yesterday, people around with Databricks backpacks. Very popular in academia. You guys got the young generation coming in. What's the update on the company? How many employees? What's the traction? Give us a quick business update. >> Yeah, we're about 800 employees now. About 100 people in Europe, I would say, and maybe 40-50 people in Asiapac. We're expanding the ME and the Asia business. >> (John) Growth mode. >> Yeah, growth mode. So it's expanding as fast as possible. I mean, actually, as a CEO, I try to always slow the hiring down to make sure that we keep the quality bar. So that's actually top of mind for me. But yeah, we're-- >> (John) You did Delta Lake on that one. >> Yeah (laughing) >> Exactly. Yeah, and we're super excited about working with these universities. We get a lot of graduate students from top universities-- >> And Cal had the first-ever class in data analytics, what was that? The inaugural data analytics class just graduated. Shows how early it is. >> Yeah, yeah, yeah.
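For readers who want to see the pattern Ghodsi describes, filtering data into Delta, keeping indices, and getting ACID transactions, here's a hedged sketch using the open-source delta-spark API; the paths and column names are hypothetical.

```python
# Sketch of the Delta Lake pattern described above. Assumes an active
# SparkSession (`spark`) with Delta Lake configured; paths are hypothetical.
from pyspark.sql import functions as F
from delta.tables import DeltaTable

raw = spark.read.json("/mnt/lake/raw/events")

# The "filter": records that pass quality checks make it into Delta;
# the rest stay in the raw data lake for later.
good = raw.filter(F.col("event_id").isNotNull() & F.col("event_ts").isNotNull())
good.write.format("delta").mode("append").save("/mnt/lake/delta/events")

# ACID upsert (MERGE): the kind of transactional operation a plain
# data lake lacks.
updates = spark.read.json("/mnt/lake/raw/event_updates")
(DeltaTable.forPath(spark, "/mnt/lake/delta/events").alias("t")
 .merge(updates.alias("u"), "t.event_id = u.event_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```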
And they actually used Databricks, the community edition; a class of over a thousand students at Cal used the platform. So they're going to be trained in data science as they come out. >> So I want to ask about that, because, as you said, you're trying to slow down the hiring to make sure that you are maintaining a high bar for your new hires. But yet, I'm sure there's huge demand, because you are in growth mode. So what are you doing? You said you're working with universities to make sure that the next generation is trained up and is capable of performing at Databricks. So tell us more about those efforts. >> Yeah, I mean, so, obviously university recruiting is big for us. Cal, I think Databricks has the longest line of all the companies that come there on the career fair day. So, we work very closely with these universities. I think the next generation, as they come out, this generation that's coming out today, actually is data science trained. So it's a big difference. There is a huge skills gap out there. Every big enterprise you talk to tells you, my biggest problem is actually, I don't have skilled people. Can you help me hire people? I say, hey, we're not in the recruiting business. But the good news is, if you look at the universities, they're all training thousands and thousands of data scientists every year now. I can tell you, just at Cal, because I happen to be on the faculty there, almost every applicant now to grad school wants to do something AI related. Which has actually led to, if you look at all the programs in universities today, professors who used to do networking say, we do intelligent networks. People who do databases say, we do intelligent databases. People who do systems research say, hey, we do intelligent systems, right? So what that means is, in a couple of years you'll have lots of students coming out, and these companies that are now struggling with hiring will then be able to hire this talent and will actually succeed better with these AI projects. >> As they say in Berkeley, nothing like a good revolution once in a while. AI is kind of changing everyone over. I got to ask you, for the young kids out there, and parents who have kids either in elementary school or high school, everyone is trying to figure out, and there's not yet a clear playbook, we're starting to see first-generation training, but is there a skill set? Because there's a range in surface area, you got hardcore coding to ethics, and everything in between, from visualization, multiple dimensions of opportunities. What skills do you think people could hone or tweak that may not be on a curriculum, or pieces of different curriculums in school, that would be a good foundation for folks learning and wanting to jump into data and data value, whether it's coding to ethics? >> Yeah, just looking at my own background and seeing what I got to learn in school, the thing that was lacking, compared to what's needed today, is statistics. Understanding of statistics, statistical knowledge, that I think is going to be pervasive. So I think, 10, 15 years from now, no matter which field you're in, actually whatever job you have, you have to have some basic level of statistical understanding, 'cause the systems you're working with will be spitting out statistics and numbers, and you need to understand: what are false positives? What is the sample? What do these things mean?
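As a tiny illustration of the statistical literacy he's pointing at, here's a sketch with invented counts showing how false positives are read out of a classifier's results.

```python
# Invented example: a classifier evaluated on 1,000 samples.
tp, fp, tn, fn = 80, 45, 855, 20   # true/false positives, true/false negatives

false_positive_rate = fp / (fp + tn)   # how often real negatives get wrongly flagged
precision = tp / (tp + fp)             # how many flagged items are truly positive

print(f"FPR = {false_positive_rate:.3f}, precision = {precision:.3f}")
# -> FPR = 0.050, precision = 0.640
```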
So that's one thing that's definitely missing, and actually it's coming, that's one. The second is, computing will continue being important. So, at the intersection of those two are, I think, a lot of those jobs. >> In all fields, like we were talking about earlier, biology, everything's intersecting, biochemistry to whatever, right? >> (Ali) Yeah. >> I got to ask you about, well, I'm a little old school, I'm 53 years old, but I remember when I broke into the business coding, I used to walk into departments, they were called DP, data processing. So we're getting into the data processing world now; you've got statistics, you've got pipelines, these are data concepts. So I got to ask you, companies in the enterprise may be slower to move to the cutting edge like you guys are, and they got to figure out where to store the data. So can you share your opinion or view on how customers are thinking, and how they maybe should be architecting data, on premise, in the cloud? Certainly cloud's great if you're cloud native, pure SaaS, born in the cloud like a start-up. But if you're a large enterprise and you want to be SaaS-like, to have all that benefit, take the risk with the reward of being agile, you got to have data, because if you don't feed the data into the machine learning or AI, you're not going to have good AI. So you need to get that data feeding in fast. And if it's constrained with regulation and compliance, you're screwed. So what's your view on this? Where should it be stored? What's your opinion? >> Yeah, we've had the same opinion for five, six years, right? Which is, the data belongs in the cloud. Don't try to do this yourself. Don't try to do this on prem. Don't store it in Hadoop, it's not built for this. Store it in the cloud. In the cloud, first of all, you get a lot of security benefits that the cloud vendors are already working on. So that's one good thing about it. Second, you get reliability. You get the 10, 11 nines of availability, so that's great, you get that. Start collecting data there. Another reason you want to do it in the cloud is that a lot of the datasets that you need to actually get good-quality results are available in the cloud. Oftentimes what happens with AI is, you build a predictive model, but actually, it's terrible. It didn't work well. So you go back, and then the main trick, the first trick you use to increase the quality, is actually augmenting that data with other datasets. You might purchase those datasets from other vendors. You don't want to be shipping hard drives around or, you know, getting that into your data center. Those will be available in the cloud, so you can augment that data. So we're big fans of storing your data in data lakes, in the cloud. We obviously believe that you need to make that data high quality and reliable. With that, we believe the Delta Lake platform, the open-source project that we created, is a great vehicle for that. But I think moving to the cloud is the number one thing. >> (John) And hybrid works with that if you need to have something on premise? >> In my opinion, the two worlds are so different that it's hard. You hear a lot of vendors say, we're the hybrid solution that works on both, and so on. But the two models are so different, fundamentally, that it's hard to actually make them work well. I have not yet seen a customer or enterprise make it work. You see a lot of offerings where people say hybrid is the way. Of course, a lot of on-prem vendors are now saying, hey, we're the hybrid solution.
I haven't actually seen that be successful, to be frank. Maybe someone will crack that nut, but-- >> I think it's an operational question, to see who can make it work. Ali, congratulations on all your success. Great to see you. >> Yeah, it's been great having you on the show. >> Thank you so much for having me. >> You are watching theCUBE at Informatica World 2019. I'm Rebecca Knight, for John Furrier, stay tuned.