Breaking Analysis: How Snowflake Plans to Change a Flawed Data Warehouse Model
>> From theCUBE Studios in Palo Alto in Boston, bringing you data-driven insights from theCUBE in ETR. This is Breaking Analysis with Dave Vellante. >> Snowflake is not going to grow into its valuation by stealing the croissant from the breakfast table of the on-prem data warehouse vendors. Look, even if snowflake got 100% of the data warehouse business, it wouldn't come close to justifying its market cap. Rather Snowflake has to create an entirely new market based on completely changing the way organizations think about monetizing data. Every organization I talk to says it wants to be, or many say they already are data-driven. why wouldn't you aspire to that goal? There's probably nothing more strategic than leveraging data to power your digital business and creating competitive advantage. But many businesses are failing, or I predict, will fail to create a true data-driven culture because they're relying on a flawed architectural model formed by decades of building centralized data platforms. Welcome everyone to this week's Wikibon Cube Insights powered by ETR. In this Breaking Analysis, I want to share some new thoughts and fresh ETR data on how organizations can transform their businesses through data by reinventing their data architectures. And I want to share our thoughts on why we think Snowflake is currently in a very strong position to lead this effort. Now, on November 17th, theCUBE is hosting the Snowflake Data Cloud Summit. Snowflake's ascendancy and its blockbuster IPO has been widely covered by us and many others. Now, since Snowflake went public, we've been inundated with outreach from investors, customers, and competitors that wanted to either better understand the opportunities or explain why their approach is better or different. And in this segment, ahead of Snowflake's big event, we want to share some of what we learned and how we see it. Now, theCUBE is getting paid to host this event, so I need you to know that, and you draw your own conclusions from my remarks. But neither Snowflake nor any other sponsor of theCUBE or client of SiliconANGLE Media has editorial influence over Breaking Analysis. The opinions here are mine, and I would encourage you to read my ethics statement in this regard. I want to talk about the failed data model. The problem is complex, I'm not debating that. Organizations have to integrate data and platforms with existing operational systems, many of which were developed decades ago. And as a culture and a set of processes that have been built around these systems, and they've been hardened over the years. This chart here tries to depict the progression of the monolithic data source, which, for me, began in the 1980s when Decision Support Systems or DSS promised to solve our data problems. The data warehouse became very popular and data marts sprung up all over the place. This created more proprietary stovepipes with data locked inside. The Enron collapse led to Sarbanes-Oxley. Now, this tightened up reporting. The requirements associated with that, it breathed new life into the data warehouse model. But it remained expensive and cumbersome, I've talked about that a lot, like a snake swallowing a basketball. The 2010s ushered in the big data movement, and Data Lakes emerged. With a dupe, we saw the idea of no schema online, where you put structured and unstructured data into a repository, and figure it all out on the read. What emerged was a fairly complex data pipeline that involved ingesting, cleaning, processing, analyzing, preparing, and ultimately serving data to the lines of business. And this is where we are today with very hyper specialized roles around data engineering, data quality, data science. There's lots of batch of processing going on, and Spark has emerged to improve the complexity associated with MapReduce, and it definitely helped improve the situation. We're also seeing attempts to blend in real time stream processing with the emergence of tools like Kafka and others. But I'll argue that in a strange way, these innovations actually compound the problem. And I want to discuss that because what they do is they heighten the need for more specialization, more fragmentation, and more stovepipes within the data life cycle. Now, in reality, and it pains me to say this, it's the outcome of the big data movement, as we sit here in 2020, that we've created thousands of complicated science projects that have once again failed to live up to the promise of rapid cost-effective time to insights. So, what will the 2020s bring? What's the next silver bullet? You hear terms like the lakehouse, which Databricks is trying to popularize. And I'm going to talk today about data mesh. These are other efforts they look to modernize datalakes and sometimes merge the best of data warehouse and second-generation systems into a new paradigm, that might unify batch and stream frameworks. And this definitely addresses some of the gaps, but in our view, still suffers from some of the underlying problems of previous generation data architectures. In other words, if the next gen data architecture is incremental, centralized, rigid, and primarily focuses on making the technology to get data in and out of the pipeline work, we predict it's going to fail to live up to expectations again. Rather, what we're envisioning is an architecture based on the principles of distributed data, where domain knowledge is the primary target citizen, and data is not seen as a by-product, i.e, the exhaust of an operational system, but rather as a service that can be delivered in multiple forms and use cases across an ecosystem. This is why we often say the data is not the new oil. We don't like that phrase. A specific gallon of oil can either fuel my home or can lubricate my car engine, but it can't do both. Data does not follow the same laws of scarcity like natural resources. Again, what we're envisioning is a rethinking of the data pipeline and the associated cultures to put data needs of the domain owner at the core and provide automated, governed, and secure access to data as a service at scale. Now, how is this different? Let's take a look and unpack the data pipeline today and look deeper into the situation. You all know this picture that I'm showing. There's nothing really new here. The data comes from inside and outside the enterprise. It gets processed, cleanse or augmented so that it can be trusted and made useful. Nobody wants to use data that they can't trust. And then we can add machine intelligence and do more analysis, and finally deliver the data so that domain specific consumers can essentially build data products and services or reports and dashboards or content services, for instance, an insurance policy, a financial product, a loan, that these are packaged and made available for someone to make decisions on or to make a purchase. And all the metadata associated with this data is packaged along with the dataset. Now, we've broken down these steps into atomic components over time so we can optimize on each and make them as efficient as possible. And down below, you have these happy stick figures. Sometimes they're happy. But they're highly specialized individuals and they each do their job and they do it well to make sure that the data gets in, it gets processed and delivered in a timely manner. Now, while these individual pieces seemingly are autonomous and can be optimized and scaled, they're all encompassed within the centralized big data platform. And it's generally accepted that this platform is domain agnostic. Meaning the platform is the data owner, not the domain specific experts. Now there are a number of problems with this model. The first, while it's fine for organizations with smaller number of domains, organizations with a large number of data sources and complex domain structures, they struggle to create a common data parlance, for example, in a data culture. Another problem is that, as the number of data sources grows, organizing and harmonizing them in a centralized platform becomes increasingly difficult, because the context of the domain and the line of business gets lost. Moreover, as ecosystems grow and you add more data, the processes associated with the centralized platform tend to get further genericized. They again lose that domain specific context. Wait (chuckling), there are more problems. Now, while in theory organizations are optimizing on the piece parts of the pipeline, the reality is, as the domain requires a change, for example, a new data source or an ecosystem partnership requires a change in access or processes that can benefit a domain consumer, the reality is the change is subservient to the dependencies and the need to synchronize across these discrete parts of the pipeline or actually, orthogonal to each of those parts. In other words, in actuality, the monolithic data platform itself remains the most granular part of the system. Now, when I complain about this faulty structure, some folks tell me this problem has been solved. That there are services that allow new data sources to really easily be added. A good example of this is Databricks Ingest, which is, it's an auto loader. And what it does is it simplifies the ingestion into the company's Delta Lake offering. And rather than centralizing in a data warehouse, which struggles to efficiently allow things like Machine Learning frameworks to be incorporated, this feature allows you to put all the data into a centralized datalake. More so the argument goes, that the problem that I see with this, is while the approach does definitely minimizes the complexities of adding new data sources, it still relies on this linear end-to-end process that slows down the introduction of data sources from the domain consumer beside of the pipeline. In other words, the domain experts still has to elbow her way into the front of the line or the pipeline, in this case, to get stuff done. And finally, the way we are organizing teams is a point of contention, and I believe is going to continue to cause problems down the road. Specifically, we've again, we've optimized on technology expertise, where for example, data engineers, well, really good at what they do, they're often removed from the operations of the business. Essentially, we created more silos and organized around technical expertise versus domain knowledge. As an example, a data team has to work with data that is delivered with very little domain specificity, and serves a variety of highly specialized consumption use cases. All right. I want to step back for a minute and talk about some of the problems that people bring up with Snowflake and then I'll relate it back to the basic premise here. As I said earlier, we've been hammered by dozens and dozens of data points, opinions, criticisms of Snowflake. And I'll share a few here. But I'll post a deeper technical analysis from a software engineer that I found to be fairly balanced. There's five Snowflake criticisms that I'll highlight. And there are many more, but here are some that I want to call out. Price transparency. I've had more than a few customers telling me they chose an alternative database because of the unpredictable nature of Snowflake's pricing model. Snowflake, as you probably know, prices based on consumption, just like AWS and other cloud providers. So just like AWS, for example, the bill at the end of the month is sometimes unpredictable. Is this a problem? Yes. But like AWS, I would say, "Kill me with that problem." Look, if users are creating value by using Snowflake, then that's good for the business. But clearly this is a sore point for some users, especially for procurement and finance, which don't like unpredictability. And Snowflake needs to do a better job communicating and managing this issue with tooling that can predict and help better manage costs. Next, workload manage or lack thereof. Look, if you want to isolate higher performance workloads with Snowflake, you just spin up a separate virtual warehouse. It's kind of a brute force approach. It works generally, but it will add expense. I'm kind of reminded of Pure Storage and its approach to storage management. The engineers at Pure, they always design for simplicity, and this is the approach that Snowflake is taking. Usually, Pure and Snowflake, as I have discussed in a moment, is Pure's ascendancy was really based largely on stealing share from Legacy EMC systems. Snowflake, in my view, has a much, much larger incremental market opportunity. Next is caching architecture. You hear this a lot. At the end of the day, Snowflake is based on a caching architecture. And a caching architecture has to be working for some time to optimize performance. Caches work well when the size of the working set is small. Caches generally don't work well when the working set is very, very large. In general, transactional databases have pretty small datasets. And in general, analytics datasets are potentially much larger. Is it Snowflake in the analytics business? Yes. But the good thing that Snowflake has done is they've enabled data sharing, and it's caching architecture serves its customers well because it allows domain experts, you're going to hear this a lot from me today, to isolate and analyze problems or go after opportunities based on tactical needs. That said, very big queries across whole datasets or badly written queries that scan the entire database are not the sweet spot for Snowflake. Another good example would be if you're doing a large audit and you need to analyze a huge, huge dataset. Snowflake's probably not the best solution. Complex joins, you hear this a lot. The working set of complex joins, by definition, are larger. So, see my previous explanation. Read only. Snowflake is pretty much optimized for read only data. Maybe stateless data is a better way of thinking about this. Heavily right intensive workloads are not the wheelhouse of Snowflake. So where this is maybe an issue is real-time decision-making and AI influencing. A number of times, Snowflake, I've talked about this, they might be able to develop products or acquire technology to address this opportunity. Now, I want to explain. These issues would be problematic if Snowflake were just a data warehouse vendor. If that were the case, this company, in my opinion, would hit a wall just like the NPP vendors that proceeded them by building a better mouse trap for certain use cases hit a wall. Rather, my promise in this episode is that the future of data architectures will be really to move away from large centralized warehouses or datalake models to a highly distributed data sharing system that puts power in the hands of domain experts at the line of business. Snowflake is less computationally efficient and less optimized for classic data warehouse work. But it's designed to serve the domain user much more effectively in our view. We believe that Snowflake is optimizing for business effectiveness, essentially. And as I said before, the company can probably do a better job at keeping passionate end users from breaking the bank. But as long as these end users are making money for their companies, I don't think this is going to be a problem. Let's look at the attributes of what we're proposing around this new architecture. We believe we'll see the emergence of a total flip of the centralized and monolithic big data systems that we've known for decades. In this architecture, data is owned by domain-specific business leaders, not technologists. Today, it's not much different in most organizations than it was 20 years ago. If I want to create something of value that requires data, I need to cajole, beg or bribe the technology and the data team to accommodate. The data consumers are subservient to the data pipeline. Whereas in the future, we see the pipeline as a second class citizen, with a domain expert is elevated. In other words, getting the technology and the components of the pipeline to be more efficient is not the key outcome. Rather, the time it takes to envision, create, and monetize a data service is the primary measure. The data teams are cross-functional and live inside the domain versus today's structure where the data team is largely disconnected from the domain consumer. Data in this model, as I said, is not the exhaust coming out of an operational system or an external source that is treated as generic and stuffed into a big data platform. Rather, it's a key ingredient of a service that is domain-driven and monetizable. And the target system is not a warehouse or a lake. It's a collection of connected domain-specific datasets that live in a global mesh. What is a distributed global data mesh? A data mesh is a decentralized architecture that is domain aware. The datasets in the system are purposely designed to support a data service or data product, if you prefer. The ownership of the data resides with the domain experts because they have the most detailed knowledge of the data requirement and its end use. Data in this global mesh is governed and secured, and every user in the mesh can have access to any dataset as long as it's governed according to the edicts of the organization. Now, in this model, the domain expert has access to a self-service and obstructed infrastructure layer that is supported by a cross-functional technology team. Again, the primary measure of success is the time it takes to conceive and deliver a data service that could be monetized. Now, by monetize, we mean a data product or data service that it either cuts cost, it drives revenue, it saves lives, whatever the mission is of the organization. The power of this model is it accelerates the creation of value by putting authority in the hands of those individuals who are closest to the customer and have the most intimate knowledge of how to monetize data. It reduces the diseconomies at scale of having a centralized or a monolithic data architecture. And it scales much better than legacy approaches because the atomic unit is a data domain, not a monolithic warehouse or a lake. Zhamak Dehghani is a software engineer who is attempting to popularize the concept of a global mesh. Her work is outstanding, and it's strengthened our belief that practitioners see this the same way that we do. And to paraphrase her view, "A domain centric system must be secure and governed with standard policies across domains." It has to be trusted. As I said, nobody's going to use data they don't trust. It's got to be discoverable via a data catalog with rich metadata. The data sets have to be self-describing and designed for self-service. Accessibility for all users is crucial as is interoperability, without which distributed systems, as we know, fail. So what does this all have to do with Snowflake? As I said, Snowflake is not just a data warehouse. In our view, it's always had the potential to be more. Our assessment is that attacking the data warehouse use cases, it gave Snowflake a straightforward easy-to-understand narrative that allowed it to get a foothold in the market. Data warehouses are notoriously expensive, cumbersome, and resource intensive, but they're a critical aspect to reporting and analytics. So it was logical for Snowflake to target on-premise legacy data warehouses and their smaller cousins, the datalakes, as early use cases. By putting forth and demonstrating a simple data warehouse alternative that can be spun up quickly, Snowflake was able to gain traction, demonstrate repeatability, and attract the capital necessary to scale to its vision. This chart shows the three layers of Snowflake's architecture that have been well-documented. The separation of compute and storage, and the outer layer of cloud services. But I want to call your attention to the bottom part of the chart, the so-called Cloud Agnostic Layer that Snowflake introduced in 2018. This layer is somewhat misunderstood. Not only did Snowflake make its Cloud-native database compatible to run on AWS than Azure in the 2020 GCP, what Snowflake has done is to obstruct cloud infrastructure complexity and create what it calls the data cloud. What's the data cloud? We don't believe the data cloud is just a marketing term that doesn't have any substance. Just as SAS is Simplified Application Software and iOS made it possible to eliminate the value drain associated with provisioning infrastructure, a data cloud, in concept, can simplify data access, and break down fragmentation and enable shared data across the globe. Snowflake, they have a first mover advantage in this space, and we see a number of fundamental aspects that comprise a data cloud. First, massive scale with virtually unlimited compute and storage resource that are enabled by the public cloud. We talk about this a lot. Second is a data or database architecture that's built to take advantage of native public cloud services. This is why Frank Slootman says, "We've burned the boats. We're not ever doing on-prem. We're all in on cloud and cloud native." Third is an obstruction layer that hides the complexity of infrastructure. and fourth is a governed and secured shared access system where any user in the system, if allowed, can get access to any data in the cloud. So a key enabler of the data cloud is this thing called the global data mesh. Now, earlier this year, Snowflake introduced its global data mesh. Over the course of its recent history, Snowflake has been building out its data cloud by creating data regions, strategically tapping key locations of AWS regions and then adding Azure and GCP. The complexity of the underlying cloud infrastructure has been stripped away to enable self-service, and any Snowflake user becomes part of this global mesh, independent of the cloud that they're on. Okay. So now, let's go back to what we were talking about earlier. Users in this mesh will be our domain owners. They're building monetizable services and products around data. They're most likely dealing with relatively small read only datasets. They can adjust data from any source very easily and quickly set up security and governance to enable data sharing across different parts of an organization, or, very importantly, an ecosystem. Access control and governance is automated. The data sets are addressable. The data owners have clearly defined missions and they own the data through the life cycle. Data that is specific and purposely shaped for their missions. Now, you're probably asking, "What happens to the technical team and the underlying infrastructure and the cluster it's in? How do I get the compute close to the data? And what about data sovereignty and the physical storage later, and the costs?" All these are good questions, and I'm not saying these are trivial. But the answer is these are implementation details that are pushed to a self-service layer managed by a group of engineers that serves the data owners. And as long as the domain expert/data owner is driving monetization, this piece of the puzzle becomes self-funding. As I said before, Snowflake has to help these users to optimize their spend with predictive tooling that aligns spend with value and shows ROI. While there may not be a strong motivation for Snowflake to do this, my belief is that they'd better get good at it or someone else will do it for them and steal their ideas. All right. Let me end with some ETR data to show you just how Snowflake is getting a foothold on the market. Followers of this program know that ETR uses a consistent methodology to go to its practitioner base, its buyer base each quarter and ask them a series of questions. They focus on the areas that the technology buyer is most familiar with, and they ask a series of questions to determine the spending momentum around a company within a specific domain. This chart shows one of my favorite examples. It shows data from the October ETR survey of 1,438 respondents. And it isolates on the data warehouse and database sector. I know I just got through telling you that the world is going to change and Snowflake's not a data warehouse vendor, but there's no construct today in the ETR dataset to cut a data cloud or globally distributed data mesh. So you're going to have to deal with this. What this chart shows is net score in the y-axis. That's a measure of spending velocity, and it's calculated by asking customers, "Are you spending more or less on a particular platform?" And then subtracting the lesses from the mores. It's more granular than that, but that's the basic concept. Now, on the x-axis is market share, which is ETR's measure of pervasiveness in the survey. You can see superimposed in the upper right-hand corner, a table that shows the net score and the shared N for each company. Now, shared N is the number of mentions in the dataset within, in this case, the data warehousing sector. Snowflake, once again, leads all players with a 75% net score. This is a very elevated number and is higher than that of all other players, including the big cloud companies. Now, we've been tracking this for a while, and Snowflake is holding firm on both dimensions. When Snowflake first hit the dataset, it was in the single digits along the horizontal axis and continues to creep to the right as it adds more customers. Now, here's another chart. I call it the wheel chart that breaks down the components of Snowflake's net score or spending momentum. The lime green is new adoption, the forest green is customers spending more than 5%, the gray is flat spend, the pink is declining by more than 5%, and the bright red is retiring the platform. So you can see the trend. It's all momentum for this company. Now, what Snowflake has done is they grabbed a hold of the market by simplifying data warehouse. But the strategic aspect of that is that it enables the data cloud leveraging the global mesh concept. And the company has introduced a data marketplace to facilitate data sharing across ecosystems. This is all about network effects. In the mid to late 1990s, as the internet was being built out, I worked at IDG with Bob Metcalfe, who was the publisher of InfoWorld. During that time, we'd go on speaking tours all over the world, and I would listen very carefully as he applied Metcalfe's law to the internet. Metcalfe's law states that the value of the network is proportional to the square of the number of connected nodes or users on that system. Said another way, while the cost of adding new nodes to a network scales linearly, the consequent value scores scales exponentially. Now, apply that to the data cloud. The marginal cost of adding a user is negligible, practically zero, but the value of being able to access any dataset in the cloud... Well, let me just say this. There's no limitation to the magnitude of the market. My prediction is that this idea of a global mesh will completely change the way leading companies structure their businesses and, particularly, their data architectures. It will be the technologists that serve domain specialists as it should be. Okay. Well, what do you think? DM me @dvellante or email me at david.vellante@siliconangle.com or comment on my LinkedIn? Remember, these episodes are all available as podcasts, so please subscribe wherever you listen. I publish weekly on wikibon.com and siliconangle.com, and don't forget to check out etr.plus for all the survey analysis. This is Dave Vellante for theCUBE Insights powered by ETR. Thanks for watching. Be well, and we'll see you next time. (upbeat music)
SUMMARY :
This is Breaking Analysis and the data team to accommodate.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave Vellante | PERSON | 0.99+ |
Frank Slootman | PERSON | 0.99+ |
Bob Metcalfe | PERSON | 0.99+ |
Zhamak Dehghani | PERSON | 0.99+ |
Metcalfe | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
100% | QUANTITY | 0.99+ |
Palo Alto | LOCATION | 0.99+ |
November 17th | DATE | 0.99+ |
75% | QUANTITY | 0.99+ |
Snowflake | ORGANIZATION | 0.99+ |
five | QUANTITY | 0.99+ |
2020 | DATE | 0.99+ |
Snowflake | TITLE | 0.99+ |
1,438 respondents | QUANTITY | 0.99+ |
2018 | DATE | 0.99+ |
October | DATE | 0.99+ |
david.vellante@siliconangle.com | OTHER | 0.99+ |
today | DATE | 0.99+ |
more than 5% | QUANTITY | 0.99+ |
theCUBE Studios | ORGANIZATION | 0.99+ |
First | QUANTITY | 0.99+ |
2020s | DATE | 0.99+ |
Snowflake Data Cloud Summit | EVENT | 0.99+ |
Second | QUANTITY | 0.99+ |
SiliconANGLE Media | ORGANIZATION | 0.99+ |
both dimensions | QUANTITY | 0.99+ |
theCUBE | ORGANIZATION | 0.99+ |
iOS | TITLE | 0.99+ |
DSS | ORGANIZATION | 0.99+ |
1980s | DATE | 0.99+ |
each company | QUANTITY | 0.99+ |
decades ago | DATE | 0.98+ |
zero | QUANTITY | 0.98+ |
first | QUANTITY | 0.98+ |
2010s | DATE | 0.98+ |
each quarter | QUANTITY | 0.98+ |
Third | QUANTITY | 0.98+ |
20 years ago | DATE | 0.98+ |
Databricks | ORGANIZATION | 0.98+ |
earlier this year | DATE | 0.98+ |
both | QUANTITY | 0.98+ |
Pure | ORGANIZATION | 0.98+ |
fourth | QUANTITY | 0.98+ |
IDG | ORGANIZATION | 0.97+ |
Today | DATE | 0.97+ |
each | QUANTITY | 0.97+ |
Decision Support Systems | ORGANIZATION | 0.96+ |
Boston | LOCATION | 0.96+ |
single digits | QUANTITY | 0.96+ |
siliconangle.com | OTHER | 0.96+ |
one | QUANTITY | 0.96+ |
Spark | TITLE | 0.95+ |
Legacy EMC | ORGANIZATION | 0.95+ |
Kafka | TITLE | 0.94+ |
ORGANIZATION | 0.94+ | |
Snowflake | EVENT | 0.92+ |
first mover | QUANTITY | 0.92+ |
Azure | TITLE | 0.91+ |
InfoWorld | ORGANIZATION | 0.91+ |
dozens and | QUANTITY | 0.91+ |
mid to | DATE | 0.91+ |