Breaking Analysis: Databricks faces critical strategic decisions…here’s why

>> From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante. >> Spark became a top level Apache project in 2014, and then shortly thereafter, burst onto the big data scene. Spark, along with the cloud, transformed and in many ways, disrupted the big data market. Databricks optimized its tech stack for Spark and took advantage of the cloud to really cleverly deliver a managed service that has become a leading AI and data platform among data scientists and data engineers. However, emerging customer data requirements are shifting in a direction that will cause modern data platform players generally and Databricks, specifically, we think, to make some key directional decisions and perhaps even reinvent themselves. Hello and welcome to this week's Wikibon theCUBE Insights, powered by ETR. In this Breaking Analysis, we're going to do a deep dive into Databricks. We'll explore its current impressive market momentum. We're going to use some ETR survey data to show that, and then we'll lay out how customer data requirements are changing and what the ideal data platform will look like in the midterm future. We'll then evaluate core elements of the Databricks portfolio against that vision, and then we'll close with some strategic decisions that we think the company faces. And to do so, we welcome in our good friend, George Gilbert, former equities analyst, market analyst, and current Principal at TechAlpha Partners. George, good to see you. Thanks for coming on. >> Good to see you, Dave. >> All right, let me set this up. We're going to start by taking a look at where Databricks sits in the market in terms of how customers perceive the company and what its momentum looks like. And this chart that we're showing here is data from ETS, the emerging technology survey of private companies. The N is 1,421.
What we did is we cut the data on three sectors, analytics, database-data warehouse, and AI/ML. The vertical axis is a measure of customer sentiment, which evaluates an IT decision maker's awareness of the firm and the likelihood of engaging and/or purchase intent. The horizontal axis shows mindshare in the dataset, and we've highlighted Databricks, which has been a consistent high performer in this survey over the last several quarters. And by the way, just as an aside, as we previously reported, OpenAI, which burst onto the scene this past quarter, leads all names, but Databricks is still prominent. You can see that ETR shows some open source tools for reference, but as far as firms go, Databricks is very impressively positioned. Now, let's see how they stack up to some mainstream cohorts in the data space, against some bigger companies and sometimes public companies. This chart shows net score on the vertical axis, which is a measure of spending momentum, and pervasiveness in the data set on the horizontal axis. You can see the chart insert in the upper right that informs how the dots are plotted: net score against shared N. And that red dotted line at 40% indicates a highly elevated net score, anything above that we think is really, really impressive. And here we're just comparing Databricks with Snowflake, Cloudera, and Oracle. And that squiggly line leading to Databricks shows their path since 2021 by quarter. And you can see it's performing extremely well, maintaining an elevated net score in that range. Now it's comparable in the vertical axis to Snowflake, and it consistently is moving to the right and gaining share. Now, why did we choose to show Cloudera and Oracle? The reason is that Cloudera got the whole big data era started and was disrupted by Spark and, of course, the cloud and Databricks. And Oracle, in many ways, was the target of early big data players like Cloudera. Take a listen to Cloudera CEO at the time, Mike Olson.
This is back in 2010, first year of theCUBE, play the clip. >> Look, back in the day, if you had a data problem, if you needed to run business analytics, you wrote the biggest check you could to Sun Microsystems, and you bought a great big, single box, central server, and any money that was left over, you handed to Oracle for database licenses and you installed that database on that box, and that was where you went for data. That was your temple of information. >> Okay? So Mike Olson implied that monolithic model was too expensive and inflexible, and Cloudera set out to fix that. But the best laid plans, as they say. George, what do you make of the data that we just shared? >> So where Databricks has really come up out of sort of Cloudera's tailpipe was they took big data processing, made it coherent, made it a managed service so it could run in the cloud. So it relieved customers of the operational burden. Where they're really strong and where their traditional meat and potatoes or bread and butter is, is the predictive and prescriptive analytics, the building and training and serving of machine learning models. They've tried to move into traditional business intelligence, the more traditional descriptive and diagnostic analytics, but they're less mature there. So what that means is, the reason you see Databricks and Snowflake kind of side by side is there are many, many accounts that have both Snowflake for business intelligence, Databricks for AI machine learning, where Snowflake, I'm sorry, where Databricks also did really well was in core data engineering, refining the data, the old ETL process, which kind of turned into ELT, where you load it into the analytic repository in raw form and refine it. And so people have really used both, and each is trying to get into the other. >> Yeah, absolutely. We've reported on this quite a bit. Snowflake, kind of moving into the domain of Databricks and vice versa.
And the last bit of ETR evidence that we want to share in terms of the company's momentum comes from ETR's Round Tables. They're run by Erik Bradley and now former Gartner analyst, and George, your colleague back at Gartner, Daren Brabham. And what we're going to show here is some direct quotes of IT pros in those Round Tables. There's a data science head and a CIO as well. Just make a few call outs here, we won't spend too much time on it, but starting at the top, like all of us, we can't talk about Databricks without mentioning Snowflake. Those two get us excited. Second comment zeros in on the flexibility and the robustness of Databricks from a data warehouse perspective. And then the last point is, despite competition from cloud players, Databricks has reinvented itself a couple of times over the years. And George, we're going to lay out today a scenario that perhaps calls for Databricks to do that once again. >> Their big opportunity and their big challenge, as for every tech company, is managing a technology transition. The transition that we're talking about is something that's been bubbling up, but it's really epochal. First time in 60 years, we're moving from an application-centric view of the world to a data-centric view, because decisions are becoming more important than automating processes. So let me let you sort of develop. >> Yeah, so let's talk about that here. We're going to put up some bullets on precisely that point and the changing sort of customer environment. So IT stacks are shifting, as George just said, from application centric silos to data centric stacks where the priority is shifting from automating processes to automating decisions. You know, look at RPA and there's still a lot of automation going on, but from the focus of that application centricity and the data locked into those apps, that's changing.
Data has historically been on the outskirts in silos, but organizations, you think of Amazon, think Uber, Airbnb, they're putting data at the core, and logic is increasingly being embedded in the data instead of the reverse. In other words, today, the data's locked inside the app, which is why you need to extract that data and stick it into a data warehouse. The point, George, is we're putting forth this new vision for how data is going to be used. And you've used this Uber example to underscore the future state. Please explain. >> Okay, so this is hopefully an example everyone can relate to. The idea is first, you're automating things that are happening in the real world and decisions that make those things happen autonomously without humans in the loop all the time. So to use the Uber example on your phone, you call a car, you call a driver. Automatically, the Uber app then looks at what drivers are in the vicinity, what drivers are free, matches one, calculates an ETA to you, calculates a price, calculates an ETA to your destination, and then directs the driver once they're there. The point of this is that that cannot happen in an application-centric world very easily because all these little apps, the drivers, the riders, the routes, the fares, those call on data locked up in many different apps, but they have to sit on a layer that makes it all coherent. >> But George, so if Uber's doing this, doesn't this tech already exist? Isn't there a tech platform that does this already? >> Yes, and the mission of the entire tech industry is to build services that make it possible to compose and operate similar platforms and tools, but with the skills of mainstream developers in mainstream corporations, not the rocket scientists at Uber and Amazon. >> Okay, so we're talking about horizontally scaling across the industry, and actually giving a lot more organizations access to this technology.
So by way of review, let's summarize the trend that's going on today in terms of the modern data stack that is propelling the likes of Databricks and Snowflake, which we just showed you in the ETR data and really is a tailwind for them. So the trend is toward this common repository for analytic data, that could be multiple virtual data warehouses inside of Snowflake, but you're in that Snowflake environment or Lakehouses from Databricks or multiple data lakes. And we've talked about what JP Morgan Chase is doing with the data mesh and gluing data lakes together, you've got various public clouds playing in this game, and then the data is annotated to have a common meaning. In other words, there's a semantic layer that enables applications to talk to the data elements and know that they have common and coherent meaning. So George, the good news is this approach is more effective than the legacy monolithic models that Mike Olson was talking about, so what's the problem with this in your view? >> So today's data platforms added immense value 'cause they connected the data that was previously locked up in these monolithic apps or on all these different microservices, and that supported traditional BI and AI/ML use cases. But now if we want to build apps like Uber or Amazon.com, where they've got essentially an autonomously running supply chain and e-commerce app where humans only care for and feed it, and the thing itself is figuring out what to buy, when to buy, where to deploy it, when to ship it, we need a semantic layer on top of the data. So that, as you were saying, the data that's coming from all those apps, the different apps, is integrated, not just connected, so that it means the same. And the issue is whenever you add a new layer to a stack to support new applications, there are implications for the already existing layers, like can they support the new layer and its use cases?
So for instance, if you add a semantic layer that embeds app logic with the data rather than vice versa, which we've been talking about and that's been the case for 60 years, then the new data layer faces challenges in that the way you manage that data, the way you analyze that data, is not supported by today's tools. >> Okay, so actually Alex, bring me up that last slide if you would, I mean, you're basically saying at the bottom here, today's repositories don't really do joins at scale. The future is you're talking about hundreds or thousands or millions of data connections, and today's systems, we're talking about, I don't know, 6, 8, 10 joins and that is the fundamental problem you're saying: a new data era is coming and existing systems won't be able to handle it? >> Yeah, one way of thinking about it is that even though we call them relational databases, when we actually want to do lots of joins or when we want to analyze data from lots of different tables, we created a whole new industry for analytic databases where you sort of munge the data together into fewer tables. So you didn't have to do as many joins because the joins are difficult and slow. And when you're going to arbitrarily join thousands, hundreds of thousands or across millions of elements, you need a new type of database. We have them, they're called graph databases, but to query them, you go back to the prerelational era in terms of their usability. >> Okay, so we're going to come back to that and talk about how you get around that problem. But let's first lay out what the ideal data platform of the future we think looks like. And again, we're going to come back to use this Uber example. In this graphic that George put together, awesome, we've got three layers. The application layer is where the data products reside. The example here is drivers, rides, maps, routes, ETA, et cetera. The digital version of what we were talking about in the previous slide, people, places and things.
The next layer is the data layer, that breaks down the silos and connects the data elements through semantics and everything is coherent. And then the bottom layer is the legacy operational systems that feed that data layer. George, explain what's different here, the graph database element, you talk about the relational query capabilities, and why can't I just throw memory at solving this problem? >> Some of the graph databases do throw memory at the problem and maybe without naming names, some of them live entirely in memory. And what you're dealing with is a prerelational in-memory database system where you navigate between elements, and the issue with that is we've had SQL for 50 years, so we don't have to navigate, we can say what we want without saying how to get it. That's the core of the problem. >> Okay. So if I may, I just want to drill into this a little bit. So you're talking about the expressiveness of a graph. Alex, if you'd bring that back out, the fourth bullet, expressiveness of a graph database with the relational ease of query. Can you explain what you mean by that? >> Yeah, so graphs are great because you can describe anything with a graph, that's why they're becoming so popular. Expressive means you can represent anything easily. They're conducive to, you might say, a world where we now want like the metaverse, like with a 3D world, and I don't mean the Facebook metaverse, I mean like the business metaverse when we want to capture data about everything, but we want it in context, we want to build a set of digital twins that represent everything going on in the world. And Uber is a tiny example of that. Uber built a graph to represent all the drivers and riders and maps and routes. But what you need out of a database isn't just a way to store stuff and update stuff. You need to be able to ask questions of it, you need to be able to query it. And if you go back to prerelational days, you had to know how to find your way to the data.
It's sort of like when you give directions to someone and they didn't have a GPS system and a mapping system, you had to give them turn by turn directions. Whereas when you have a GPS and a mapping system, which is like the relational thing, you just say where you want to go, and it spits out the turn by turn directions, which let's say, the car might follow or whoever you're directing would follow. But the point is, it's much easier in a relational database to say, "I just want to get these results. You figure out how to get it." Graph databases have not taken over the world because in some ways, it's taking a 50 year leap backwards. >> Alright, got it. Okay. Let's take a look at how the current Databricks offerings map to that ideal state that we just laid out. So to do that, we put together this chart that looks at the key elements of the Databricks portfolio, the core capability, the weakness, and the threat that may loom. Start with the Delta Lake, that's the storage layer, which is great for files and tables. It's got true separation of compute and storage, I want you to double click on that George, as independent elements, but it's weaker for the type of low latency ingest that we see coming in the future. And some of the threats are highlighted here. AWS could add transactional tables to S3, Iceberg adoption is picking up and could accelerate, that could disrupt Databricks. George, add some color here please. >> Okay, so this is sort of a classic competitive forces analysis, where you want to look at: what are customers demanding? What's competitive pressure? What are substitutes? Even what your suppliers might be pushing. Here, Delta Lake is, at its core, a set of transactional tables that sit on an object store. So think of it in a database system, this is the storage engine. So since S3 has been getting stronger for 15 years, you could see a scenario where they add transactional tables.
We have an open source alternative in Iceberg, which Snowflake and others support. But at the same time, Databricks has built an ecosystem out of tools, their own and others, that read and write to Delta tables, that's what makes the Delta Lake an ecosystem. So they have a catalog, the whole machine learning tool chain talks directly to the data here. That was their great advantage because in the past with Snowflake, you had to pull all the data out of the database before the machine learning tools could work with it, that was a major shortcoming. They fixed that. But the point here is that even before we get to the semantic layer, the core foundation is under threat. >> Yep. Got it. Okay. We got a lot of ground to cover. So we're going to take a look at the Spark Execution Engine next. Think of that as the refinery that runs really efficient batch processing. That's kind of what disrupted Hadoop in a large way, but it's not Python friendly and that's an issue because the data science and the data engineering crowd are moving in that direction, and/or they're using dbt. George, we had Tristan Handy on at Supercloud, really interesting discussion that you and I did. Explain why this is an issue for Databricks? >> So once the data lake was in place, what people did was they refined their data in batch, and Spark has always had streaming support and it's gotten better. The underlying storage as we've talked about is an issue. But basically they took raw data, then they refined it into tables that were like customers and products and partners. And then they refined that again into what was like gold artifacts, which might be business intelligence metrics or dashboards, which were collections of metrics. But they were running it on the Spark Execution Engine, which is a Java-based engine or it's running on a Java-based virtual machine, which means all the data scientists and the data engineers who want to work with Python are really working in sort of oil and water.
Like if you get an error in Python, you can't tell whether the problem's in Python or whether it's in Spark. There's just an impedance mismatch between the two. And then at the same time, the whole world is now gravitating towards dbt because it's a very nice and simple way to compose these data processing pipelines, and people are using either SQL in dbt or Python in dbt, and that kind of is a substitute for doing it all in Spark. So it's under threat even before we get to that semantic layer. It so happens that dbt itself is becoming the authoring environment for the semantic layer with business intelligence metrics. But again, this is the second element that's under direct substitution and competitive threat. >> Okay, let's now move down to the third element, which is Photon. Photon is Databricks' BI Lakehouse, which has integration with the Databricks tooling, which is very rich, it's newer. And it's also not well suited for high concurrency and low latency use cases, which we think are going to increasingly become the norm over time. George, the call out threat here is customers want to connect everything to a semantic layer. Explain your thinking here and why this is a potential threat to Databricks? >> Okay, so two issues here. What you were touching on, which is the high concurrency, low latency, when people are running like thousands of dashboards and data is streaming in, that's a problem because a SQL data warehouse query engine, something like that matures over five to 10 years. It's one of these things, the joke that Andy Jassy makes just in general, he's really talking about Azure, but there's no compression algorithm for experience. The Snowflake guys started more than five years earlier, and for a bunch of reasons, that lead is not something that Databricks can shrink. They'll always be behind. So that's why Snowflake has transactional tables now and we can get into that in another show.
But the key point is, so near term, it's struggling to keep up with the use cases that are core to business intelligence, which is highly concurrent, lots of users doing interactive query. But then when you get to a semantic layer, that's when you need to be able to query data that might have thousands or tens of thousands or hundreds of thousands of joins. And a traditional SQL query engine is just not built for that. That's the core problem of traditional relational databases. >> Now this is a quick aside. We always talk about Snowflake and Databricks in sort of the same context. We're not necessarily saying that Snowflake is in a position to tackle all these problems. We'll deal with that separately. So we don't mean to imply that, but we're just sort of laying out some of the things that Snowflake or rather Databricks customers, we think, need to be thinking about and having conversations with Databricks about, and we hope to have them as well. We'll come back to that in terms of sort of strategic options. But finally, when we come back to the table, we have Databricks' AI/ML Tool Chain, which has been an awesome capability for the data science crowd. It's comprehensive, it's a one-stop shop solution, but the kicker here is that it's optimized for supervised model building. And the concern is that foundational models like GPT could cannibalize the current Databricks tooling, but George, can't Databricks, like other software companies, integrate foundation model capabilities into its platform? >> Okay, so the sound bite answer to that is sure, IBM 3270 terminals could call out to a graphical user interface when they're running on the XT terminal, but they're not exactly good citizens in that world. The core issue is Databricks has this wonderful end-to-end tool chain for training, deploying, monitoring, running inference on supervised models.
But the paradigm there is the customer builds and trains and deploys each model for each feature or application. In a world of foundation models which are pre-trained and unsupervised, the entire tool chain is different. So it's not like Databricks can junk everything they've done and start over with all their engineers. They have to keep maintaining what they've done in the old world, but they have to build something new that's optimized for the new world. It's a classic technology transition and their mentality appears to be, "Oh, we'll support the new stuff from our old stuff." Which is suboptimal, and as we'll talk about, their biggest patron and the company that put them on the map, Microsoft, really stopped working on their old stuff three years ago so that they could build a new tool chain optimized for this new world. >> Yeah, and so let's sort of close with what we think the options are and decisions that Databricks has for its future architecture. They're smart people. I mean we've had Ali Ghodsi on many times, super impressive. I think they've got to be keenly aware of the limitations, what's going on with foundation models. But at any rate, here in this chart, we lay out sort of three scenarios. One is re-architect the platform by incrementally adopting new technologies. An example might be to layer a graph query engine on top of its stack. They could license key technologies like graph database, they could get aggressive on M&A and buy in relational knowledge graphs, semantic technologies, vector database technologies. George, as David Floyer always says, "A lot of ways to skin a cat." We've seen companies, even think about EMC, maintain relevance through M&A for many, many years. George, give us your thought on each of these strategic options? >> Okay, I find this question the most challenging 'cause remember, I used to be an equity research analyst.
I worked for Frank Quattrone, we were one of the top tech shops in the banking industry, although this is 20 years ago. But the M&A team was the top team in the industry and everyone wanted them on their side. And I remember going to meetings with these CEOs, where Frank and the bankers would say, "You want us for your M&A work because we can do better." And they really could do better. But in software, it's not like with EMC in hardware because with hardware, it's easier to connect different boxes. With software, the whole point of a software company is to integrate and architect the components so they fit together and reinforce each other, and that makes M&A harder. You can do it, but it takes a long time to fit the pieces together. Let me give you examples. If they put a graph query engine, let's say something like TinkerPop, on top of, I don't even know if it's possible, but let's say they put it on top of Delta Lake, then you have this graph query engine talking to their storage layer, Delta Lake. But if you want to do analysis, you got to put the data in Photon, which is not really ideal for highly connected data. If you license a graph database, then most of your data is in the Delta Lake and how do you sync it with the graph database? If you do sync it, you've got data in two places, which kind of defeats the purpose of having a unified repository. I find this semantic layer option in number three actually more promising, because that's something that you can layer on top of the storage layer that you have already. You just have to figure out then how to have your query engines talk to that. What I'm trying to highlight is, it's easy as an analyst to say, "You can buy this company or license that technology." But the really hard work is making it all work together and that is where the challenge is. >> Yeah, and well look, I thank you for laying that out. We've seen it, certainly Microsoft and Oracle. 
I guess you might argue that well, Microsoft had a monopoly in its desktop software and was able to throw off cash for a decade plus while its stock was going sideways. Oracle had won the database wars and had amazing margins and cash flow to be able to do that. Databricks hasn't even gone public yet, but I want to close with some of the players to watch. Alex, if you'd bring that back up, number four here. AWS, we talked about some of their options with S3 and it's not just AWS, it's blob storage, object storage. Microsoft, as you sort of alluded to, was an early go-to market channel for Databricks. We didn't address that really. So maybe in the closing comments we can. Google obviously, Snowflake of course, we're going to dissect their options in future Breaking Analysis. dbt Labs, where do they fit? Bob Muglia's company, Relational.ai, why are these players to watch George, in your opinion? >> So everyone is trying to assemble and integrate the pieces that would make building data applications, data products easy. And the critical part isn't just assembling a bunch of pieces, which is traditionally what AWS did. It's a Unix ethos, which is we give you the tools, you put 'em together, 'cause you then have the maximum choice and maximum power. So what the hyperscalers are doing is they're taking their key value stores, in the case of AWS it's DynamoDB, in the case of Azure it's Cosmos DB, and each is putting a graph query engine on top of those. So they have a unified storage and graph database engine, like all the data would be collected in the key value store. Then you have a graph database, that's how they're going to be presenting a foundation for building these data apps. dbt Labs is putting a semantic layer on top of data lakes and data warehouses and as we'll talk about, I'm sure in the future, that makes it easier to swap out the underlying data platform or swap in new ones for specialized use cases.
Snowflake, what they're doing, they're so strong in data management and with their transactional tables, what they're trying to do is take in the operational data that used to be in the province of many state stores like MongoDB and say, "If you manage that data with us, it'll be connected to your analytic data without having to send it through a pipeline." And that's hugely valuable. Relational.ai is the wildcard, 'cause what they're trying to do, it's almost like a holy grail where you're trying to take the expressiveness of connecting all your data in a graph but making it as easy to query as you've always had it in a SQL database or I should say, in a relational database. And if they do that, it's sort of like, it'll be as easy to program these data apps as a spreadsheet was compared to procedural languages, like BASIC or Pascal. That's the implication of Relational.ai. >> Yeah, and again, we talked before, why can't you just throw this all in memory? We're talking in that example of really getting down to differences in how you lay the data out on disk in a really new database architecture, correct? >> Yes. And that's why it's not clear that you could take a data lake or even a Snowflake and put a relational knowledge graph on those. You could potentially put a graph database, but it'll be compromised because to really do what Relational.ai has done, which is the ease of Relational on top of the power of graph, you actually need to change how you're storing your data on disk or even in memory. So you can't, in other words, it's not like, oh we can add graph support to Snowflake, 'cause if you did that, you'd have to change, or in your data lake, you'd have to change how the data is physically laid out. And then that would break all the tools that talk to that currently. >> What in your estimation, is the timeframe where this becomes critical for a Databricks and potentially Snowflake and others?
I mentioned earlier midterm, are we talking three to five years here? Are we talking end of decade? What's your radar say? >> I think something surprising is going on that's going to sort of come up the tailpipe and take everyone by storm. All the hype around business intelligence metrics, which is what we used to put in our dashboards, were bookings, billings, revenue, customers, those things, those were the key artifacts that used to live in definitions in your BI tools, and dbt has basically created a standard for defining those so they live in your data pipeline or they're defined in your data pipeline and executed in the data warehouse or data lake in a shared way, so that all tools can use them. This sounds like a digression, it's not. All this stuff about data mesh, data fabric, all that's going on is we need a semantic layer and the business intelligence metrics are defining common semantics for your data. And I think we're going to find by the end of this year, that metrics are how we annotate all our analytic data to start adding common semantics to it. And we're going to find this semantic layer, it's not three to five years off, it's going to be staring us in the face by the end of this year. >> Interesting. And of course SVB today was shut down. We're seeing serious tech headwinds, and oftentimes in these sort of downturns or flat turns, which feels like this could be going on for a while, we emerge with a lot of new players and a lot of new technology. George, we got to leave it there. Thank you to George Gilbert for excellent insights and input for today's episode. I want to thank Alex Myerson who's on production and manages the podcast, of course Ken Schiffman as well. Kristin Martin and Cheryl Knight help get the word out on social media and in our newsletters. And Rob Hof is our EIC over at siliconangle.com, he does some great editing. Remember all these episodes, they're available as podcasts.
Wherever you listen, all you got to do is search Breaking Analysis Podcast, we publish each week on wikibon.com and siliconangle.com, or you can email me at David.Vellante@siliconangle.com, or DM me @DVellante. Comment on our LinkedIn post, and please do check out ETR.ai, great survey data, enterprise tech focus, phenomenal. This is Dave Vellante for theCUBE Insights powered by ETR. Thanks for watching, and we'll see you next time on Breaking Analysis.

Published Date : Mar 10 2023


ENTITIES

Entity	Category	Confidence
Alex Myerson	PERSON	0.99+
David Floyer	PERSON	0.99+
Mike Olson	PERSON	0.99+
2014	DATE	0.99+
George Gilbert	PERSON	0.99+
Dave Vellante	PERSON	0.99+
George	PERSON	0.99+
Cheryl Knight	PERSON	0.99+
Ken Schiffman	PERSON	0.99+
Andy Jassy	PERSON	0.99+
Oracle	ORGANIZATION	0.99+
Amazon	ORGANIZATION	0.99+
Erik Bradley	PERSON	0.99+
Dave	PERSON	0.99+
Uber	ORGANIZATION	0.99+
thousands	QUANTITY	0.99+
Sun Microsystems	ORGANIZATION	0.99+
50 years	QUANTITY	0.99+
AWS	ORGANIZATION	0.99+
Bob Muglia	PERSON	0.99+
Gartner	ORGANIZATION	0.99+
Airbnb	ORGANIZATION	0.99+
60 years	QUANTITY	0.99+
Microsoft	ORGANIZATION	0.99+
Ali Ghodsi	PERSON	0.99+
2010	DATE	0.99+
Databricks	ORGANIZATION	0.99+
Kristin Martin	PERSON	0.99+
Rob Hof	PERSON	0.99+
three	QUANTITY	0.99+
15 years	QUANTITY	0.99+
Databricks'	ORGANIZATION	0.99+
two places	QUANTITY	0.99+
Boston	LOCATION	0.99+
Tristan Handy	PERSON	0.99+
M&A	ORGANIZATION	0.99+
Frank Quattrone	PERSON	0.99+
second element	QUANTITY	0.99+
Daren Brabham	PERSON	0.99+
TechAlpha Partners	ORGANIZATION	0.99+
third element	QUANTITY	0.99+
Snowflake	ORGANIZATION	0.99+
50 year	QUANTITY	0.99+
40%	QUANTITY	0.99+
Cloudera	ORGANIZATION	0.99+
Palo Alto	LOCATION	0.99+
five years	QUANTITY	0.99+

Ali Ghodsi, Databricks | Cube Conversation Partner Exclusive

(outro music) >> Hey, I'm John Furrier, here with an exclusive interview with Ali Ghodsi, who's the CEO of Databricks. Ali, great to see you. Preview for re:Invent. We're going to launch this story, exclusive Databricks material on the notes, after the keynotes prior to the keynotes and after the keynotes at re:Invent. So great to see you. You know, you've been a partner of AWS for a very, very long time. I think five years ago, I think I first interviewed you, you were one of the first to publicly declare that this was a place to build a company on and not just host an application, but refactor capabilities to create, essentially a platform in the cloud, on the cloud. Not just an ISV, Independent Software Vendor, kind of an old term, we're talking about real platform-like capability to change the game. Can you talk about your experience as an AWS partner? >> Yeah, look, so we started in 2013. I swiped my personal credit card on AWS and some of my co-founders did the same. And we started building. And we were excited because we just thought this is a much better way to launch a company because you can just much faster get time to market and launch your thing and you can get the end users much quicker access to the thing you're building. So we didn't really talk to anyone at AWS, we just swiped a credit card. And eventually they told us, "Hey, do you want to buy extra support?" "You're asking a lot of advanced questions from us." "Maybe you want to buy our advanced support." And we said, no, no, no, no. We're very advanced ourselves, we know what we're doing. We're not going to buy any advanced support. So, you know, we just built this, you know, startup from nothing on AWS without even talking to anyone there. So at some point, I think around 2017, they suddenly saw this company with maybe a hundred million ARR pop up on their radar and it's driving massive amounts of compute, massive amounts of data. 
And it took a little bit in the beginning just for us to get to know each other because as I said, it's like we were not on their radar and we weren't really looking, we were just doing our thing. And then over the years the partnership has deepened and deepened and deepened and then with, you know, Andy (indistinct) really leaning into the partnership, he mentioned us at re:Invent. And then we sort of figured out a way to really integrate the two services, the Databricks platform with AWS. And today it's an amazing partnership. You know, we're directly connected with the general managers for the services. We're connected at the CEO level, you know, the sellers get compensated for pushing Databricks, we have multiple offerings on their marketplace. We have a native offering on AWS. You know, we're prominently always sort of marketed and you know, we're aligned also vision wise in what we're trying to do. So yeah, we've come a very, very long way. >> Do you consider yourself a SaaS app or an ISV or do you see yourself more of a platform company because you have customers. How would you categorize your category as a company? >> Well, it's a data platform, right? And actually, the strategy of Databricks is take what's otherwise five, six services in the industry or five, six different startups, but do them as part of one data platform that's integrated. So in one word, the strategy of Databricks is "unification." We call it the data lake house. But really the idea behind the data lake house is that of unification, or in more words it's, "The whole is greater than the sum of its parts." So you could actually go and buy five, six services out there or actually use five, six services from the cloud vendors, stitch it together and it kind of resembles Databricks. Our power is in doing those integrated, together in a way in which it's really, really easy and simple to use for end users. So yeah, we're a data platform. 
I wouldn't, you know, ISV, that's an old term, you know, Independent Software Vendor. You know, I think, you know, we have actually a whole slew of ISVs on top of Databricks, that integrate with our platform. And you know, in our marketplace as well as in our partner connect, we host those ISVs that then, you know, work on top of the data that we have in the Databricks data lake house. >> You know, I think one of the things your journey has been great to document and watch from the beginning. I got to give you guys credit over there and props, congratulations. But I think you're the poster child as a company to what we see enterprises doing now. So go back in time when you guys swiped a credit card, you didn't need advanced technical support because you guys had brains, you were refactoring, rethinking. It wasn't just banging out software, you were doing some complex things. It wasn't like it was just write some software hosted on a server. It was really a lot more. And as a result your business is worth billions of dollars. I think 38 billion or something like that, big numbers, big numbers of great revenue growth as well, billions in revenue. You have customers, you have an ecosystem, you have data applications on top of Databricks. So in a way you're a cloud on top of the cloud. So is there a cloud on top of the cloud? So you have ISVs, Amazon has ISVs. Can you take us through what this means and at this point in history, because this seems to be an advanced version of benefits of platforming and refactoring, leveraging say AWS. >> Yeah, so look, when we started, there was really only one game in town. It was AWS. So it was one cloud. And the strategy of the company then was, well Amazon had this beautiful set of services that they're building bottom up, they have storage, compute, networking, and then they have databases and so on. But it's a lot of services. So let us not directly compete with AWS and try to take out one of their services. 
Let's not do that because frankly we can't. We were not of that size. They had the scale, they had the size and they were the only cloud vendor in town. So our strategy instead was, let's do something else. Let's not compete directly with say, a particular service they're building, let's take a different strategy. What if we had a unified holistic data platform, where it's just one integrated service end to end. So think of it as Microsoft Office, which contains PowerPoint, and Word, and Excel and even Access, if you want to use it. What if we build that and AWS has this really amazing knack for releasing things, you know services, lots of them, every re:Invent. And they're sort of a DevOps person's dream and you can stitch these together and you know you have to be technical. How do we elevate that and make it simpler and integrate it? That was our original strategy and it resonated with a segment of the market. And the reason it worked with AWS so that we wouldn't butt heads with AWS was because we weren't a direct replacement for this service or for that service, we were taking a different approach. And AWS, because credit goes to them, they're so customer obsessed, they would actually do what's right for the customer. So if the customer said we want this unified thing, their sellers would actually say, okay, so then you should use Databricks. So they truly are customer obsessed in that way. And I really mean it, John. Things have changed over the years. They're not the only cloud anymore. You know, Azure is real, GCP is real, there's also Alibaba. And now over 70% of our customers are on more than one cloud. So now what we hear from them is, not only do we want a simplified, unified thing, but we want it also to work across the clouds. Because those of them that are seriously considering multiple clouds, they don't want to use a service on cloud one and then use a similar service on cloud two. But it's a little bit different. 
And now they have to do twice the work to make it work. You know, John, it's hard enough as it is, like it's this data stuff and analytics. It's not a walk in the park, you know. You hire an administrator in the back office that clicks a button and it's just, now you're a data-driven, digitally transformed company. It's hard. If you now have to do it again on the second cloud with different set of services and then again on a third cloud with a different set of services. That's very, very costly. So the strategy then has changed that, how do we take that unified simple approach and make it also the same and standardize across the clouds, but then also integrate it as far down as we can on each of the clouds. So that you're not giving up any of the benefits that the particular cloud has. >> Yeah, I think one of the things that we see, and I want to get your reaction to this, is this rise of the super cloud as we call it. I think you were involved in the Sky paper that I saw your position paper came out after we had introduced Super Cloud, which is great. Congratulations to the Berkeley team, wearing the hat here. But you guys are, I think a driver of this because you're creating the need for these things. You're saying, okay, we went on one cloud with AWS and you didn't hide that. And now you're publicly saying there's other clouds too, increased TAM for your business. And customers have multiple clouds in their infrastructure for the best of breed that they have. Okay, get that. But there's still a challenge around the innovation, growth that's still around the corner. We still have a supply chain problem, we still have skill gaps. You know, you guys are unique at Databricks, as are these other big examples of super clouds that are developing. Enterprises don't have the Databricks kind of talent. They need, they need turnkey solutions. So Adam and the team at Amazon are promoting, you know, more solution oriented approaches higher up on the stack. 
You're starting to see kind of like, I won't say templates, but you know, almost like application specific headless like, low code, no code capability to accelerate clients who are wanting to write code for the modern era. Right, so this kind of, and then now you, as you guys pointed out with these common services, you're pushing the envelope. So you're saying, hey, I need to compete, I don't want to go to my customers and have them have a staff for this cloud and this cloud and this cloud because they don't have the staff. Or if they do, they're very unique. So what's your reaction? Because this kind of shows your leadership as a partner of AWS and the clouds, but also highlights I think what's coming. But you share your reaction. >> Yeah, look, it's, first of all, you know, I wish I could take credit for this but I can't because it's really the customers that have decided to go on multiple clouds. You know, it's not Databricks that you know, pushed this or some other vendor, you know, that, Snowflake or someone who pushed this and now enterprises listened to us and they picked two clouds. That's not how it happened. The enterprises picked two clouds or three clouds themselves and we can get into why, but they did that. So this largely just happened in the market. We as data platforms responded to what they're then saying, which is they're saying, "I don't want to redo this again on the other cloud." So I think the writing is on the wall. I think it's super obvious what's going to happen next. They will say, "Any service I'm using, it better work exactly the same on all the clouds." You know, that's what's going to happen. So in the next five years, every enterprise will say, "I'm going to use the service, but you better make sure that this service works equally well on all of the clouds." And obviously the multicloud vendors like us, are there to do that. 
But I actually think that what you're going to see happening is that you're going to see the cloud vendors changing the existing services that they have to make them work on the other clouds. That's what's going to happen, I think. >> Yeah, and I think I would add that, first of all, I agree with you. I think that's going to be a forcing function. Because I think you're driving it. You guys are in a way just one actor in driving this because you're on the front end of this and there are others and there will be people following. But I think to me, I'm a cloud vendor, I got to differentiate. If I'm Adam Selipsky, I got to say, "Hey, I got to differentiate." So I don't want to get stuck in the middle, so to speak. Am I just going to innovate on the hardware AKA infrastructure or am I going to innovate at the higher level services? So what we're talking about here is the tale of two clouds within Amazon, for instance. So do I innovate on the silicon and get low level into the physics and squeeze performance out of the hardware and infrastructure? Or do I focus on ease of use at the top of the stack for the developers? So again, there's a tale of two clouds here. So I got to ask you, how do they differentiate? Number one and number two, I never heard a developer ever say, "I want to run my app or workload on the slower cloud." So I mean, you know, back when we had PCs you wanted to go, "I want the fastest processor." So again, you can have common level services, but where is that performance differentiation with the cloud? What do the clouds do in your opinion? >> Yeah, look, I think it's pretty clear. I think that it's, this is, you know, no surprise. Probably 70% or so of the revenue is in the lower infrastructure layers, compute, storage, networking. And they have to win that. They have to be competitive there. As you said, you can say, oh you know, I guess my CPUs are slower than the other cloud, but who cares? 
I have amazing other services which only work on my cloud by the way, right? That's not going to be a winning recipe. So I think all three are laser focused on, we're going to have specialized hardware and the nuts and bolts of the infrastructure, we can do it better than the other clouds for sure. And you can see lots of innovation happening there, right? The Graviton chips, you know, we see huge price performance benefits in those chips. I mean it's real, right? It's basically a 20, 30% free lunch. You know, why wouldn't you, why wouldn't you go for it there? There's no downside. You know, there's no "gotcha" or no catch. But we see Azure doing the same thing now, they're also building their own chips and we know that Google builds specialized machine learning chips, TPUs, Tensor Processing Units. So they're laser focused on that. I don't think they can give up that or focus on higher levels if they had to pick bets. And I think actually in the next few years, most of us have to make more, we have to be more deliberate and calculated in the picks we do. I think in the last five years, most of us have said, "We'll do all of it." You know. >> Well you made a good bet with Spark, you know, Hadoop was a pretty obvious trend, everyone was jumping on that bandwagon and you guys picked a big bet with Spark. Look what happened with you guys? So again, I love this betting kind of concept because as the world matures, growth slows down and shifts and that next wave of value coming in, AKA customers, they're going to integrate with a new ecosystem. A new kind of partner network for AWS and the other clouds. But with AWS they're going to need to nurture the next Databricks. They're going to need to still provide that SaaS, ISV like experience for, you know, a basic software hosting or some application. 
But I got to get your thoughts on this idea of multiple clouds because if I'm a developer, the old days was, old days, within our decade, full stack developer- >> It was two years ago, yeah (John laughing) >> This is a decade ago, full stack and then the cloud came in, you kind of had the half stack and then you would do some things. It seems like the clouds are trying to say, we want to be the full stack or not. Or is it still going to be, you know, I'm an application like a PC and a Mac, I'm going to write the same application for both hardware. I mean what's your take on this? Are they trying to do full stack and you see them more like- >> Absolutely. I mean look, of course they're going, they have, I mean they have over 300, I think Amazon has over 300 services, right? That's not just compute, storage, networking, it's the whole stack, right? But my key point is, I think they have to nail the core infrastructure, storage, compute, networking, because the three clouds that are there competing, they're formidable companies with formidable balance sheets and it doesn't look like any of them is going to throw in the towel and say, we give up. So I think it's going to intensify. And given that they have 70% of revenue on that infrastructure layer, I think they, if they have to pick their bets, I think they'll focus it on that infrastructure layer. I think the layer above where they're also placing bets, they're doing that, the full stack, right? But there I think the demand will be, can you make that work on the other clouds? And therein lies an innovator's dilemma because if I make it work on the other clouds, then I'm foregoing that 70% revenue of the infrastructure. I'm not getting it. The other cloud vendor is going to get it. So should I do that or not? Second, is the other cloud vendor going to be welcoming of me making my service work on their cloud if I am a competing cloud, right? And what kind of terms of service are they giving me? 
And am I going to really invest in doing that? And I think right now we, you know, most, the vast, vast, vast majority of the services only work on the one cloud that you know, it's built on. It doesn't work on others, but this will shift. >> Yeah, I think the innovator's dilemma is also a very good point. And I'd also add, it's an integrator's dilemma too because now you talk about integration across services. So I believe that the super cloud movement's going to happen before Sky. And I think what explained by that, what you guys did and what other companies are doing by representing advanced, I call platform engineering, refactoring an existing market really fast, time to value and CAPEX is, I mean capital, market cap is going to be really fast. I think there's going to be an opportunity for those to emerge that's going to set the table for global multicloud ultimately in the future. So I think you're going to start to see the same pattern of what you guys did get in, leverage the hell out of it, use it, not in the way just to host, but to refactor and take down territory of markets. So number one, and then ultimately you get into, okay, I want to run some SLA across services, then there's a little bit more complication. I think that's where you guys put that beautiful paper out on Sky Computing. Okay, that makes sense. Now if you go to today's market, okay, I'm betting on Amazon because they're the best, this is the best cloud win scenario, not the most robust cloud. So if I'm a developer, I want the best. How do you look at their bet when it comes to data? Because now they've got machine learning, Swami's got a big keynote on Wednesday, I'm expecting to see a lot of AI and machine learning. I'm expecting to hear an end to end data story. This is what you do, so as a major partner, how do you view the moves Amazon's making and the bets they're making with data and machine learning and AI? >> First I want to lift off my hat to AWS for being customer obsessed. 
So I know that if a customer wants Databricks, I know that AWS and their sellers will actually help us get that customer to deploy Databricks. Now which of the services is the customer going to pick? Are they going to pick ours or the end to end, what Swami is going to present on stage? Right? So that's the question we're getting. But I wanted to start by just saying, they're customer obsessed. So I think they're going to do the right thing for the customer and I see the evidence of it again and again and again. So kudos to them. They're amazing at this actually. Ultimately our bet is, customers want this to be simple, integrated, okay? So yes there are hundreds of services that together give you the end to end experience and they're very customizable that AWS gives you. But if you want just something simply integrated that also works across the clouds, then I think there's a special place for Databricks. And I think the lake house approach that we have, which is an integrated, completely integrated, we integrate data lakes with data warehouses, integrate workflows with machine learning, with real time processing, all these in one platform. I think there's going to be tailwinds because I think the most important thing that's going to happen in the next few years is that every customer is going to now be obsessed, given the recession and the environment we're in. How do I cut my costs? How do I cut my costs? And we learn this from the customers they're adopting the lake house because they're thinking, instead of using five vendors or three vendors, I can simplify it down to one with you and I can cut my cost. So I think that's going to be one of the main drivers of why people bet on the lake house because it helps them lower their TCO, Total Cost of Ownership. And it's as simple as that. Like I have three things right now. If I can get the same job done of those three with one, I'd rather do that. 
And by the way, if it's three or four across two clouds and I can just use one and it just works across two clouds, I'm going to do that. Because my boss is telling me I need to cut my budget. >> (indistinct) (John laughing) >> Yeah, and I'd rather not do layoffs and they're asking me to do more. How can I get smaller budgets, not lay people off and do more? I have to cut, I have to optimize. What's happened in the last five, six years is there's been a huge sprawl of services and startups, you know, you know most of them, all these startups, all of them, all the activity, all the VC investments, well those companies sold their software, right? Even if a startup didn't make it big, you know, they still sold their software to some vendors. So the ecosystem is now full of lots and lots and lots and lots of different software. And right now people are looking, how do I consolidate, how do I simplify, how do I cut my costs? >> And you guys have a great solution. You're also an arms dealer and an innovator. So I have to ask this question, because you're a professor of the industry as well as at Berkeley, you've seen a lot of the historical innovations. If you look at the moment we're in right now with the recession, okay we had COVID, okay, it changed how people work, you know, people working at home, provisioning VLAN, all that (indistinct) infrastructure, okay, yeah, technology and cloud health. But we're in a recession. This is the first recession where Amazon and the other clouds, mainly Amazon Web Services, are a major piece in the economic puzzle. So they were never around before, even 2008, they were too small. They're now a major economic enabler, player, they're serving startups, enterprises, they have super clouds like you guys. They're a force and the people, their customers are cutting back but also they can also get faster. So agility is now an equation in the economic recovery. And I want to get your thoughts because you just brought that up. 
Customers can actually use the cloud and Databricks to actually get out of the recovery because no one's going to say, stop making profit or make more profit. So yeah, cut costs, be more efficient, but agility's also like, let's drive more revenue. So in this digital transformation, if you take this to conclusion, every company transforms, their company is the app. So their revenue is tied directly to their technology deployment. What's your reaction and comment to that because this is a new historical moment where cloud and scale and data, actually could be configured in a way to actually change the nature of a business in such a short time. And with the recession looming, no one's got time to wait. >> Yeah, absolutely. Look, the secular tailwind in the market is that of, you know, 10 years ago it was software is eating the world, now it's AI's going to eat all of software. So more and more we're going to have, wherever you have software, which is everywhere now because it's eaten the world, it's going to be eaten up by AI and data. You know, AI doesn't exist without data so they're synonymous. You can't do machine learning if you don't have data. So yeah, you're going to see that everywhere and that automation will help people simplify things and cut down the costs and automate more things. And in the cloud you can also do that by changing your CAPEX to OPEX. So instead of I invest, you know, 10 million into a data center that I buy, I'm going to have headcount to manage the software. Why don't we change this to OPEX? And then they are going to optimize it. They want to lower the TCO because okay, it's in the cloud, but I do want the costs to be much lower than what they were in the previous years. Last five years, nobody cared. Who cares? You know what it costs. You know, there's a new brave world out there. Now there's like, no, it has to be efficient. So I think they're going to optimize it. 
And I think this lake house approach, which is an integration of the lakes and the warehouse, allows you to rationalize the two and simplify them. It allows you to basically rationalize away the data warehouse. So I think much faster we're going to see the, why do I need the data warehouse? If I can get the same thing done with the lake house for a fraction of the cost, that's what's going to happen. I think there's going to be focus on that simplification. But I agree with you. Ultimately everyone knows, everybody's a software company. Every company out there is a software company and in the next 10 years, all of them are also going to be AI companies. So that is going to continue. >> (indistinct), dev's going to stop. And right sizing right now is a key economic forcing function. Final question for you and I really appreciate you taking the time. This year's re:Invent, what's the bumper sticker in your mind around what's the most important industry dynamic, power dynamic, ecosystem dynamic that people should pay attention to as we move from the brave new world of okay, I see cloud, cloud operations. I need to really make it structurally change my business. How do I, what's the most important story? What's the bumper sticker in your mind for re:Invent? >> Bumper sticker? Lake house 24. (John laughing) >> That's data (indistinct) bumper sticker. What's the- >> (indistinct) in the market. No, no, no, no. You know, it's, AWS talks about, you know, all of their services becoming a lake house because they want the center of the gravity to be S3, their lake. And they want all the services to directly work on that, so that's a lake house. We see Microsoft with Synapse, modern, you know the modern intelligent data platform. Same thing there. We're going to see the same thing, we're already seeing it on GCP with BigLake and so on. So I actually think it's the how do I reduce my costs and the lake house integrates those two. 
So that's one of the main ways you can rationalize and simplify. You get in the lake house, which is the name itself is a (indistinct) of two things, right? Lake house, "lake" gives you the AI, "house" gives you the database data warehouse. So you get your AI and you get your data warehousing in one place at the lower cost. So for me, the bumper sticker is lake house, you know, 24. >> All right. Awesome Ali, well thanks for the exclusive interview. Appreciate it and great to see you. Congratulations on your success and I know you guys are going to be fine. >> Awesome. Thank you John. It's always a pleasure. >> Always great to chat with you again. >> Likewise. >> You guys are a great team. We're big fans of what you guys have done. We think you're an example of what we call "super cloud." Which is getting the hype up and again your paper speaks to some of the innovation, which I agree with by the way. I think that that approach of not forcing standards is really smart. And I think that's absolutely correct, that having the market still innovate is going to be key. Standards with- >> Yeah, I love it. We're big fans too, you know, you're doing awesome work. We'd love to continue the partnership. >> So, great, great Ali, thanks. >> Take care (outro music)

Published Date : Nov 23 2022


ENTITIES

Entity	Category	Confidence
Amazon	ORGANIZATION	0.99+
John	PERSON	0.99+
Ali Ghodsi	PERSON	0.99+
Adam	PERSON	0.99+
AWS	ORGANIZATION	0.99+
2013	DATE	0.99+
Google	ORGANIZATION	0.99+
Alibaba	ORGANIZATION	0.99+
2008	DATE	0.99+
five vendors	QUANTITY	0.99+
Adam Saleski	PERSON	0.99+
five	QUANTITY	0.99+
John Furrier	PERSON	0.99+
Ali	PERSON	0.99+
Databricks	ORGANIZATION	0.99+
three vendors	QUANTITY	0.99+
70%	QUANTITY	0.99+
Wednesday	DATE	0.99+
Excel	TITLE	0.99+
38 billion	QUANTITY	0.99+
four	QUANTITY	0.99+
Amazon Web Services	ORGANIZATION	0.99+
Word	TITLE	0.99+
three	QUANTITY	0.99+
two clouds	QUANTITY	0.99+
Andy	PERSON	0.99+
three clouds	QUANTITY	0.99+
10 million	QUANTITY	0.99+
PowerPoint	TITLE	0.99+
one	QUANTITY	0.99+
two	QUANTITY	0.99+
twice	QUANTITY	0.99+
Second	QUANTITY	0.99+
over 300 services	QUANTITY	0.99+
one game	QUANTITY	0.99+
second cloud	QUANTITY	0.99+
Snowflake	ORGANIZATION	0.99+
Sky	ORGANIZATION	0.99+
one word	QUANTITY	0.99+
OPEX	ORGANIZATION	0.99+
two things	QUANTITY	0.98+
two years ago	DATE	0.98+
Access	TITLE	0.98+
over 300	QUANTITY	0.98+
six years	QUANTITY	0.98+
over 70%	QUANTITY	0.98+
five years ago	DATE	0.98+

Jack Andersen & Joel Minnick, Databricks | AWS Marketplace Seller Conference 2022

(upbeat music) >> Welcome back everyone to theCUBE's coverage here in Seattle, Washington, for AWS's Marketplace Seller Conference. It's the big news within the Amazon Partner Network, combining with Marketplace, forming the Amazon Partner Organization. Part of a big reorg as they grow to the next level, NextGen cloud, mid-game on the chessboard. theCUBE's got it covered. I'm John Furrier, your host at theCUBE. Great guests here from Databricks. Both CUBE alumni. Jack Andersen, GM and VP of the Databricks partnership team for AWS. You handle that relationship. And Joel Minnick, vice president of product and partner marketing. You guys have the keys to the kingdom with Databricks and AWS. Thanks for joining. Good to see you again. >> Thanks for having us back. >> Yeah, John, great to be here. >> So I feel like we're at re:Invent 2013. Small event, no stage, but there's a real shift happening with procurement. Obviously it's a no-brainer on the micro, you know, people should be buying online. Self-service, cloud scale. But Amazon's got billions being sold through their marketplace. They've reorganized their partner network. You can see kind of what's going on. They've kind of figured it out. Like, let's put everything together and simplify and make it less of a website marketplace. Merge our partner organizations, have more synergy and frictionless experiences so everyone can make more money and customers are going to be happier. >> Yeah, that's right. >> I mean, you're running the relationship. You're in the middle of it. >> Well, Amazon's mental model here is that they want the world's best ISVs to operate on AWS so that we can collaborate and co-architect on behalf of customers. And that's exactly what the APO and marketplace allow us to do, is to work with Amazon on these really, you know, unique use cases. >> You know, I interviewed Ali many times over the years. I remember many years ago, maybe six, seven years ago, we were talking. He's like, "we're all in on AWS."
Obviously now the success of Databricks, you've got multiple clouds, see that. Customers have choice. But I remember the strategy early on. It was like, we're going to be deep. So this speaks volumes to the relationship you have. Years. Jack, take us through the relationship that Databricks has with AWS from a partner perspective. Joel, and from a product perspective. Because it's not like you guys are Johnny-come-latelies, new to the scene. >> Right. >> You've been there, almost present at the creation of this wave. What's the relationship and how does it relate to what's going on today? >> So most people may not know that Databricks was born on AWS. We actually did our first $100 million of revenue on Amazon. And today we're obviously available on multiple clouds. But we're very fond of our Amazon relationship. And when you look at what the APN allows us to do, you know, we're able to expand our reach and co-sell with Amazon, and marketplace broadens our reach. And so, we think of marketplace in three different aspects. We've got the marketplace private offer business, which we've been doing for a number of years. Matter of fact, we were driving well over a hundred percent year over year growth in private offers. And we have a nine-figure business. So it's a very significant business. And when a customer uses a private offer, that private offer counts against their private pricing agreement with AWS. So they get pricing power against their private pricing. So it's really important it goes on their Amazon bill. In May we launched our pay-as-you-go, on-demand offering. And in five short months, we have well over a thousand subscribers. And what this does, is it really reduces the barriers to entry. It's low friction. So anybody in an enterprise or startup or public sector company can start to use Databricks on AWS, in a consumption-based model, and have it go against their monthly bill. And so we see customers, you know, doing rapid experimentation, pilots, POCs.
They're really learning the value of that first use case. And then we see rapid use case expansion. And the third aspect is the consulting partner private offer, CPPO. Super important in how we involve our partner ecosystem of our consulting partners and our resellers that are able to work with Databricks on behalf of customers. >> So you got the big contracts with the private offer. You got the product market fit, kind of people iterating with data, coming in with the buyers you get. And obviously the integration piece all fitting in there. >> Exactly. >> Okay, so those are the offers, that's current, what's in marketplace today. Is that the products... What are people buying? >> Yeah. >> I mean, I guess what's the... Joel, what are people buying in the marketplace? And what does it mean for them? >> So fundamentally what they're buying is the ability to take silos out of their organization. And that is the problem that Databricks is out there to solve. Which is, when you look across your data landscape today, you've got unstructured data, you've got structured data, you've got real-time streaming data. And your teams are trying to use all of this data to solve really complicated problems. And as Databricks, as the Lakehouse company, what we're helping customers do is, how do they get into the new world? How do they move to a place where they can use all of that data across all of their teams? And so we allow them to begin to find, through the marketplace, those rapid adoption use cases where they can get rid of these data warehousing, data lake silos they've had in the past. Get their unstructured and structured data onto one data platform, an open data platform, that is no longer tied to any proprietary formats and standards and something they can very easily integrate into the rest of their data environment. Apply one common data governance layer on top of that.
So that from the time they ingest that data, to the time they use that data, to the time they share that data, inside and outside of their organization, they know exactly how it's flowing. They know where it came from. They know who's using it. They know who has access to it. They know how it's changing. And then with that common data platform, with that common governance solution, they're able to bring all of those use cases together. Across their real-time streaming, their data engineering, their BI, their AI. All of their teams working on one set of data. And that lets them move really, really fast. And it also lets them solve challenges they just couldn't solve before. A good example of this, you know, one of the world's now largest data streaming platforms runs on Databricks with AWS. And if you think about what does it take to set that up? Well, they've got all this customer data that was historically inside of data warehouses. That they have to understand who their customers are. They have all this unstructured data they've built their data science models around, so they can do the right kinds of recommendation engines and forecasting. And then they've got all this streaming data going back and forth between clickstream data, from what the customers are doing with their platform, and the recommendations they want to push back out. And if those teams were all working in individual silos, building these kinds of platforms would be extraordinarily slow and complex. But by building it on Databricks, they were able to release it in record time and have grown at a record pace to now be the number one platform. >> And this product, it's impacting product development. >> Absolutely. >> I mean, this is like the difference between lagging months of product development, to like days. >> Yes. >> Pretty much what you're getting at. >> Yes. >> So total agility. >> Mm-hmm. >> I got that.
Okay, now, I'm a customer, I want to buy in the marketplace, but you've got a direct sales force up there. So how do you guys look at this? Is there channel conflict? Are there comp programs? Because one of the things I heard today on the stage from AWS's leadership, Chris, was up there speaking, and Mona was, "Hey, he's a CRO conference chief revenue officer" conversation. Which means someone's getting compensated. So, if I'm the sales rep at Databricks, what's my motion to the customer? Do I get paid? Does Amazon sell it? Take us through that. Is there channel conflict? Or, how do you handle it? >> Well, I'd add to what Joel just talked about with, you know, the solution, the value of the solution. Our entire offering is available on AWS Marketplace. So it's not a subset, it's the entire Databricks offering. And- >> The flagship, all the, the top stuff. >> Everything, the flagship, the complete offering. So it's not segmented. It's not a sub-segment. >> Okay. >> It's, you know, you can use all of our different offerings. Now when it comes to seller compensation, we view this two different ways, right? One is that AWS is also incented, right? Versus selling a native service, to recommend Databricks for the right situation. Same thing with Databricks, our sales force wants to do the right thing for the customer. If the customer wants to use marketplace as their procurement vehicle. And that really helps customers because if you get Databricks and five other ISVs together, and let's say each ISV is spending, you're spending a million dollars. You have $5 million of spend. You put that spend through the flywheel with AWS Marketplace, and then you can use that in your negotiations with AWS to get better pricing overall. So that's how we view it. >> So customers are driving this, sounds like. >> Correct. For sure. >> So they're looking at this as saying, Hey, I'm going to just get purchasing power with all my relationships.
Because it's a solution architectural market, right? >> Yeah. It makes sense. Because most customers will have a primary and secondary cloud provider. If they can consolidate, you know, multiple ISV spend through that same primary provider, you get pricing power. >> Okay, Joel, we're going to date ourselves. At least I will. So back in the old days, (group laughter) it used to be, do a Barney deal with someone, Hey, let's go to market together. You got to get paper, you do a biz dev deal. And then you got to say, okay, now let's coordinate our sales teams, a lot of moving parts. So what you're getting at here is that the alternative for Databricks, or any company, is to go find those partners and do deals, versus now Amazon is the center point for the customer. So you can still do those joint deals, but this seems to be flipping the script a little bit. >> Well, it is, but we still have VARs and consulting partners that are doing implementation work. Very valuable work, advisory work, that can actually work with marketplace through the CPPO offering. So the marketplace allows multiple ways to procure your solution. >> So it doesn't change your business structure. It just makes it more efficient. >> That's correct. >> That's a great way to say it. >> Yeah, that's great. >> Okay. So, that's it. So that just makes it more efficient. So you guys are actually incented to point customers to the marketplace. >> Yes. >> Absolutely. >> Economically. >> Economically, it's the right thing to do for the customer. It's the right thing to do for our relationship with Amazon. Especially when it comes back to co-selling, right? Because Amazon now is leaning in with ISVs and making recommendations for, you know, an ISV solution. And our teams are working backwards from those use cases, you know, to collaborate and land them. >> Yeah. I want to get that out there. Go ahead, Joel.
>> So one of the other things I might add to that too, you know, and why this is advantageous for companies like Databricks to work through the marketplace, is it makes it so much easier for customers to deploy a solution. It's, very literally, one click through the marketplace to get Databricks stood up inside of your environment. And so if you're looking at how do I help customers most rapidly adopt these solutions in the AWS cloud, the marketplace is a fantastic accelerator to that. >> You know, it's interesting. I want to bring this up and get your reaction to it because to me, I think this is the future of procurement. So from a procurement standpoint, I mean, again, dating myself, EDI back in the old days, you know, all that craziness. Now this is all the internet, basically through the console. I get the infrastructure side, you know, spin up and provision some servers, all been good. You guys have played well there in the marketplace. But now as we get into more of what I call the business apps, and they brought this up on stage. A little nuanced. Most enterprises aren't yet there of integrating tech, on the business apps, into the stack. This is where I think you guys are a use case of success, where you guys have been successful with data integration. It's an integrator's dilemma, not an innovator's dilemma. So like, I want to integrate. So now I have integration points with Databricks, but I want to put an app in there. I want to provision an application, but it has to be built. It's not, you don't buy it. You build, you got to build stuff. And this is the nuance. What's your reaction to that? Am I getting this right? Or am I off? Because no one's going to be buying software like they used to. They buy software to integrate it. >> Yeah, no- >> Because everything's integrated. >> I think AWS has done a great job at creating a partner ecosystem, right? To give customers the right tools for the right jobs. And those might be with third parties.
Databricks is doing the same thing with our Partner Connect program, right? We've got customer partners like Fivetran and dbt that, you know, augment and enhance our platform. And so you're looking at multi-ISV architectures and all of that can be procured through the AWS Marketplace. >> Yeah. It's almost like, you know, bundling and unbundling. I was talking about this with Dave Vellante about Supercloud. Which is, why wouldn't a customer want the best solution in their architecture? Period. In its class. If someone's got API security or an API gateway. Well, you know, I don't want to be forced to buy something because it's part of a suite. And that's where you see things get suboptimized. Where someone dominates a category and they have, oh, you got to buy my version of this. >> Joel and I were talking, we were actually saying, what's really important about Databricks is that customers control the data, right? You want to comment on that? >> Yeah. I was going to say, you know, what you're pushing on there, we think is extraordinarily, you know, the way the market is going to go. Is that customers want a lot of control over how they build their data stack. And everyone's unique in what tools are the right ones for them. And so one of the, you know, philosophically, I think, really strong places Databricks and AWS have lined up, is we both take an approach that you should be able to have maximum flexibility on the platform. And as we think about the Lakehouse, one thing we've always been extremely committed to, as a company, is building the data platform on an open foundation. And we do that primarily through Delta Lake and making sure that, to Jack's point, with Databricks, the data is always in your control. And then it's always stored in a completely open format. And that is one of the things that's allowed Databricks to have the breadth of integrations that it has with all the other data tools out there.
Because you're not tied into any proprietary format, but instead are able to take advantage of all the innovation that's happening out there in the open source ecosystem. >> When you see other solutions out there that aren't as open as you guys, you guys are very open by the way, we love that too. We think that's a great strategy, but what am I foreclosing if I go with something else that's not as open? What's the customer's downside as you think about what's around the corner in the industry? Because if you believe it's going to be open, open source, which I think open source software is the software industry, and integration is a big deal. Because software's going to be plentiful. >> Sure. >> Let's face it. It's a good time to be in the software business. But cloud's booming. So what's the downside, from your Databricks perspective? You see a buyer clicking on Databricks versus that alternative. What potentially should they be nervous about, down the road, if they go with a more proprietary or locked-in approach? >> Yeah. >> Well, I think the challenge with proprietary ecosystems is you become beholden to the ability of that provider to both build relationships and convince other vendors that they should invest in that format. But you're also, then, beholden to the pace at which that provider is able to innovate. >> Mm-hmm. >> And I think we've seen lots of times over history where, you know, a proprietary format may run ahead, for a while, on a lot of innovation. But as that market control begins to solidify, that desire to innovate begins to degrade. Whereas in the open formats- >> So extract rents versus innovation. (John laughs) >> Exactly. Yeah, exactly. >> I'll say it. >> But in the open world, you know, you have to continue to innovate. >> Yeah. >> And the open source world is always innovating. If you look at the last 10 to 15 years, I challenge you to find, you know, an example where the innovation in the data and AI world is not coming from open source.
And so by investing in open ecosystems, that means you are always going to be at the forefront of what is the latest. >> You know, again, not to date myself again, but you look back at the eighties and nineties, the protocol stacks were proprietary. >> Yeah. >> You know, SNA was IBM, DECnet was Digital. You know the rest. And then TCP/IP was part of the open systems interconnect. >> Mm-hmm. >> Revolutionary (indistinct) a big part of that, as well as my school did. And so like, you know, that was, but it didn't standardize the whole stack. It stopped at IP and TCP. >> Yeah. >> But that helped interoperate, that created a nice de facto. So this is a big part of this mid-game. I call it the chessboard, you know, you got opening game and mid-game, then you get the end game. You're not there at the end game yet at cloud. But cloud- >> There's always some form of lock-in, right? Andy Jassy will address it, you know, when making a decision. But if you're going to make a decision you want to reduce- You don't want to be limited, right? So I would advise a customer that there could be limitations with a proprietary architecture. And if you look at what every customer's trying to become right now, it's an AI-driven business, right? And so it has to do with, can you get that data out of silos? Can you organize it and secure it? And then can you work with data scientists to feed those models? >> Yeah. >> In a very consistent manner. And so the tools of tomorrow, to Joel's point, will be open and we want interoperability with those tools. >> And choice matters too. And I would say that, you know, the argument for why I think Amazon is not as locked in as maybe some other clouds, is that they have to compete directly too. Redshift competes directly with a lot of other stuff. But they can't play the bundling game because the customers are getting savvy to the fact that if you try to bundle an inferior product with something else, it may not work great at all.
And they're going to be, they're onto it. This is the- >> To Amazon's credit, by having these solutions that may compete with native services in marketplace, they are providing customers with choice, low price- >> And access to the core value. Which is the hardware- >> Exactly. >> Which is their platform. Okay. So I want to get you guys' thoughts on something else I see emerging. This is, again, kind of a CUBE rumination moment. So on stage, Chris unpacked a lot of stuff. I mean this marketplace, they're touching a lot of hot buttons here, you know, pricing, compensation, workflows, services behind the curtain. And one of those things he mentioned was, they talk about resellers or channel partners, depending upon what you talk about. We believe, Dave and I believe on theCUBE, that the entire indirect sales channel of the industry is going to be disrupted radically. Because those players were selling hardware in the old days and software. That game is going to change. You mentioned you guys have a program, let me get your thoughts on this. We believe that once this gets set up, they can play in this game and bring their services in. Which means that the old reseller channels are going to be rewritten. They're going to be refactored with these new kinds of access. Because you've got scale, you've got money and you've got product. And you got customers coming into the marketplace. So if you're like a reseller that sold computers to data centers or software, you know, a value-added reseller, or VAR, business. >> You've got to evolve. >> You got to, you got to be here. >> Yes. >> Yeah. >> How are you guys working with those partners? Because you say you have a product in your marketplace there. How do I make money if I'm a reseller with Databricks, with Amazon? Take me through that use case. >> Well I'll let Joel comment, but I think it's pretty straightforward, right? Customers need expertise. They need know-how.
When we're seeing customers do mass migrations to the cloud or Hadoop-specific migrations or data transformation implementations, they need expertise from consulting and SI partners. If those consulting and SI partners happen to resell the solution as well, well, that's another aspect of their business. But I really think it is the expertise that the partners bring to help customers get outcomes. >> Joel, channel, big opportunity for Amazon to reimagine this. >> For sure. Yeah. And I think, you know, to your comment about how do resellers take advantage of that, I think what Jack was pushing on is spot on. Which is, it's becoming more and more about the expertise you bring to the table. And not just transacting the software. But now actually helping customers make the right choices. And we're seeing, you know, both SIs begin to be able to resell solutions and finding a lot of opportunity in that. >> Yeah. And I think we're seeing traditional resellers begin to move into that SI model as well. And that's going to be the evolution that this goes. >> At the end of the day, it's about services, right? >> For sure. Yeah. >> I mean... >> You've got a great service. You're going to have high gross profits. >> Yeah. >> Managed service provider business is alive and well, right? Because there are a number of customers that want that type of a service. >> I think that's going to be a really hot, hot button for you guys. I think being the way you guys are open, this channel partner services model coming into the fold really kind of makes for kind of that Supercloud-like experience, where you guys now have an ecosystem. And that's my next question. You guys have an ecosystem going on, within Databricks. >> For sure. >> On top of this ecosystem. How does that work? This is kind of like, hasn't been written up in business school and case studies yet. This is new. What is this?
>> I think, you know, what it comes down to is, you're seeing ecosystems begin to evolve around the data platforms. And that's going to be one of the big, kind of, new horizons for us as we think about what drives ecosystems. It's going to be around, well, what's the data platform that I'm using? And then all the tools that have to encircle that to get my business done. And so I think there's, you know, absolutely ecosystems inside of the AWS business on all of AWS's services, across data analytics and AI. And then to your point, you are seeing ecosystems now arise around Databricks and its Lakehouse platform as well. As customers are looking at, well, if I'm standing these Lakehouses up and I'm beginning to invest in this, then I need a whole set of tools that help me get that done as well. >> I mean, you think about ecosystem theory, we're living a whole nother dream. And I'm not kidding. It hasn't yet been written up for business school case studies, that we're now in a whole nother connective tissue, ecology thing happening. Where you have dependencies and value proposition. Economics, connectedness. So you have relationships in these ecosystems. >> And I think one of the great things about the relationships with these ecosystems is that there's a high degree of overlap. >> Yeah. >> So you're seeing that, you know, the way that the cloud business is evolving, the ecosystem partners of Databricks are the same ecosystem partners of AWS. And so as you build these platforms out into the cloud, you're able to really take advantage of best of breed, the broadest set of solutions out there for you. >> Joel, Jack, I love it because you know what it means? The best ecosystem will win, if you keep it open. >> Sure, sure. >> You can see everything. If you're going to do it in the dark, you know, you don't know the outcome. I mean, this is really kind of what we're talking about.
>> And John, can I just add that when I was at Amazon, we had a theory that there's buyers and builders, right? There's very innovative companies that want to build things themselves. We're seeing now that builders want to buy a platform. Right? >> Yeah. >> And so there's a platform decision being made and that ecosystem is going to evolve around the platform. >> Yeah, and I totally agree. And the word innovation gets kicked around. That's why, you know, when we had our Supercloud panel, it was called the innovator's dilemma, with a slash through it, called the integrator's dilemma. Innovation is the digital transformation. So- >> Absolutely. >> Like that becomes cliche in a way, but it really becomes more of a, are you open? Are you integrating? If APIs are connective tissue, what's automation, what do the service messages look like? I mean, a whole nother set of, kind of thinking, goes on in these new ecosystems and these new products. >> And that thinking has been born in Delta Sharing, right? So the idea that you can have a multi-cloud implementation of Databricks, and actually share data between those two different clouds, that is the next layer on top of the native cloud solution. >> Well, Databricks has done a good job of building on top of the goodness of, and the CapEx gift from, AWS. But you guys have done a great job taking that, building differentiation into the product. You guys have a great customer base, a great growing ecosystem. And again, I think a shining example of what every enterprise is going to do. Build on top of something, operating model, get that operating model driving revenue. >> Mm-hmm. >> Yeah. >> Whether you're Goldman Sachs or Capital One or XYZ Corporation. >> S&P Global, Nasdaq. >> Yeah. >> We've got, you know, the biggest verticals in the world are solving tough problems with Databricks.
I think we'd be remiss because if Ali was here, he would really want to thank Amazon for all of the investments across all of the different functions. Whether it's the relationship we have with our engineering and service teams. Our marketing teams, you know, product development. And we're going to be at re:Invent. A big presence at re:Invent. We're looking forward to seeing you there, again. >> Yeah. We'll see you guys there. Yeah. Again, good ecosystem. I love the ecosystem evolutions happening. This NextGen cloud is here. We're seeing this evolve, kind of new economics, new value propositions kind of scaling up. Producing more. So you guys are doing a great job. Thanks for coming on theCUBE and taking the time. Joel, great to see you at the check. >> Thanks for having us, John. >> Okay. CUBE coverage here. The world's changing as APN comes together with the marketplace for a new partner organization at Amazon Web Services. theCUBE's got it covered. This should be a very big, growing ecosystem as this continues. Billions being sold through the marketplace. And of course the buyers are happy as well. So we've got it all covered. I'm John Furrier, your host of theCUBE. Thanks for watching. (upbeat music)

Published Date : Oct 10 2022


ENTITIES

Entity	Category	Confidence
David Nicholson	PERSON	0.99+
Chris	PERSON	0.99+
Lisa Martin	PERSON	0.99+
Joel	PERSON	0.99+
Jeff Frick	PERSON	0.99+
Peter	PERSON	0.99+
Mona	PERSON	0.99+
Dave Vellante	PERSON	0.99+
David Vellante	PERSON	0.99+
Keith	PERSON	0.99+
AWS	ORGANIZATION	0.99+
Jeff	PERSON	0.99+
Kevin	PERSON	0.99+
Joel Minick	PERSON	0.99+
Andy	PERSON	0.99+
Ryan	PERSON	0.99+
Cathy Dally	PERSON	0.99+
Patrick	PERSON	0.99+
Greg	PERSON	0.99+
Rebecca Knight	PERSON	0.99+
Stephen	PERSON	0.99+
Kevin Miller	PERSON	0.99+
Marcus	PERSON	0.99+
Dave Alante	PERSON	0.99+
Eric	PERSON	0.99+
Amazon	ORGANIZATION	0.99+
two	QUANTITY	0.99+
Dan	PERSON	0.99+
Peter Burris	PERSON	0.99+
Greg Tinker	PERSON	0.99+
Utah	LOCATION	0.99+
IBM	ORGANIZATION	0.99+
John	PERSON	0.99+
Raleigh	LOCATION	0.99+
Brooklyn	LOCATION	0.99+
Carl Krupitzer	PERSON	0.99+
Lisa	PERSON	0.99+
Lenovo	ORGANIZATION	0.99+
JetBlue	ORGANIZATION	0.99+
2015	DATE	0.99+
Dave	PERSON	0.99+
Angie Embree	PERSON	0.99+
Kirk Skaugen	PERSON	0.99+
Dave Nicholson	PERSON	0.99+
2014	DATE	0.99+
Simon	PERSON	0.99+
United	ORGANIZATION	0.99+
Stu Miniman	PERSON	0.99+
Southwest	ORGANIZATION	0.99+
Kirk	PERSON	0.99+
Frank	PERSON	0.99+
Patrick Osborne	PERSON	0.99+
1984	DATE	0.99+
China	LOCATION	0.99+
Boston	LOCATION	0.99+
California	LOCATION	0.99+
Singapore	LOCATION	0.99+

Jack Andersen & Joel Minnick, Databricks | AWS Marketplace Seller Conference 2022

>>Welcome back, everyone, to theCUBE's coverage here in Seattle, Washington, at the AWS Marketplace Seller Conference. The big news is the Amazon Partner Network combining with Marketplace to form the Amazon Partner Organization, part of a big reorg as they grow to the next level: next-gen cloud, mid-game on the chessboard. theCUBE's got it covered. I'm John Furrier, host of theCUBE, with great guests here from Databricks, both CUBE alumni: Jack Andersen, GM and VP of the Databricks partnership team for AWS, you handle that relationship, and Joel Minnick, Vice President of Product and Partner Marketing. You guys have the keys to the kingdom with Databricks and AWS. Thanks for joining. Good to see you again. >>Thanks for having us back. Yeah, John, great to be here. >>So I feel like we're at re:Invent 2013: small event, no stage, but there's a real shift happening with procurement. Obviously it's a no-brainer on the micro, you know, people should be buying online, self-service, cloud scale, but Amazon's got billions being sold through their marketplace. They've reorganized their partner network. You can see kind of what's going on. They've kind of figured it out. Like, let's put everything together and simplify, make it less of a website marketplace, and merge our partner network to have more synergy and frictionless experiences, so everyone can make more money and customers are gonna be happier. >>Yeah, that's right. >>I mean, you run the relationship. You're in the middle of it. >>Well, Amazon's mental model here is that they want the world's best ISVs to operate on AWS so that we can collaborate and co-architect on behalf of customers. And that's exactly what the APO and Marketplace allow us to do: work with Amazon on these really, you know, unique use cases. >>You know, I've interviewed Ali many times over the years. I remember many years ago, I think six, maybe seven years ago, we were talking. He's like, "We're all in on AWS." Obviously.
Now, with the success of Databricks, you've got multiple clouds, so customers have choice, but I remember the strategy early on. It was like, we're gonna be deep. So this speaks volumes to the relationship you've had over the years. Jack, take us through the relationship that Databricks has with AWS from a partner perspective, and Joel, from a product perspective, because it's not like you're a Johnny-come-lately, new to the scene, right? You've been there almost since the creation of this wave. What's the relationship, and how does it relate to what's going on today? >>So most people may not know that Databricks was born on AWS. We actually did our first 100 million of revenue on Amazon. And today we're obviously available on multiple clouds, but we're very fond of our Amazon relationship. And when you look at what the APN allows us to do, you know, we're able to expand our reach and co-sell with Amazon, and Marketplace broadens our reach. And so we think of Marketplace in three different aspects. We've got the Marketplace private offer business, which we've been doing for a number of years. Matter of fact, we're driving well over a hundred percent year-over-year growth in private offers, and we have a nine-figure business. So it's a very significant business. And when a customer uses a private offer, that private offer counts against their private pricing agreement with AWS. So they get pricing power against their private pricing. >>So it's really important. It goes on their Amazon bill. In May, we launched our pay-as-you-go, on-demand offering. And in five short months, we have well over a thousand subscribers. And what this does is it really reduces the barriers to entry. It's low friction. So anybody in an enterprise or startup or public sector company can start to use Databricks on AWS, pay on a consumption-based model, and have it go against their monthly bill.
And so we see customers, you know, doing rapid experimentation, pilots, POCs. They're really learning the value of that first use case. And then we see rapid use case expansion. And the third aspect is the Consulting Partner Private Offers, CPPO, super important in how we involve our partner ecosystem of consulting partners and resellers that are able to work with Databricks on behalf of customers. >>So you've got the big contracts with the private offer. You've got the product-market fit, kind of people iterating with data coming in with the pay-as-you-go. And obviously the integration piece all fitting in there. >>Exactly. Exactly. >>Okay. So those are the offers that are current and what's in Marketplace today. Is that the products? What are people buying? I mean, I guess, Joel, what are people buying in the marketplace, and what does it mean for them? >>So fundamentally what they're buying is the ability to take silos out of their organization. And that is the problem that Databricks is out there to solve, which is: when you look across your data landscape today, you've got unstructured data, you've got structured data, you've got real-time streaming data, and your teams are trying to use all of this data to solve really complicated problems. And as Databricks, as the lakehouse company, what we're helping customers do is get into the new world. How do they move to a place where they can use all of that data across all of their teams? And so we allow them to begin to find, through the marketplace, those rapid-adoption use cases where they can get rid of the data warehousing and data lake silos they've had in the past and get their unstructured and structured data onto one data platform, an open data platform that is no longer adherent to any proprietary formats and standards.
It's also something they can very easily integrate into the rest of their data environment, and they can apply one common data governance layer on top of that, so that from the time they ingest that data, to the time they use that data, to the time they share that data inside and outside of their organization, they know exactly how it's flowing. They know where it came from. They know who's using it. They know who has access to it. They know how it's changing. And then with that common data platform, with that common governance solution, they're able to bring all of those use cases together across their real-time streaming, their data engineering, their BI, their AI, all of their teams working on one set of data. And that lets them move really, really fast. And it also lets them solve challenges they just couldn't solve before. A good example of this, you know: one of the world's now largest data streaming platforms runs on Databricks with AWS. >>And if you think about what it takes to set that up, well, they've got all this customer data that was historically inside of data warehouses, which they need in order to understand who their customers are. They have all this unstructured data they've built their data science models around, so they can do the right kinds of recommendation engines and forecasting. And then they've got all this streaming data going back and forth between clickstream data from what the customers are doing with their platform and the recommendations they wanna push back out. And if those teams were all working in individual silos, building these kinds of platforms would be extraordinarily slow and complex, but by building it on Databricks, they were able to release it in record time and have grown at record pace. >>That's a product platform that's impacting product development. >>Absolutely. >>I mean, this is like the difference between lagging months of product development to, like, days. >>Yes. >>That's pretty much what you're getting at. >>Yeah. >>So total agility.
I got that. Okay. Now I'm a customer, I wanna buy in the marketplace, but you've also got a direct sales force out there. So how do you guys look at this? Is there channel conflict? Are there comp programs? Because one of the things I heard today on the stage from AWS's leadership, Chris was up there speaking, and my thought in the moment was, hey, this is a CRO conference, a chief revenue officer conversation, which means someone's getting compensated. So if I'm the sales rep at Databricks, what's my motion to the customer? Do I get paid? Does Amazon sell it? Take us through that. Is there channel conflict? Is there or an audio lift? >>Well, I'd add to what Joel just talked about with, you know, the value of the solution: our entire offering is available on AWS Marketplace. So it's not a subset, it's the entire Databricks offering. >>The flagship, all the top. >>Everything, the flagship, the complete offering. So it's not segmented. It's not a sub-segment. You know, you can use all of our different offerings. Now, when it comes to seller compensation, we view this two different ways, right? One is that AWS is also incented, right, versus selling a native service, to recommend Databricks for the right situation. Same thing with Databricks: our sales force wants to do the right thing for the customer if the customer wants to use Marketplace as their procurement vehicle. And that really helps customers, because if you get Databricks and five other ISVs together, and let's say with each ISV you're spending a million dollars, you have $5 million of spend. You put that spend through the flywheel with AWS Marketplace, and then you can use that in your negotiations with AWS to get better pricing overall. So that's how we do it. >>So customers are driving this, sounds like. >>Correct. For sure.
>>So they're looking at this as saying, hey, I'm gonna just get purchasing power with all my relationships, because it's a solution architectural market, right? >>Yeah. It makes sense. Because most customers will have a primary and secondary cloud provider, and if they can consolidate, you know, multiple ISV spend through that same primary provider, they get pricing power. >>Okay, Joel, we're gonna date ourselves. At least I will. So back in the old days, it used to be: do a Barney deal with someone, hey, let's go to market together. You gotta get paper, you do a biz dev deal. And then you gotta say, okay, now let's coordinate our sales teams. A lot of moving parts. So what you're getting at here is that the alternative for Databricks or any company is to go find those partners and do deals, versus now Amazon is the center point for the customer, so that you can still do those joint deals. But this seems to be flipping the script a little bit. >>Well, it is, but we still have VARs and consulting partners that are doing implementation work, very valuable work, advisory work, and they can actually work with Marketplace through the CPPO offering. So the marketplace allows multiple ways to procure your solution. >>So it doesn't change your business structure. It just makes it more efficient. >>That's correct. >>That's a great way to say it. >>Yeah, that's great. So that's it. It just makes it more efficient. So you guys are actually incented to point customers to the marketplace. >>Yes. >>Absolutely. Economically. >>Yeah. Economically, it's the right thing to do for the customer. It's the right thing to do for our relationship with Amazon, especially when it comes back to co-selling, right? Because Amazon now is leaning in with ISVs and making recommendations for, you know, an ISV solution, and our teams are working backwards from those use cases, you know, to collaborate and land them. >>Yeah. I wanna get that out there. Go ahead, Joel.
>>So one of the other things I might add to that too, and why this is advantageous for companies like Databricks to work through the marketplace, is that it makes it so much easier for customers to deploy a solution. It's very literally one click through the marketplace to get Databricks stood up inside of your environment. And so if you're looking at how do I help customers most rapidly adopt these solutions in the AWS cloud, the marketplace is a fantastic accelerator to that. >>You know, it's interesting. I wanna bring this up and get your reaction to it, because to me, I think this is the future of procurement. So from a procurement standpoint, I mean, again, dating myself, EDI back in the old days, you know, all that craziness. Now this is all the internet, basically through the console. I get the infrastructure side, you know, spin up and provision some servers, all been good. You guys have played well there in the marketplace. But now as we get into more of what I call the business apps, and they brought this up on stage, there's a little nuance: most enterprises aren't yet there on integrating tech, on the business apps, into the stack. This is where I think you guys are a use case of success, where you guys have been successful with data integration. It's an integrator's dilemma, not an innovator's dilemma. So, like, I want to integrate, so now I have integration points with Databricks, but I want to put an app in there. I want to provision an application, but it has to be built. You don't buy it, you build it. You gotta build stuff. And this is the nuance. What's your reaction to that? Am I getting this right, or am I off? Because no one's gonna be buying software like they used to. They buy software to integrate it. >>Yeah. >>Cause everything's integrated. >>I think AWS has done a great job at creating a partner ecosystem, right, to give customers the right tools for the right jobs.
And those might be with third parties. Databricks is doing the same thing with our Partner Connect program, right? We've got partners like Fivetran and dbt that, you know, augment and enhance our platform. And so you're looking at multi-ISV architectures, and all of that can be procured through the AWS Marketplace. >>Yeah. It's almost like, you know, bundling and unbundling. I was talking about this with Dave Vellante about Supercloud, which is: why wouldn't a customer want the best solution in their architecture, period, and in its class? If someone's got API security or an API gateway, well, you know, I don't wanna be forced to buy something because it's part of a suite. And that's where you see things get suboptimized, where someone dominates a category and they say, oh, you gotta buy my version of this. >>Yeah. Joel and I were talking, and we were actually saying that what's really important about Databricks is that customers control the data. Right? You wanna comment on that? >>Yeah. I'd say what you're pushing on there, we think, is extraordinarily important. The way the market is gonna go is that customers want a lot of control over how they build their data stack, and everyone's unique in what tools are the right ones for them. And so one of the places where, philosophically, I think Databricks and AWS have really strongly lined up is that we both take the approach that you should be able to have maximum flexibility on the platform. And as we think about the lakehouse, one thing we've always been extremely committed to as a company is building the data platform on an open foundation. And we do that primarily through Delta Lake, making sure that, to Jack's point, with Databricks the data is always in your control, and it's always stored in a completely open format.
And that is one of the things that's allowed Databricks to have the breadth of integrations that it has with all the other data tools out there, because you're not tied into any proprietary format, but instead are able to take advantage of all the innovation that's happening out there in the open source ecosystem. >>When you see other solutions out there that aren't as open as you guys, and you guys are very open, by the way, we love that too, we think that's a great strategy, what am I foreclosing if I go with something else that's not as open? What's the customer's downside as you think about what's around the corner in the industry? Cuz if you believe it's gonna be open, open source, and I think open source software is the software industry, then integration is a big deal, cuz software's gonna be plentiful. Let's face it, it's a good time to be in the software business, and cloud's booming. So what's the downside from your Databricks perspective? You see a buyer clicking on Databricks versus that alternative. What should they be nervous about down the road if they go with a more proprietary or locked-in approach? >>Well, I think the challenge with proprietary ecosystems is you become beholden to the ability of that provider to both build relationships and convince other vendors that they should invest in that format. But you're also then beholden to the pace at which that provider is able to innovate. And I think we've seen lots of times over history where, you know, a proprietary format may run ahead for a while on a lot of innovation, but as that market control begins to solidify, that desire to innovate begins to degrade, whereas in the open format... >>So, extracting rents versus innovation. >>Exactly. Yeah, exactly. But I'll say, in the open world, you know, you have to continue to innovate. And the open source world is always innovating.
If you look at the last 10 to 15 years, I challenge you to find, you know, an example where the innovation in the data and AI world is not coming from open source. And so by investing in open ecosystems, that means you are always going to be at the forefront of what is the latest. >>You know, again, not to date myself again, but you look back at the eighties and nineties, the protocol stacks were proprietary. You know, SNA at IBM, DECnet was Digital, you know the rest. And then TCP/IP was part of the open systems interconnect, revolutionary Oly, a big part of that as well as my school did. But it didn't standardize the whole stack. It stopped at IP and TCP. But that helped interoperate, that created a nice de facto standard. So this is a big part of this mid-game. I call it the chessboard, you know: you've got the opening game and the mid-game, then you've got the end game, and we're not at the end game yet in the cloud. >>There's always some form of lock-in, right? Andy Jassy will address it, you know, when making a decision. But if you're gonna make a decision, you want to reduce it. You don't wanna be limited, right? So I would advise a customer that there could be limitations with a proprietary architecture. And if you look at what every customer's trying to become right now, it's an AI-driven business, right? And so it has to do with: can you get that data out of silos? Can you organize it and secure it? And then can you work with data scientists to feed those models in a very consistent manner? And so the tools of tomorrow, to Joel's point, will be open, and we want interoperability with those tools. >>And choice matters too. And I would say that, you know, the argument for why I think Amazon is not as locked in as maybe some other clouds is that they have to compete directly too.
Redshift competes directly with a lot of other stuff, but they can't play the bundling game, because the customers are getting savvy to the fact that if you try to bundle an inferior product with something else, it may not work great at all. And they're onto it. >>To Amazon's credit, by having these solutions in Marketplace that may compete with native services, they are providing customers with choice. >>Low price and access to the core value. >>Exactly. >>Which is the hardware, which is their platform. Okay. So I wanna get you guys' thoughts on something else I see emerging. This is, again, kind of a CUBE rumination moment. So on stage, Chris unpacked a lot of stuff. I mean, with this marketplace, they're touching a lot of hot buttons here, you know: pricing, compensation, workflows, services behind the curtain. And one of the things he mentioned was resellers or channel partners, depending upon who you talk to. Dave and I believe, on theCUBE, that the entire indirect sales channel of the industry is gonna be disrupted radically, because those players were selling hardware and software in the old days, and that game is gonna change. You mentioned you guys have a program, and I want to get your thoughts on this. We believe that once this gets set up, they can play in this game and bring their services in, which means that the old reseller channels are gonna be rewritten. They're gonna be refactored with these new kinds of access. Cuz you've got scale, you've got money, you've got product, and you've got customers coming into the marketplace. So if you're like a reseller that sold computers to data centers, or software, you know, a value-added reseller or VAR business... >>You've gotta evolve. >>You gotta be here. >>Yes. >>How are you guys working with those partners? Cuz you say you have a part in your marketplace there. How do I make money?
If I'm a reseller with Databricks and Amazon, take me through that use case. >>Well, I'll let Joel comment, but I think it's pretty straightforward, right? Customers need expertise. They need know-how. When we're seeing customers do mass migrations to the cloud, or Hadoop-specific migrations, or data transformation implementations, they need expertise from consulting and SI partners. If those consulting and SI partners happen to resell the solution as well, well, that's another aspect of their business, but I really think it is the expertise that the partners bring to help customers get outcomes. >>Joel, the channel is a big opportunity for Amazon to reimagine this. >>For sure. Yeah. And I think, you know, to your comment about how resellers take advantage of that, I think what Jack was pushing on is spot on, which is that it's becoming more and more about the expertise you bring to the table: not just transacting the software, but actually helping customers make the right choices. And we're seeing, you know, SIs begin to be able to resell solutions and finding a lot of opportunity in that. And I think we're seeing traditional resellers begin to move into that SI model as well. And that's gonna be the evolution that this gets to. >>At the end of the day, it's about services, for sure. If you've got a great service, you're gonna have high gross profits. >>And I think that the managed service provider business is alive and well, right? Because there are a number of customers that want that type of a service. >>I think that's gonna be a really hot button for you guys. I think with the way you guys are open, this channel partner services model coming into the fold really makes for kind of that supercloud-like experience, where you guys now have an ecosystem. And that's my next question. You guys have an ecosystem going on within Databricks, for sure, on top of this ecosystem. How does that work?
This kinda hasn't been written up in business school case studies yet. This is new. What is this? >>I think, you know, what it comes down to is you're seeing ecosystems begin to evolve around the data platforms, and that's gonna be one of the big new horizons for us. As we think about what drives ecosystems, it's going to be: what's the data platform that I'm using, and then what are all the tools that have to encircle that to get my business done? And so I think there are, you know, absolutely ecosystems inside of the AWS business on all of AWS's services across data, analytics, and AI. And then to your point, you are seeing ecosystems now arise around Databricks and its Lakehouse platform as well, as customers are looking at it and saying, well, if I'm standing these lakehouses up and I'm beginning to invest in this, then I need a whole set of tools that help me get that done as well. >>I mean, you think about ecosystem theory, we're living a whole nother dream, and I'm not kidding. It hasn't yet been written up for business school case studies that we're now in a whole nother connective tissue, ecology thing, where you have dependencies, value proposition, economics, connectedness. So you have relationships in these ecosystems. >>And I think one of the great things about relationships with these ecosystems is that there's a high degree of overlap. So you're seeing that, you know, with the way that the cloud business is evolving, the ecosystem partners of Databricks are the same ecosystem partners of AWS. And so as you build these platforms out into the cloud, you're able to really take advantage of best of breed, the broadest set of solutions out there for you. >>Joel, Jack, I love it, because you know what it means? The best ecosystem will win if you keep it open. >>Sure. >>You can see everything. If you're gonna do it in the dark, you know, you don't know the outcome. I mean, this is really what we're talking about.
>>And John, can I just add that when I was at Amazon, we had a theory that there are buyers and builders, right? There are very innovative companies that want to build things themselves. We're seeing now that builders want to buy a platform, right? And so there's a platform decision being made, and that ecosystem is gonna evolve around the platform. >>Yeah. And I totally agree. And the word innovation gets kicked around. That's why, you know, when we had our Supercloud panel, it was called the innovator's dilemma with a slash through it, the integrator's dilemma. Innovation is the digital transformation. So, absolutely, that becomes cliche in a way, but it really becomes more of: are you open? Are you integrating? If APIs are the connective tissue, what's automation, what's the service message look like? I mean, a whole nother set of thinking goes on in these new ecosystems and these new products. >>And that thinking has been born in Delta Sharing, right? So the idea is that you can have a multi-cloud implementation of Databricks and actually share data between those two different clouds. That is the next layer on top of the native cloud solution. >>Well, Databricks has done a good job of building on top of the goodness of, and the CapEx gift from, AWS. But you guys have done a great job taking that and building differentiation into the product. You guys have a great customer base, a great, growing ecosystem. And again, I think you're a shining example of what every enterprise is going to do: build on top of something, get that operating model, and have that operating model driving revenue. >>Yeah. >>Whether you're Goldman Sachs or Capital One or XYZ Corporation. >>S&P Global, NASDAQ, right? We've got, you know, the biggest verticals in the world solving tough problems with Databricks.
I think we'd be remiss, cuz if Ali were here, he would really want to thank Amazon for all of the investments across all of the different functions, whether it's the relationship we have with the engineering and service teams, the marketing teams, you know, product development. And we're gonna be at re:Invent, with a big presence at re:Invent. We're looking forward to seeing you there again. >>Yeah. We'll see you guys there. Again, good ecosystem. I love the ecosystem evolution that's happening. This next-gen cloud is here. We're seeing this evolve: new economics, new value propositions, kind of scaling up, producing more. So you guys are doing a great job. Thanks for coming on theCUBE and taking the time. Joel, great to see you, and Jack. >>Thanks for having us. >>Thanks. >>Okay. CUBE coverage here. The world's changing as the APN comes together with the marketplace for a new partner organization at Amazon Web Services. theCUBE's got it covered. This should be a very big, growing ecosystem as this continues, with billions being sold through the marketplace. And of course the buyers are happy as well. So we've got it all covered. I'm John Furrier, your host of theCUBE. Thanks for watching.

Published Date : Sep 21 2022

SUMMARY :

Thanks for good to see you again. Yeah, John, great to be here. Obviously it makes it's a no brainer on the micro, you know, You're in the middle of it. you know, unique use cases. So this is speaks volumes to the, the relationship you have years. And when you look at what the APN allows us to do, And so we see customers, you know, doing rapid experimentation pilots, POCs, So you got the big contracts with the private offer. And that's, that is the problem that data bricks is out there to solve, They just couldn't solve before a good example of this, you know, And if you think about what does it take to set that up? So how do you guys look at this? Well, I I'd add what Joel just talked about with, with, you know, what the solution, the value of the solution our entire offering And that really helps customers because if you get data bricks So they're looking at this as saying, you know, multiple ISV spend through that same primary provider, you get pricing And then you gotta say, okay, now let's coordinate our sales teams, a lot of moving parts. So the marketplace allows multiple ways to procure your So it doesn't change your business structure. Yeah, So you guys are actually incented to Yeah. It's the right thing to do for our relationship with Amazon, So one of the other things I might add to that too, you know, and why this is advantageous for, I get the infrastructure side, you know, spin up and provision. you know, augment and enhance our platform. you know, I don't wanna be forced to buy something because it's part of a suite and the data. And that is one of the things that's allowed data bricks to have the breadth of integrations that it has with When you see other solutions out there that aren't as open as you guys, you guys are very open by the I think the challenge with proprietary ecosystems is you become beholden to the Exactly. I'll say it in the open world, you know, you have to continue to innovate. 
I call it the chessboard, you know, you got opening game and mid game. And so it has to do with, can you get that data outta silos? And I would say that, you know, the argument for why I think Amazon Price and access to the S and access to the core value. So I wanna get you guys thought on something else. You gotta, you gotta be here. If those consulting SI partners happen to resell the solution as well. And we're seeing, you know, both SI begin to be This gets at the end of the day. I think that the managed service provider business is alive and well, right? I think being the way you guys are open this channel I think, you know, what it comes down to is you're seeing ecosystems begin to evolve around So you have relationships in And so as you build these platforms out into the cloud, you're able to really take advantage you don't know the outcome. And John, can I just add that when I was in Amazon, we had a, a theory that there's buyers and builders, That's why, you know, when we had our super cloud panel So the idea that you can have a multi-cloud implementation of data bricks, and actually share data But you guys have done a great job taking that building differentiation into the product. We're looking forward to seeing you there again. Great to see you at the check.

ENTITIES

Entity	Category	Confidence
Chris	PERSON	0.99+
Joel Minick	PERSON	0.99+
AWS	ORGANIZATION	0.99+
Amazon	ORGANIZATION	0.99+
John	PERSON	0.99+
Joel	PERSON	0.99+
Ali	PERSON	0.99+
Jack Anderson	PERSON	0.99+
Dave	PERSON	0.99+
$5 million	QUANTITY	0.99+
Jack	PERSON	0.99+
two	QUANTITY	0.99+
Goldman Sachs	ORGANIZATION	0.99+
XYZ	ORGANIZATION	0.99+
Joel Minnick	PERSON	0.99+
Jack Andersen	PERSON	0.99+
Andy jazzy	PERSON	0.99+
third aspect	QUANTITY	0.99+
John fur	PERSON	0.99+
NASDAQ	ORGANIZATION	0.99+
Barney	ORGANIZATION	0.99+
both	QUANTITY	0.99+
five short months	QUANTITY	0.99+
One	QUANTITY	0.99+
APO	ORGANIZATION	0.99+
today	DATE	0.99+
IBM	ORGANIZATION	0.99+
first 100 million	QUANTITY	0.98+
tomorrow	DATE	0.98+
one	QUANTITY	0.98+
billions	QUANTITY	0.98+
Johnny	PERSON	0.97+
Davis	PERSON	0.97+
a million dollars	QUANTITY	0.96+
Salesforce	ORGANIZATION	0.96+
data bricks	ORGANIZATION	0.95+
each ISV	QUANTITY	0.95+
Seattle, Washington	LOCATION	0.95+
two different ways	QUANTITY	0.95+
one data platform	QUANTITY	0.95+
seven years ago	DATE	0.94+

Breaking Analysis: Further defining Supercloud w/ tech leaders VMware, Snowflake, Databricks & others

>> From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante. >> At our inaugural Supercloud 22 event, we further refined the concept of a supercloud, iterating on the definition, the salient attributes, and some examples of what is and what is not a supercloud. Welcome to this week's Wikibon CUBE Insights, powered by ETR. You know, Snowflake has always been what we feel is one of the strongest examples of a supercloud, and in this Breaking Analysis, from our studios in Palo Alto, we unpack our interview with Benoit Dageville, co-founder and president of products at Snowflake, and we test our supercloud definition on the company's Data Cloud platform. And we're really looking forward to your feedback. First, let's examine how we define supercloud. And very importantly, one of the goals of Supercloud 22 was to get the community's input on the definition and iterate on previous work. Supercloud is an emerging computing architecture that comprises a set of services which are abstracted from the underlying primitives of hyperscale clouds. We're talking about services such as compute, storage, networking, security, and other native tooling like machine learning and developer tools, to create a global system that spans more than one cloud. Supercloud, as shown on this slide, has five essential properties, x number of deployment models, and y number of service models. We're looking for community input on x and y, and on the first point as well, so please weigh in and contribute. Now, we've identified these five essential elements of a supercloud. Let's talk about these. First, the supercloud has to run its services on more than one cloud, leveraging the cloud-native tools offered by each of the cloud providers. The builder of the supercloud platform is responsible for optimizing the underlying primitives of each cloud and optimizing for the specific needs, be it cost or performance or latency or governance, data
sharing, security, etc. But those primitives must be abstracted such that a common experience is delivered across the clouds for both users and developers. The supercloud has a metadata intelligence layer that can maximize efficiency for the specific purpose of the supercloud, i.e., the purpose that the supercloud is intended for, and it does so in a federated model. And it includes what we call a super PaaS. This is a prerequisite that is a purpose-built component and enables ecosystem partners to customize and monetize incremental services while at the same time ensuring that the common experiences exist across clouds. Now, in terms of deployment models, we'd really like to get more feedback on this piece, but here's where we are so far, based on the feedback we got at Supercloud 22. We see three deployment models. The first is one where a control plane may run on one cloud but supports data plane interactions with more than one other cloud. The second model instantiates the supercloud services on each individual cloud and within regions, and can support interactions across more than one cloud with a unified interface connecting those instantiations, those instances, to create a common experience. And the third model superimposes its services as a layer, or in the case of Snowflake they call it a mesh, on top of the cloud providers' region or regions, with a single global instantiation of those services which spans multiple cloud providers. This is our understanding, from a conversation with Benoit Dageville, as to how Snowflake approaches its solutions. And for now we're going to park the service models. We need more time to flesh that out, and we'll propose something shortly for you to comment on. Now, we peppered Benoit Dageville at Supercloud 22 to test how the Snowflake Data Cloud aligns to our concepts and our definition. Let me also say that Snowflake doesn't use the term supercloud. They really want to respect, and they
don't want to denigrate, the importance of their hyperscale partners, nor do we. But we do think the hyperscalers, today anyway, are not building what we call superclouds. But people who are building superclouds are building on top of hyperscale clouds; that is a prerequisite. So here are the questions that we tested with Snowflake. First question: how does Snowflake architect its Data Cloud, and what is its deployment model? Listen to Dageville talk about how Snowflake has architected a single system. Play the clip. >> There are several ways to do this, you know, supercloud, as you name them. The way we picked is to create, you know, one single system, and that's very important, right? There are several ways, right? You can instantiate, you know, your solution in every region of a cloud, and, you know, potentially that region could be AWS, that region could be GCP, so you are indeed a multi-cloud solution. But Snowflake, we did it differently. We are really creating cloud regions which are superposed on top of the cloud provider, you know, region, infrastructure region. So we are building our regions. But where it's very different is that each region of Snowflake is not one instantiation of our service. Our service is global by nature. We can move data from one region to the other. When you land in Snowflake, you land into one region, but you can grow from there, and you can, you know, exist in multiple clouds at the same time. And that's very important, right? It's not, I mean, different instantiations of a system; it is one single instantiation which covers many cloud regions and many cloud providers. >> Snowflake chose the most advanced level of our three deployment models, as Dageville talked about, presumably so it could maintain maximum control and ensure that common experience, like the iPhone model. Next, we probed about the technical enablers of the Data Cloud. Listen to Dageville talk about Snowgrid. He uses the term
mesh, and then this can get confusing with Zhamak Dehghani's data mesh concept, but listen to Benoit's explanation. >> Well, as I said, you know, first we start by building, you know, Snowflake regions. We have today 30 regions that span, you know, the world, so it's a worldwide system with many regions. But all these regions are connected together. They are, you know, meshed together with our technology; we name it Snowgrid. And that makes it hard, because, you know, an Azure region can talk to an AWS region or GCP regions, and as a user of our cloud, you don't really see these regional differences, that, you know, regions are in different, you know, potentially clouds. When you use Snowflake, your presence as an organization can be in several regions, several clouds if you want, both geographic and cloud provider. >> So I can share data irrespective of the cloud, and I'm in the Snowflake Data Cloud. Is that correct? I can do that today? >> Exactly, and that's very critical, right? What we wanted is to remove data silos. And when you instantiate a system in one single region, and that system is locked in that region, you cannot communicate with other parts of the world. You are locking the data in one region, right? And we didn't want to do that. We wanted, you know, data to be distributed the way the customer wants it to be distributed across the world, and potentially sharing data at world scale. >> Now, maybe there are many ways to skin the other cat, meaning perhaps if a platform does instantiate in multiple places, there are ways to share data, but this is how Snowflake chose to approach the problem. Next question: how do you deal with latency in this big global system? This is really important to us, because while Snowflake has some really smart people working as engineers and the like, we don't think they've solved for the speed of light problem, the best people working on it, as we often joke. Listen to Benoit Dageville's comments on this
topic. >> So yes and no. The way we do it, it's very expensive to do that, because generally, if you want to join, you know, data which are in different regions and different clouds, it's going to be very expensive, because you need to move, you know, data every time you join it. So the way we do it is that you replicate the subset of data that you want to access from one region from other regions. So you can create this data mesh, but data is replicated to make it very cheap and very performant too. >> And is the Snowgrid, does that have the metadata intelligence? >> Yes. >> To actually, can you describe that a little bit? >> Yeah, Snowgrid is both a way to exchange, you know, metadata, so each region of Snowflake knows about all the other regions of Snowflake. Every time we create a new region, you know, the metadata is distributed over our Data Cloud. Not only, you know, a region knows all the regions, but it knows, you know, every organization that exists in our clouds, where this organization is, where data can be replicated by this organization. And then of course it's also used as a way to exchange data, right? So you can exchange, you know, petabyte scale of data. And I was just receiving an email from one of our customers who moved more than four petabytes of data cross-region, cross, you know, cloud providers, in, you know, a few days. And, you know, it's a lot of data, so it takes, you know, some time to move, but they were able to do that online, completely online, and switch over, you know, to the other region, which is, failover is very important also. >> So yes and no probably means typically no. He says yes and no; probably means no. So it sounds like Snowflake is selectively pulling small amounts of data and replicating it where necessary. But you also heard him talk about the metadata layer, which is one of the essential aspects of supercloud. Okay, next we dug into security. It's one of the most important issues, and we think one of the hardest parts related to
deploying supercloud. So we've talked about how the cloud has become the first line of defense for the CISO, but now with multi-cloud you have multiple first lines of defense, and that means multiple shared responsibility models and multiple tool sets from different cloud providers, and an expanded threat surface. So listen to Benoit's explanation here. Please play the clip. >> This is a great question. Security has always been the most important aspect of Snowflake since day one, right? This is the question that every customer of ours has: you know, how can you guarantee the security of my data? And so we secure data really tightly in region. We have several layers of security. It starts by encrypting every data at rest, and that's very important. A lot of customers are not doing that, right? You hear of these attacks, for example, on cloud, you know, where someone left, you know, their buckets open, and then, you know, you can access the data because it's non-encrypted. So we are encrypting everything at rest. We are encrypting everything in transit. So a region is very secure. Now, you know, from one region you never access data from another region in Snowflake. That's why also we replicate data. Now, the replication of that data across regions, or the metadata for that matter, is really highly secure. So Snowgrid ensures that everything is encrypted. Everything is, you know, we have multiple, you know, encryption keys, and it's, you know, stored in hardware, you know, secure modules. So we built, you know, Snowgrid such that it's secure and it allows very secure movement of data. >> So when we heard this explanation, we immediately went to the lowest common denominator question, meaning when you think about how AWS, for instance, deals with data in motion or data at rest, it might be different from how another cloud provider deals with it. So how about differences, for example, in the AWS maturity model for various, you know, cloud capabilities? You know, let's say they've
got a faster Nitro or Graviton. How does Snowflake deal with that? Do they have to slow everything else down, like, imagine a caravan cruising, you know, across the desert, so, you know, every truck can keep up? Let's listen. >> It's a great question. I mean, of course, our software is abstracting, you know, all the cloud providers', you know, infrastructure, so that when you run in one region, let's say AWS or Azure, it doesn't make any difference as far as the applications are concerned. And this abstraction, of course, is a lot of work, I mean, really, really a lot of work, because it needs to be secure, it needs to be performant in, you know, every cloud, and it has, you know, to expose APIs which are uniform. And, you know, cloud providers, even though they have potentially the same concept, let's say blob storage, the APIs are completely different. The way, you know, these systems are secured, it's completely different. The errors that you can get and the retry, you know, mechanism is very different from one cloud to the other. Performance is also different. We discovered that when we were starting to port our software, and, you know, we had to completely rethink how to leverage blob storage in that cloud versus that cloud, just because of performance too. So we had, you know, for example, to, you know, stripe data. So all this work is work that, you know, you don't need to do as an application, because our vision really is that applications which are running in our Data Cloud can, you know, be abstracted of all these differences. And we provide all the services, all the workloads that these applications need, whether it's transactional access to data, analytical access to data, you know, managing, you know, logs, managing, you know, metrics. All of this is abstracted too, such that they are not, you know, tied to one, you know, particular service of one cloud, and distributing this application across, you know, many regions, many clouds is very seamless. >> So from that answer, we know that Snowflake takes
care of everything, but we really don't understand the performance implications in, you know, that specific case. But we feel pretty certain that the promises that Snowflake makes around governance and security within their data sharing construct will be kept. Now, another criterion that we've proposed for supercloud is a super PaaS layer to create a common developer experience and an enabler for ecosystem partners to monetize. Please play the clip. Let's listen. >> We build it, you know, custom-built, because, as you said, you know, what exists in one cloud might not exist in another cloud provider, right? So we have to build, you know, all these components that modern applications need, and that, you know, goes to machine learning, as I say, transactional, analytical systems, and the entire thing, such that they can run in isolation, basically. >> And the objective is the developer experience will be identical across those clouds? >> Yes, right. The developer doesn't need to worry about the cloud provider. And actually, our system, we didn't talk about it, but the marketplace that we have, which allows actually to deliver... >> We're getting there. >> Yeah. >> Okay, now we're not going to go deep into ecosystem today. We've talked about Snowflake's strengths in this regard. But Snowflake, they pretty much ticked all the boxes on our supercloud attributes and definition. We asked Benoit Dageville to confirm that this is all shipping and available today, and he also gave us a glimpse of the future. Play the clip. >> And we are still developing it. You know, the transactional, you know, Unistore, as we call it, was announced at the last Summit, so they are still, you know, working properly. But that's the vision, right? And that's important, because we talk about the infrastructure, right? You mentioned a lot about storage and compute, but it's not only that, right? When you think about applications, they need to use the transactional database, they need
to use an analytical system, they need to use, you know, machine learning. So you need to provide also all these services, which are consistent across all the cloud providers. >> So you can hear Dageville talking about expanding beyond taking advantage of the core infrastructure, storage and networking, et cetera, and bringing intelligence to the data through machine learning and AI. So of course there's more to come, and there better be at this company's valuation, despite the recent sharp pullback in a tightening Fed environment. Okay, so I know it's cliche, but everyone's comparing Snowflake and Databricks. Databricks has been pretty vocal about its open source posture compared to Snowflake's, and it just so happens that we had Ali Ghodsi on at Supercloud 22 as well. He wasn't in studio; he had to do remote because I guess he's presenting at an investor conference this week, so we had to bring him in remotely. Now, I didn't get to do this interview, John Furrier did, but I listened to it and captured this clip about how Databricks sees supercloud and the importance of open source. Take a listen to Ghodsi. >> Yeah, I mean, let me start by saying we're big fans of open source. We think that open source is a force in software that's going to continue for, you know, decades, hundreds of years, and it's going to slowly replace all proprietary code in its way. We saw that, you know, it could do that with the most advanced technology. Windows, you know, a proprietary operating system, very complicated, got replaced with Linux. So open source can pretty much do anything. And what we're seeing with the data lakehouse is that slowly the open source community is building a replacement for the proprietary data warehouse, you know, data lake, machine learning, real-time stack in open source, and we're excited to be part of it. For us, Delta Lake is a very important project that really helps you standardize how you lay out your data in the cloud. And with it comes a really important protocol called Delta Sharing that
enables you, in an open way, actually for the first time ever, to share large data sets between organizations, but it uses an open protocol. So the great thing about that is you don't need to be a Databricks customer. You don't even have to like Databricks. You just need to use this open source project, and you can now securely share data sets between organizations across clouds. And it actually does so really efficiently, just one copy of the data, so you don't have to copy it if you're within the same cloud. >> So the implication of Ali Ghodsi's comments is that Databricks, with Delta Sharing, as John implied, is playing a long game. Now, I don't know enough about the Databricks architecture to comment in detail; I've got to do more research there. So I reached out to my two analyst friends, Tony Baer and Sanjeev Mohan, to see what they thought, because they cover these companies pretty closely. Here's what Tony Baer said, quote: I've viewed the divergent lakehouse strategies of Databricks and Snowflake in the context of their roots. Prior to Delta Lake, Databricks' prime focus was the compute, not the storage layer, and more specifically, they were a compute engine, not a database. Snowflake approached from the opposite end of the pool, as they originally fit the mold of the classic database company rather than a specific compute engine per se. The lakehouse pushes both companies outside of their original comfort zones: Databricks to storage, Snowflake to compute engine. So it makes perfect sense for Databricks to embrace the open source narrative at the storage layer and for Snowflake to continue its walled garden approach. But in the long run, their strategies are already overlapping. Databricks is not a 100% open source company. Its practitioner experience has always been proprietary, and now so is its SQL query engine. Likewise, Snowflake has had to open up with the support of Iceberg for open data lake format. The question really becomes how serious Snowflake will be in making Iceberg a first-class citizen in its
environment, that is, not necessarily officially branding a lakehouse, but effectively is. And likewise, can Databricks deliver the service levels associated with walled gardens through a more brute force approach that relies heavily on the query engine? At the end of the day, those are the key requirements that will matter to Databricks and Snowflake customers, end quote. That was some deep thought by Tony; thank you for that. Sanjeev Mohan added the following, quote: Open source is a slippery slope. People buy mobile phones based on open source Android, but it's not fully open. Similarly, Databricks' Delta Lake was not originally fully open source, and even today its Photon execution engine is not. We are always going to live in a hybrid world. Snowflake and Databricks will support whatever model works best for them and their customers. The big question is, do customers care as deeply about which vendor has a higher degree of openness as we technology people do? I believe customers' evaluation criteria is far more nuanced than just to decipher each vendor's open source claims, end quote. Okay, so I had to ask Dageville about their so-called walled garden approach and what their strategy is with Apache Iceberg. Here's what he said. >> Iceberg is very important. So just to give some context, Iceberg is an open, you know, table format, right, which was, you know, first developed by Netflix, and Netflix, you know, put it open source in the Apache community. So we embrace that open source standard because it's widely used by many, you know, companies, and also many companies have, you know, really invested a lot of effort in building, you know, big data Hadoop solutions or data lake solutions, and they want to use Snowflake, and they couldn't really use Snowflake because all their data were in open, you know, formats. So we are embracing Iceberg to help these companies move to the cloud. But why we have been reluctant with direct access to data: direct access to data is a
little bit of a problem for us, and the reason is, when you direct access to data, now you have direct access to storage, now you have to understand, for example, the specificity of one cloud versus the other. So as soon as you start to have direct access to data, you lose your, you know, cloud-agnostic layer. You don't access data with an API. When you have direct access to data, it's very hard to secure data, because you need to grant access, direct access, to tools which are not, you know, protected, and you see a lot of, you know, hacking of data, you know, because of that. So direct access to data is not serving our customers well, and that's why we have been reluctant to do that, because it's not cloud-agnostic. You have to code that; you need a lot of intelligence, while with APIs, access... so we want open APIs. That's, I guess, the way we embrace, you know, openness, is by open API versus, you know, you access data directly. >> Here's my take: Snowflake is hedging its bets, because enough people care about open source that they have to have some open data format options, and it's good optics. And you heard Benoit Dageville talk about the risks of directly accessing the data and the complexities it brings. Now, is that maybe a little FUD against Databricks? Maybe. But the same can be said for Ali's comments, maybe FUDing the proprietariness of Snowflake. But as both analysts pointed out, open is a spectrum. Hey, I remember Unix used to equal open systems. Okay, let's end with some ETR spending data, and why not compare Snowflake and Databricks spending profiles? This is an XY graph with net score, or spending momentum, on the y-axis and pervasiveness, or overlap in the data set, on the x-axis. This is data from the January survey, when Snowflake was holding above 80 percent net score, off the charts. Databricks was also very strong, in the upper 60s. Now let's fast forward to this next chart and show you the July ETR survey data, and
you can see Snowflake has come back down to earth. Now remember, anything above 40 net score is highly elevated, so both companies are doing well, but Snowflake is well off its highs and Databricks has come down somewhat as well. Databricks is inching to the right. Snowflake rocketed to the right post its IPO, and as we know, Databricks wasn't able to get to IPO during the COVID bubble. Ali Ghodsi is at the Morgan Stanley CEO conference this week. They've got plenty of cash to withstand a long-term recession, I'm told, and they've started the message that they're a billion dollars in annualized revenue. I'm not sure exactly what that means. I've seen some numbers on their gross margins; I'm not sure what that means. I've seen some numbers on their net retention revenue, or net revenue retention. Again, I'll reserve judgment until we see an S-1. But it's clear both of these companies have momentum and they're out competing in the market. The market, as always, will be the ultimate arbiter. Different philosophies, perhaps. Is it like Democrats and Republicans? Well, it could be, but they're both going after solving a data problem. Both companies are trying to help customers get more value out of their data, and both companies are highly valued, so they have to perform for their investors. To paraphrase Ralph Nader, the similarities may be greater than the differences. Okay, that's it for today. Thanks to the team from Palo Alto for this awesome supercloud studio build. Alex Myerson and Ken Shiffman are on production in the Palo Alto studios today. Kristin Martin and Sheryl Knight get the word out to our community. Rob Hof is our editor-in-chief over at SiliconANGLE. Thanks to all. Please check out etr.ai for all the survey data. Remember, these episodes are all available as podcasts wherever you listen; just search Breaking Analysis podcast. I publish each week on wikibon.com and siliconangle.com, and you can email me at david.vellante@siliconangle.com, or DM me @dvellante, or comment on my LinkedIn posts. And please, as I
say, ETR has got some of the best survey data in the business. We track it every quarter and are really excited to be partners with them. This is Dave Vellante for theCUBE Insights, powered by ETR. Thanks for watching, and we'll see you next time on Breaking Analysis.

Published Date : Aug 14 2022


ENTITIES

Entity	Category	Confidence
netflix	ORGANIZATION	0.99+
john furrier	PERSON	0.99+
palo alto	ORGANIZATION	0.99+
tony bear	PERSON	0.99+
boston	LOCATION	0.99+
sanji mohan	PERSON	0.99+
ken shiffman	PERSON	0.99+
both	QUANTITY	0.99+
today	DATE	0.99+
ellie gotzi	PERSON	0.99+
VMware	ORGANIZATION	0.99+
Snowflake	ORGANIZATION	0.99+
siliconangle.com	OTHER	0.99+
more than four petabytes	QUANTITY	0.99+
first point	QUANTITY	0.99+
kristin martin	PERSON	0.99+
both companies	QUANTITY	0.99+
first question	QUANTITY	0.99+
rob hoff	PERSON	0.99+
more than one	QUANTITY	0.99+
second model	QUANTITY	0.98+
alex myerson	PERSON	0.98+
third model	QUANTITY	0.98+
one region	QUANTITY	0.98+
one copy	QUANTITY	0.98+
one region	QUANTITY	0.98+
five essential elements	QUANTITY	0.98+
android	TITLE	0.98+
100	QUANTITY	0.98+
first line	QUANTITY	0.98+
Databricks	ORGANIZATION	0.98+
sheryl	PERSON	0.98+
more than one cloud	QUANTITY	0.98+
first	QUANTITY	0.98+
iphone	COMMERCIAL_ITEM	0.98+
super cloud 22	EVENT	0.98+
each cloud	QUANTITY	0.98+
each	QUANTITY	0.97+
sanjay mohan	PERSON	0.97+
john	PERSON	0.97+
republicans	ORGANIZATION	0.97+
this week	DATE	0.97+
hundreds of years	QUANTITY	0.97+
siliconangle	ORGANIZATION	0.97+
each week	QUANTITY	0.97+
data lake house	ORGANIZATION	0.97+
one single region	QUANTITY	0.97+
january	DATE	0.97+
dave vellante	PERSON	0.96+
each region	QUANTITY	0.96+
one	QUANTITY	0.96+
dave vellante	PERSON	0.96+
tony	PERSON	0.96+
above 80 percent	QUANTITY	0.95+
more than one cloud	QUANTITY	0.95+
more than one cloud	QUANTITY	0.95+
data lake	ORGANIZATION	0.95+
five essential properties	QUANTITY	0.95+
democrats	ORGANIZATION	0.95+
first time	QUANTITY	0.95+
july	DATE	0.94+
linux	TITLE	0.94+
etr	ORGANIZATION	0.94+
devellante	ORGANIZATION	0.93+
dodgeville	ORGANIZATION	0.93+
each vendor	QUANTITY	0.93+
super cloud 22	ORGANIZATION	0.93+
delta lake	ORGANIZATION	0.92+
three deployment models	QUANTITY	0.92+
first lines	QUANTITY	0.92+
dejaville	LOCATION	0.92+
day one	QUANTITY	0.92+

Ali Ghodsi, Databricks | Supercloud22

(light hearted music) >> Okay, welcome back to Supercloud '22. I'm John Furrier, host of theCUBE. We got Ali Ghodsi here, co-founder and CEO of Databricks. Ali, Great to see you. Thanks for spending your valuable time to come on and talk about Supercloud and the future of all the structural change that's happening in cloud computing. >> My pleasure, thanks for having me. >> Well, first of all, congratulations. We've been talking for many, many years, and I still go back to the video that we have in archive, you talking about cloud. And really, at the beginning of the big reboot, I called the post Hadoop, a revitalization of data. Congratulations, you've been cloud-first, now on multiple clouds. Congratulations to you and your team for achieving what looks like a billion dollars in annualized revenue as reported by the Wall Street Journal, so first, congratulations. >> Thank you so much, appreciate it. >> So I was talking to some young developers and I asked a random poll, what do you think about Databricks? Oh, we love those guys, they're AI and ML-native, and that's their advantage over the competition. So I pressed why. I don't think they knew why, but that's an interesting perspective. This idea of cloud native, AI/ML-native, ML Ops, this has been a big trend and it's continuing. This is a big part of how this change and this structural change is happening. How do you react to that? And how do you see Databricks evolving into this new Supercloud-like multi-cloud environment? >> Yeah, look, I think it's a continuum. It starts with having data, but they want to clean it, you know, and they want to get insights out of it. But then, eventually, you'd like to start asking questions, doing reports, maybe ask questions about what was my revenue yesterday, last week, but soon you want to start using the crystal ball, predictive technology. Okay, but what will my revenue be next week? Next quarter? Who's going to churn? 
And if you can finally automate that completely so that you can act on the predictions, right? So this credit card that got swiped, the AI thinks it's fraud, we're going to deny it. That's when you get real value. So we're trying to help all these organizations move through this data AI maturity curve, all the way to that, the prescriptive, automated AI machine learning. That's when you get real competitive advantage. And you know, we saw that with the FAANGs, right? I mean, Google wouldn't be here today if it wasn't for AI. You know, we'd be using AltaVista or something. We want to help all organizations to be able to leverage data and AI the way that the FAANGs did. >> One of the things we're looking at with supercloud, and why we call it supercloud versus other things like multi-cloud, is that today a lot of the successful companies that have started in the cloud have been successful, but have realized, and even enterprises who have gotten there by accident, and maybe have done nothing with cloud, have just some cloud projects on multiple clouds. So people have multiple cloud operational things going on, but it hasn't necessarily been a strategy per se. It's been more of kind of a default reaction to things. But the ones that are innovating have been successful in one native cloud, because the use cases that drove that got scale, got value, and then they're making that super by bringing it on premise, putting in a modern data stack for modern application development, and kind of dealing with the things that you guys are in the middle of with Databricks. That is where the action is, and they don't want to lose the trajectory in all the economies of scale. So we're seeing another structural change where the evolutionary nature of the cloud has solved a bunch of use cases, but now other use cases are emerging, on premises and edge, that have been driven by applications because of the developer boom that's happening. You guys are in the middle of it.
What is happening with this structural change? Are people looking for the modern data stack? Are they looking for more AI? What's your perspective on this supercloud kind of position? >> Look, it started with customers being on multiple clouds, right? So multi-cloud has been a thing. It became a thing: 70, 80% of our customers, when you ask them, they're on more than one cloud. But then they soon started realizing that, hey, you know, if I'm on multiple clouds, this data stuff is hard enough as it is. Do I want to redo it again and again with different proprietary technologies on each of the clouds? And that's when I started thinking about let's standardize this, let's figure out a way which just works across them. That's where I think open source comes in, becomes really important. Hey, can we leverage open standards, because then we can make it work in these different environments, so that we can actually go super, as you said. That's one. The second thing is, can we simplify it? You know, I think today the data landscape is complicated. Conceptually it's simple. You have data, which is essentially customer data that you have, maybe employee data. And you want to get some kind of insights from that. But how you do that is very complicated. You have to buy a data warehouse, hire data analysts. You have to store stuff in the data lake, you know, get your data engineers. If you want a streaming, real time thing, that's another completely different set of technologies you have to buy. And then you have to stitch all these together, and you have to do it again and again on every cloud. So they just want simplification. So that's why we're big believers in this data lakehouse concept, which is an open standard for simplifying this data stack and helping people to just get value out of their data in any environment. So they can do that in this sort of supercloud, as you call it.
>> You know, we've been talking about that in previous interviews: do the heavy lifting, let them get the value. I have to ask you about how you see that going forward, because if I'm a customer, I have a lot of operational challenges. 'Cause the developers are kicking butt right now. We see that clearly. Open source is growing and continues to be great. But ops and security teams, they really care about this stuff. And most companies don't want to spin up multiple ops teams to deal with different stacks. This is one big problem that I think is leading into the multi-cloud viability. How do you guys deal with that? How do you talk to customers when they say, I want to have fewer complications on operations? >> Yeah, you're absolutely right. You know, it's easy for a developer to adopt all these technologies, and new things are coming out all the time. The ops teams are the ones that have to make sure this works. Doing that in multiple different environments is super hard, especially when there's a proprietary stack in each environment that's different. So they just want standardization. They want open source, that's super important. We hear that all the time from them. They want open source technologies. They believe in the communities around it. You know, they know that the source code is open, so you can also see if there are issues with it, if there are security breaches, those kinds of things, and that they can have a community around it. So they can actually leverage that. So they're the ones that are really pushing this, and we're seeing it across the board. You know, it starts first with the digital natives, you know, but slowly it's also now percolating to the other organizations. We're hearing it across the board. >> Where are we, Ali, on the innovation strategies for customers? Where are they on the trajectory around how they're building out their teams? How are they looking at open source?
How are they extending the value proposition of Databricks, and data at scale, as they start to build out their teams and operations? Because some are kind of starting out, a crawl, walk, run kind of vibe. Some are big companies, they're dealing with data all the time. Where are they in their journey? What are the core issues that they're solving? What are some of the use cases that you see that are most pressing for customers? >> Yeah, what I've seen that's really exciting about this data lakehouse concept is that we're now seeing a lot of use cases around real time. So real time fraud detection, real time stock ticker pricing. Anyone that's doing trading, they want that to work in real time. Lots of use cases around that. Lots of use cases around how do we, in real time, drive more engagement on our web assets if we're a media company, right? We have all these assets, how do we get people to get engaged? Stay on our sites. Continue engaging with the material we have. Those are real time use cases. And the interesting thing is, A, they're real time. So, you know, it's really important: you don't want to recommend to someone, hey, you should go check out this restaurant, if they just came from that restaurant half an hour ago. So you want it to be real time. But B, it's also all based on machine learning. A lot of this is trying to predict what you want to see, what you want to do, is it fraudulent? And that's also interesting, because basically more and more machine learning is coming in. So that's super exciting to see, the combination of real time and machine learning on the Lakehouse. And finally, I would say the Lakehouse is really important for this because that's where the data is flowing in. If they have to take that data that's flowing into the lake and actually copy it into a separate warehouse, that delays the real time use cases. And then it can't hit those real time deadlines. So that's another catalyst for this Lakehouse pattern.
>> Would that be an example of how the metrics are changing? 'Cause I've been looking at some people saying, well, you can tell if someone's doing well, there's a lot of data being transferred. And then I was saying, well, wait a minute. Data transfer costs money, right? And time. So this is an interesting dynamic. In a way you don't want to have a lot of movement, right? >> Yeah, movement actually decreases for a lot of these real time use cases. 'Cause what we saw in the past was that they would run batch processing to process all the data at once. But actually, if you look at the things that have changed since the data that we had yesterday, it's actually not that much. So if you can actually incrementally process it in real time, you can actually reduce the cost of transfers and storage and processing. So that's actually a great point. That's also one of the main things that we're seeing with the use cases: the bill shrinks and the cost goes down, and they can process less. >> Yeah, and it'd be interesting to see how those KPIs evolve into industry metrics down the road around the supercloud evolution. I got to ask you about the open source concept of data platforms. You guys have been a pioneer in there doing great work, kind of picking the baton up where the Hadoop world left off, as Dave Vellante always points out. But working across clouds is super important. How are you guys looking at the ability to work across the different clouds with Databricks? Are you going to build that abstraction yourself? Does data sharing and model sharing kind of come into play there? How do you see this Databricks capability across the clouds? >> Yeah, I mean, let me start by saying, we're just big fans of open source. We think that open source is a force in software. That's going to continue for decades, hundreds of years, and it's going to slowly replace all proprietary code in its way.
We saw that it could do that with the most advanced technology. Windows, you know, a proprietary operating system, very complicated, got replaced with Linux. So open source can pretty much do anything. And what we're seeing with the data lakehouse is that slowly the open source community is building a replacement for the proprietary data warehouse, data lake, machine learning, real time stack in open source. And we're excited to be part of it. For us, Delta Lake is a very important project that really helps you standardize how you lay out your data in the cloud. And with it comes a really important protocol called Delta Sharing, that enables you, in an open way, actually for the first time ever, to share large data sets between organizations, but it uses an open protocol. So the great thing about that is you don't need to be a Databricks customer. You don't need to even like Databricks, you just need to use this open source project and you can now securely share data sets between organizations across clouds. And it actually does so really efficiently, with just one copy of the data. So you don't have to copy it if you're within the same cloud. >> So you're playing the long game on open source. >> Absolutely. I mean, this is a force, it's going to be there. If you deny it, before you know it there's going to be something like Linux that is going to be a threat to your proprietary software. >> I totally agree, by the way. I was just talking to somebody the other day, and someone made the comment, the software industry is open source. There's no more software industry, it's called open source. It's integrations that become interesting. And I was looking at integrations, that now is really where the action is. And we had a panel with the Clouderati, we called it, the people who have been around for a long time. And it was called the innovator's dilemma. And one of the comments was, it's the integrator's dilemma, not the innovator's dilemma.
And this is a big part of this piece of supercloud. Can you share your thoughts on how cloud and integration need to be tightened up to really make it super? >> Actually, that's a great point. I think the beauty of this is, look, the ecosystem of data today is vast. There's this picture that someone puts together every year of all the different vendors and how they relate, and it gets bigger and bigger and messier and messier. So we see customers use all kinds of different aspects of what's existing in the ecosystem, and they want it to be integrated in whatever you're selling them. And that's where I think the power of open source comes in. With open source, you get integrations that people will do without you having to push it. So we, Databricks as a vendor, don't have to go tell people, please integrate with Databricks. With the open source technology that we contribute to, automatically, people are integrating with it. Delta Lake has integrations with lots of different software out there, and Databricks as a company doesn't have to push that. So I think open source is also another thing that really helps with the ecosystem integrations. Many of these companies in this data space actually have employees that are full-time dedicated to: make sure our software works well with Spark, make sure our software works well with Delta. And they contribute back to that community. And that's the way you get this sort of ecosystem to further flourish. >> Well, I really appreciate your time. And my final question for you is, as we kind of unpack and shape and frame supercloud for the future, how would you see a roadmap or architecture or outcome for companies that are going to clearly be in the cloud, where open source is going to be dominating? Integrations have got to be seamless and frictionless. The abstraction layer makes things super easy and takes away the complexity. What is supercloud to them? What does the outcome look like?
How would you define a supercloud environment for an enterprise? >> Yeah, for me, it's the simplification that you get where you standardize on open source. You get your data in one place, in one format, in one standardized way, and then you can get your insights from it, without having to buy lots of different idiosyncratic proprietary software from different vendors that's different in each environment. So it's this slow standardization that's happening. And I think it's going to happen faster than we think. And I think in a couple of years it's going to be a requirement: does your software work on all these different departments? Is it based on open source? Is it using this data lakehouse pattern? And if it's not, I think they're going to demand it. >> Yeah, I feel like we're close to some sort of de facto standard coming, and you guys are a big part of it. Once that clicks in, it's going to highly accelerate in the open, and I think it's going to be super valuable. Ali, thank you so much for your time, and congratulations to you and your team. We've been following you guys since the beginning. Remember the early days, and look how far it's come. And again, you guys are really making a big difference in making a super cool environment out there. Thanks for coming on and sharing. >> Thank you so much, John. >> Okay, this is Supercloud 22. I'm John Furrier. Stay with us for more coverage and more commentary after this break. (light hearted music)
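
[Illustration, not part of the interview] Ghodsi's point above about incremental processing (little of the data changes from one day to the next, so processing only the changes cuts transfer, storage, and compute cost) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Databricks code: the `Record` type, the `updated_at` logical timestamp, and the watermark bookkeeping are all hypothetical names, and the sketch assumes append-only data where a running total is enough.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    amount: float
    updated_at: int  # hypothetical logical timestamp, increases with each new record


def full_batch_total(records):
    """Batch style: every run re-reads every record.

    Returns (total, number_of_records_processed).
    """
    return sum(r.amount for r in records), len(records)


def incremental_total(records, previous_total, watermark):
    """Incremental style: only records newer than the watermark are processed.

    The list filter below stands in for a storage layer that can hand back
    just the new records; a real system would not rescan old data to find them.
    Returns (total, new_watermark, number_of_records_processed).
    """
    new_records = [r for r in records if r.updated_at > watermark]
    total = previous_total + sum(r.amount for r in new_records)
    new_watermark = max((r.updated_at for r in new_records), default=watermark)
    return total, new_watermark, len(new_records)
```

Run day after day, both functions arrive at the same total, but the incremental one processes only what arrived since the previous run, which is the cost reduction described in the answer above.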

Published Date : Aug 7 2022



Greg Rokita, Edmunds.com & Joel Minnick, Databricks | AWS re:Invent 2021

>> Welcome back to theCUBE's coverage of AWS re:Invent 2021, the industry's most important hybrid event. Very few hybrid events, of course, in the last two years. And theCUBE is excited to be here. This is our ninth year covering AWS re:Invent, this is the 10th re:Invent. We're here with Joel Minnick, who is the vice president of product and partner marketing at smoking hot company Databricks, and Greg Rokita, who is executive director of technology at Edmunds. If you're buying a car or leasing a car, you gotta go to Edmunds. We're gonna talk about busting data silos. Guys, great to see you again. >> Welcome. Glad to be here. >> All right. So Joel, what the heck is a lakehouse? This is all over the place. Everybody's talking about lakehouse. What is it? >> Well, in a nutshell, a lakehouse is the ability to have one unified platform to handle all of your traditional analytics workloads. So your BI and reporting, traditionally the workloads that you would have for your data warehouse, on the same platform as the workloads that you would have for data science and machine learning. And so if you think about the way that most organizations have built their infrastructure in the cloud today, what we have is generally customers will land all their data in a data lake, and a data lake is fantastic because it's low cost, it's open. It's able to handle lots of different kinds of data. But the challenge that data lakes have is that they don't necessarily scale very well. It's very hard to govern data in a data lake. It's very hard to manage that data in a data lake.
And they do that because data warehouses are really great at being able to have really great scale, have really great performance. The challenge though, is that data warehouses really only work for structured data. And regardless of what kind of data warehouse you adopt, all data warehouse and platforms today are built on some kind of proprietary format. So once you've put that data into the data warehouse, that's, that is kind of what you're locked into. The promise of the data lake house was to say, look, what if we could strip away all of that complexity and having to move data back and forth between all these different systems and keep the data exactly where it is today and where it is today is in the data lake. >>And then being able to apply a transaction layer on top of that. And the Databricks case, we do that through a technology and open source technology called data lake, or sorry, Delta lake. And what Delta lake allows us to do is when you need it, apply that performance, that reliability, that quality, that scale that you would expect out of a data warehouse directly on your data lake. And if I can do that, then what I'm able to do now is operate from one single source of truth that handles all of my analytics workloads, both my traditional analytics workloads and my data science and machine learning workloads, and being able to have all of those workloads on one common platform. It means that now not only do I get much, much more simple in the way that my infrastructure works and therefore able to operate at much lower costs, able to get things to production much, much faster. >>Um, but I'm also able to now to leverage open source in a much bigger way being that lake house is inherently built on an open platform. Okay. So I'm no longer locked into any kind of data format. 
And finally, probably one of the most lasting benefits of a lakehouse is that all the roles that have to touch my data, from my data engineers to my data analysts to my data scientists, they're all working on the same data, which means that the collaboration that has to happen to go answer really hard problems with data, I'm now able to do much, much more easily, because those silos that traditionally exist inside of my environment no longer have to be there. And so that is the promise of lakehouse: to have one single source of truth, one unified platform for all of my data. >> Okay, great. Thank you for that very cogent description of what a lakehouse is. Now I want to hear from the customer to see, okay, is what he just said true? So actually, let me ask you this, Greg, because the other problem that you didn't mention about the data lake is that with no schema on write, it gets messy, and Databricks, I think, correct me if I'm wrong, has begun to solve that problem, right? Through a series of tooling and AI. That's what Delta Lake does. It's a managed service. Everybody thought you were going to be like the Cloudera of Spark, and it was a brilliant move to create a managed service. And it's worked great. Now everybody has a managed service. But so can you paint a picture at Edmunds as to what you're doing with it, maybe take us through your journey? The early days of Hadoop, a data lake. Oh, that sounds good. Throw it in there. Paint a picture as to how you guys are using data and then tie it into what Joel just said. >> As Joel said, Delta Lake simplifies the architecture quite a bit. In a modern enterprise, you have to deal with a variety of different data sources: structured, semi-structured, and unstructured in the form of images and videos. And with Delta Lake, you can have one system that handles all those data sources.
So what that does is it basically removes the issue of multiple systems that you have to administer. It lowers the cost, and it provides consistency. If you have multiple systems that deal with data, the issue always arises as to which data has to be loaded into which system. And then you have issues with consistency. Once you have issues with consistency, business users and analysts will stop trusting your data. So it was very critical for us to unify the system of data handling in one place. >> Additionally, you have massive scalability. So I went to the talk from Apple saying that, you know, they can process two years' worth of data instead of just two days. At Edmunds, we have this use case of backfilling the data. So often we change the logic, and when we do, we need to reprocess massive amounts of data. With the lakehouse, we can reprocess months' worth of data in a matter of minutes or hours. And additionally, the lakehouse is based on open standards, like Parquet, which allowed us to basically hook open source and third-party tools on top of the lakehouse. For example, Amundsen: we use Amundsen for data discovery. And finally, the lakehouse approach allows people with different skillsets to work on the same source data. We have analysts, we have data engineers, we have statisticians and data scientists using their own programming languages, but working on the same core data sets without worrying about duplicating data and consistency issues between the teams. >> So what are the primary use cases where you're using lakehouse and Delta? >> So we have several use cases. One of the more interesting and important use cases is vehicle pricing. You have used Edmunds.
So, you know, you go to our website and you use it to research vehicles, but it turns out that pricing, and knowing whether you're getting a good or bad deal, is critical for our business. So with the lakehouse, we were able to develop a data pipeline that ingests the transactions, curates the transactions, cleans them, and then feeds that curated feed into the machine learning model that is also deployed on the lakehouse. So you have one system that handles this huge complexity. And as you know, it's very hard to find unicorns that know all those technologies, but because we have the flexibility of using Scala, Java, Python and SQL, we have different people working on different parts of that pipeline on the same system and on the same data. So having lakehouse really enabled us to be very agile and allowed us to deploy new sources easily when they arrived and fine-tune the model to decrease the error rates for the price prediction. So that process is ongoing, and it's a very agile process that kind of takes advantage of the different skill sets of different people on one system. >> Because you know, you guys democratized car buying, well, at least the data around car buying, because as a consumer now, you know, I know what they're paying and I can go in. Of course, they changed their algorithms as well. I mean, the dealers got really smart and then they got kickbacks from the manufacturer. So you had to get smarter. So it's a moving target, I guess. >> Great. The pricing is actually very complex. I don't have time to explain it to you, but especially in this crazy inflationary market, where used car prices are like 38% higher year over year, and new car prices are like 10% higher and they're changing rapidly, having a very responsive pricing model is extremely critical. I don't know if you're familiar with Zillow.
I mean, they almost went out of business because they mispriced their houses. So if you own their stock, you probably understand that firsthand, but, you know, >> No, but it's true, because my lease came up in the middle of the pandemic and I went to Edmunds to say, what's this car worth? It was worth like $7,000 more than the buyout cost, the residual value. I said, I'm taking it, can't pass up that deal. And so you have to be flexible. You're saying the premise, though, is that open source technology and Delta Lake and lakehouse enabled that flexibility. >> Yes, we are able to ingest new transactions daily, recalculate our model within less than an hour, and deploy the new model with new pricing, you know, almost in real time. So in this environment, it's very critical that you keep up to date and ingest the latest transactions as the prices change, and recalculate your model that predicts the future prices. >> How do the business lines inside of Edmunds interact with the data teams? You mentioned data engineers, data scientists, analysts. How do the business people get access to their data? >> Originally, we only had a core team that was using lakehouse, but because the usage was so powerful and easy, we were able to democratize it across our units. So other teams within software engineering picked it up, and then analysts picked it up. And then even business users started using the dashboarding and seeing, you know, how the price has changed over time and seeing other metrics within the, >> What did that do for data quality? Because I feel like if I'm a business person, I might have context of the data that an analyst might not have, if they're part of a team that's servicing all these lines of business. Did you find that the collaboration affected data quality? >> The biggest thing for us was the fact that we don't have multiple systems now. So you don't have to load the data.
Whenever you have to load the data from one system to another, there is always a lag. There's always a delay. There is always a problematic job that didn't do the copy correctly. And the quality is uncertain. You don't know which system tells you the truth. Now we just have one layer of data. Whether you do reports, whether you do data processing, or whether you do modeling, they all read the same data. And the second thing is the dashboarding capabilities. People that were not very technical, before, could only use Tableau, and Tableau is not the easiest thing to use if you're not technical. Now they can use it. So anyone can see how our pricing data looks, whether you're an executive, whether you're an analyst, or a casual business user. >> Hey, so many questions, you guys are gonna have to come back. I'm gonna run out of time, but you now allow a consumer to buy a car directly. Yes? Right? So that's a new service that you launched. I presume that required new data. >> We give consumers offers. Yes. And that offer, >> You offered to buy my lease. >> Exactly. And that offer leverages the pricing that we develop on top of the lakehouse. So the most important thing is accurately giving you a very good offer price, right? So if we give you a price that's not so good, you're going to go somewhere else. If we give you a price that's too high, we're going to go bankrupt like Zillow did, right? >> So to enable that, you're working off the same dataset? Yes? Did you have to spin up a, did you have to inject new data? Was there a new data source that you were working on? >> Once we curate the data sources and once we clean them, we feed them directly to the model. And all of those components are running on the lakehouse, whether you're curating the data, cleaning it, or running the model. The nice thing about lakehouse is that machine learning is a first-class citizen.
If you use something like Snowflake, I'm not going to slam Snowflake here, but you >> Have two different use cases. >> You have to load it into a different system later. You have to load it into a different system. So good luck doing machine learning on Snowflake. Right? >> Whereas Databricks, that's kind of your raison d'etre. >> So you're a data engineer? I feel like I should be a salesman or something. Yeah, I'm not saying that just because, you know, I was told to. I'm saying it because that's our use case, >> Your use case. So a question for each of you: what business results did you see when you went from pre-lakehouse to post-lakehouse? Are there any metrics you can share? And then I wonder, Joel, if you could share a sort of broader view of what you're seeing across your customer base. But Greg, what can you tell us? >> Well, before the lakehouse, we had two different systems. We had one for processing, which was still Databricks, and the second one for serving, and we iterated over Netezza or Redshift. But we figured that maintaining two different systems and loading data from one to the other was a huge overhead in administration and security costs, plus the fact that you had consistency issues. So the fact that you can have one system with centralized data solves all those issues. You have one security mechanism, one administrative mechanism, and you don't have to load the data from one system to the other. You don't have to make compromises. >> And scale is not a problem because of the cloud? >> Because you can spin up clusters at will for different use cases. So your clusters are independent. You have processing clusters that are not affecting your serving clusters. So in the past, if you were running serving, say on Netezza or Redshift, and you were doing heavy processing, your reports would be affected, but now all those clusters are separated. So
So >>Consumer data consumer can take that data from the producer independ >>Using its own cluster. Okay. >>Yeah. I'll give you the final word, Joel. I know it's been, I said, you guys got to come back. This is what have you seen broadly? >>Yeah. Well, I mean, I think Greg's point about scale. It's an interesting one. So if you look at cross the entire Databricks platform, the platform is launching 9 million VMs every day. Um, and we're in total processing over nine exabytes a month. So in terms of just how much data the platform is able to flow through it, uh, and still maintain a extremely high performance is, is bar none out there. And then in terms of, if you look at just kind of the macro environment of what's happening out there, you know, I think what's been most exciting to watch or what customers are experiencing traditionally or, uh, on the traditional data warehouse and kinds of workloads, because I think that's where the promise of lake house really comes into its own is saying, yes, I can run these traditional data warehousing workloads that require a high concurrency high scale, high performance directly on my data lake. >>And, uh, I think probably the two most salient data points to raise up there is, uh, just last month, Databricks announced it's set the world record for the, for the, uh, TPC D S 100 terabyte benchmark. So that is a place where Databricks at the lake house architecture, that benchmark is built to measure data warehouse performance and the lake house beat data warehouse and sat their own game in terms of overall performance. And then what's that spends from a price performance standpoint, it's customers on Databricks right now are able to enjoy that level of performance at 12 X better price performance than what cloud data warehouses provide. So not only are we jumping on this extremely high scale and performance, but we're able to do it much, much more efficiently. 
>>We're gonna need a whole other segment to talk about benchmarking, guys. Thanks so much, really interesting session. Thank you, and best of luck to you both. Thanks for joining the show. >>Thank you for having us. >>Very welcome. Okay, keep it right there, everybody. You're watching theCUBE, the leader in high-tech coverage, at AWS re:Invent 2021.

Published Date : Nov 30 2021

Ali Ghodsi, Databricks | Informatica World 2019

>> Live from Las Vegas, it's theCUBE, covering Informatica World 2019. Brought to you by Informatica. >> Welcome back everyone to theCUBE's live coverage of Informatica World 2019. I'm your host Rebecca Knight, along with my co-host John Furrier. We're joined by Ali Ghodsi, he is the CEO of Databricks, thank you so much for coming on, for returning to theCUBE. You're a CUBE veteran. >> Yes, thank you for having me. >> So I want to pick up on something that you said up on the main stage, and that is that every enterprise on the planet wants to add AI capabilities, but the hardest part of AI is not AI, it's the data. >> Yeah. >> Can you riff on that a little bit for our viewers? Elaborate? >> Yeah, actually, the interesting part is that, if you look at the company that succeeded with AI, the actual AI algorithms they're using, are actually algorithms from the 70s, you know, they're actually developed in the 70s, that's 50 years ago. So then how come they're succeeding now? When actually the same algorithms weren't working in the 70s, so people gave up on them. Like, these things called neural nets, right? Now they're en vogue and they're, you know, super successful. The reason is you have to apply orders of magnitude more data. If you feed those algorithms that we thought were broken orders of magnitude more data, you actually get great results, but that's actually hard. You know, dealing with petabyte scale data and cleaning it, making sure that it's actually the right data for the task at hand is not easy. So that's the part that people are struggling with. >> I saw you up on stage, I'm like ah, Ali's here, Databricks is here, that's awesome. Psyched that you stopped by theCUBE. Been a while. I wanted to get a quick update, 'cause you guys have been on a tear, doing some great work at Cal, we were just told before we came on camera. But what are you doing here? What's the, is there any announcements or news with Informatica? What's the story? 
>> Yeah, we're doing a partnership around Delta Lake, which is our next generation engine that we built, so we're super excited about that. It integrates with all of the Informatica platform. So their ingestion tools, their transformation tools, and the catalog that they also have. So we think together, this can actually really help enterprises make that transition into the AI era. >> So you know, we've been followers, it's our 10th year. I remember when we were in the Cloudera office of Mike Olson and Amr Awadallah when we first started, and the Hadoop movement started, and then the cloud came along. Right when you guys started your company, the cloud growth took off. You guys were instrumental in changing the equation in dealing with data, data lakes, whatever they were calling it back then. So now, data, holistically, is a systems architecture. On premise it's a huge challenge; cloud native, well, no real challenge, people love that. Data feeds AI, lot of risk taking, lot of reward. We're seeing the SaaS business explode, Zoom, communications. The list goes on and on. You know, for an enterprise, trying to be SaaS is hard. So you can't just take data from an enterprise and make it SaaS-ified. You've really got to think differently. What are you guys doing? How have you guys evolved and vectored into that challenge, because this is where your core value proposition initially started to change. Take us through that Databricks story and how you're solving that problem today. >> Yeah, it's a great question. Really what happened is that people started collecting a lot of data about a decade ago. And the promise was, you can do great things with this. There are all these aspirational use cases around machine learning, real time, it's going to be amazing. Right? So people started collecting it. They started storing one petabyte, two petabytes, and they kept going back to their boss and saying this project is really successful, I now have five petabytes in it.
But at some point the business said, okay, that's great, but what can you do with it? What business problems are you actually addressing? What are you solving? And so, in the last couple of years there's been a push towards, let's prove the value of these data lakes. And actually, many of these projects are falling short. Many are failing. And the reason is, people have just been dumping this data into data lakes without thinking about the structure, the quality, how it's going to be used. The use cases have been an afterthought. So the number one thing top of mind for everyone right now is, how do we make these data lakes that we have successful so we can prove some business value to our management? This is the main problem that we're focusing on. Towards this, we built something called Delta Lake. It's something you situate on top of your data lake. And what it does is it increases the quality, the reliability, the performance, and the scale of your data lake. >> (John) So it's like a filter. >> Yeah. >> The cream rises to the top. >> (Ali) Exactly. >> Let the sludge, the data swamp, stay below the clean water, if you will. >> Exactly, actually you nailed it. So basically, we look at the data as it comes in, filter as you said, and if there are any quality issues we then put it back in the data lake. It's fine, it can stay there. We'll figure out how to get value out of it later. But if it makes it into the Delta Lake, it will have high quality. Right? So that's great. And since we're already looking at all the data as it's coming in anyway, we might as well also store a lot of indices and a lot of things that let us performance-optimize it later on. So that, later, when people are actually trying to use that data, they get really high performance, they get really good quality. And we also added ACID transactions to it so that now you're also getting all those transactional use cases working on your existing data lake.
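[Editor's illustration] The filtering idea Ali describes here (check records as they arrive, leave the ones with quality issues in the raw lake, and build indices on the way in) can be sketched in a few lines of plain Python. This is only a toy model under stated assumptions, not the Delta Lake API: the `QualityGate` class, its field names, and the sample events are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QualityGate:
    """Toy stand-in for the ingest-time filter described above (hypothetical, not Delta Lake)."""
    required_fields: tuple
    curated: list = field(default_factory=list)      # records that passed the checks
    quarantined: list = field(default_factory=list)  # records left behind in the raw lake
    index: dict = field(default_factory=dict)        # key value -> positions in `curated`

    def ingest(self, record: dict, index_key: str) -> bool:
        # A record passes only if every required field is present and non-empty.
        ok = all(record.get(f) not in (None, "") for f in self.required_fields)
        if not ok:
            self.quarantined.append(record)
            return False
        self.curated.append(record)
        # Build a simple lookup index while the record is already in hand.
        # Assumes index_key is one of the required fields.
        self.index.setdefault(record[index_key], []).append(len(self.curated) - 1)
        return True

    def lookup(self, key_value):
        # Point lookup through the index instead of scanning every record.
        return [self.curated[i] for i in self.index.get(key_value, [])]


gate = QualityGate(required_fields=("user_id", "amount"))
events = [
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u2", "amount": None},  # fails the check, stays in the raw lake
    {"user_id": "u1", "amount": 4.5},
]
for event in events:
    gate.ingest(event, index_key="user_id")

print(len(gate.curated), len(gate.quarantined))  # 2 1
```

The design point the sketch tries to capture is that validation and indexing happen in the same pass over incoming data, so later readers pay neither cost.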
>> I saw, at my daughter's graduation at Cal Berkeley this weekend and yesterday, people walking around with Databricks backpacks. Very popular in academia. You guys got the young generation coming in. What's the update on the company? How many employees? What's the traction? Give us a quick business update. >> Yeah, we're about 800 employees now. About 100 people in Europe, I would say, and maybe 40-50 people in Asia-Pac. We're expanding the EMEA and the Asia business. >> (John) Growth mode. >> Yeah, growth mode. So it's expanding as fast as possible. I mean, actually, as a CEO, I always try to slow the hiring down to make sure that we keep the quality bar. So that's actually top of mind for me. But yeah we're-- >> (John) You did Delta Lake on that one. >> Yeah (laughing) >> Exactly. Yeah and we're super excited about working with these universities. We get a lot of graduate students from top universities-- >> And Cal had the first ever class in the college of data analytics, what was that? Data analytics, the first inaugural class graduated. Shows how early it is. >> Yeah, yeah, yeah. And they actually used Databricks, the Community Edition; a class of over a thousand students at Cal used the platform. So they're going to be trained in data science as they come out. >> So I want to ask about that because, as you said, you're trying to slow down the hiring to make sure that you are maintaining a high bar for your new hires. But yet, I'm sure there's a huge demand because you are in growth mode. So what are you doing? You said you're working with universities to make sure that the next generation is trained up and is capable of performing at Databricks. So tell us more about those efforts. >> Yeah I mean, so, obviously university recruiting is big for us. At Cal, I think Databricks has the longest line of all the companies that come there on the career fair day. So, we work very closely with these universities.
I think the next generation, this generation that's coming out today, actually is data science trained. So it's a big difference. There is a huge skills gap out there. Every big enterprise you talk to tells you, my biggest problem is actually that I don't have skilled people. Can you help me hire people? I say, hey, we're not in the recruiting business. But the good news is, if you look at the universities, they're all training thousands and thousands of data scientists every year now. I can tell you, just at Cal, because I happen to be on the faculty there, almost every applicant to grad school now wants to do something AI related. Which has actually led to this: if you look at all the programs in universities today, professors who used to do networking say, we do intelligent networks. People who do databases say, we do intelligent databases. People who do systems research say, hey, we do intelligent systems, right? So what that means is, in a couple of years you'll have lots of students coming out, and these companies that are now struggling with hiring will then be able to hire this talent and will actually succeed better with these AI projects. >> As they say in Berkeley, nothing like a good revolution once in a while. AI is kind of changing everyone over. I've got to ask you, for the young kids out there, and parents who have kids either in elementary school or high school, everyone is trying to figure it out, and there's no clear playbook yet. We're starting to see first generation training, but is there a skill set? Because there's a range in surface area: you've got hardcore coding to ethics, and everything in between, from visualization, multiple dimensions of opportunities.
What skills do you think people could hone or tweak that may not be on a curriculum, or pieces of different curriculums in school, that would be a good foundation for folks learning and wanting to jump in to data and data value, whether it's coding or ethics? >> Yeah, just looking at my own background and seeing what I got to learn in school, the thing that was lacking, compared to what's needed today, is statistics. Understanding of statistics, statistical knowledge. That, I think, is going to be pervasive. So I think, 10, 15 years from now, no matter which field you're in, actually whatever job you have, you have to have some basic level of statistical understanding, 'cause the systems you're working with will be spitting out statistics and numbers and you need to understand what are false positives, what is this, what is the sample, what is that? What do these things mean? So that's one thing that's definitely missing, and actually it's coming. That's one. The second is computing will continue being important. So, in the intersection of those two is, I think, a lot of those jobs. >> In all fields, we were talking about earlier, biology, everything's intersecting, biochemistry to whatever, right? >> (Ali) Yeah. >> I got to ask you about, well I'm a little old school, I'm 53 years old, but I remember when I broke into the business coding, I used to walk into departments, they were called DP, data processing. So we're getting into the data processing world now, you've got statistics, you've got pipelines, these are data concepts. So I got to ask you, as companies that are in the enterprise may be slower to move to the cutting edge like you guys are, they've got to figure out where to store the data. So can you share your opinion or view on how customers are thinking and how they maybe should be architecting data on premise, in the cloud.
Certainly cloud's great, if you're cloud native for pure SaaS, and born in the cloud like a start-up. But if you're a large enterprise, and you want to be SaaS-like, to have all that benefit, take the risk with the reward of being agile, you've got to have data, because if you don't get the data into the machine learning or AI, you're not going to have good AI. So you need to get that data feeding in fast. And if it's constrained with regulation and compliance, you're screwed. So what's your view on this? Where should it be stored? What's your opinion? >> Yeah, we've had the same opinion for five, six years, right? Which is, the data belongs in the cloud. Don't try to do this yourself. Don't try to do this on prem. Don't store it in Hadoop, it's not built for this. Store it in the cloud. In the cloud, first of all, you get a lot of security benefits that the cloud vendors are already working on. So that's one good thing about it. Second, it's reliable. You get the 10, 11 nines of availability, so that's great, you get that. Start collecting data there. Another reason you want to do it in the cloud is that a lot of the data sets that you need to actually get good quality results are available in the cloud. Oftentimes what happens with AI is, you build a predictive model, but actually, it's terrible. It didn't work well. So you go back, and the main trick, the first trick you use to increase the quality, is actually augmenting that data with other data sets. You might purchase those data sets from other vendors. You don't want to be shipping hard drives around or, you know, getting that into your data center. Those will be available in the cloud, so you can augment that data. So we're big fans of storing your data in data lakes, in the cloud. We obviously believe that you need to make that data high quality and reliable. For that, we believe the Delta Lake platform, the open-source project that we created, is a great vehicle.
But I think moving to the cloud is the number one thing. >> (John) And hybrid works with that if you need to have something on premise? >> In my opinion the two worlds are so different, that it's hard. You hear a lot of vendors that say we're the hybrid solution that works on both and so on. But the two models are so different, fundamentally, that it's hard to actually make them work well. I have not yet seen a customer yet or enterprise. You see a lot of offerings, where people say hybrid is the way. Of course, a lot of on prem vendors are now saying, hey, we're the hybrid solution. I haven't actually seen that be successful to be frank. Maybe someone will crack that nut but-- >> I think it's an operational question to see who can make it work. Ali, congratulations on all your success. Great to see you. >> Yeah it's been great having you on the show. >> Thank you so much for having me. >> You are watching theCUBE, Informatica 2019. I'm Rebecca Knight, for John Furrier, stay tuned.

Published Date : May 21 2019

Ali Ghodsi, Databricks - #SparkSummit - #theCUBE

>> Narrator: Live from San Francisco, it's theCUBE. Covering Spark Summit 2017. Brought to you by Databricks. (upbeat music) >> Welcome back to theCUBE, day two at Spark Summit. It's very exciting. I can't wait to talk to this gentleman. We have the CEO from Databricks, Ali Ghodsi, joining us. Ali, welcome to the show. >> Thank you so much. >> David: Well, we sat here and watched the keynote this morning with Databricks and you delivered. Some big announcements. Before we get into some of that, I want to ask you, it's been about a year and a half since you transitioned from VP of Products and Engineering into a CEO role. What's the most fun part of that and maybe what's the toughest part? >> Oh, I see. That's a good question and that's a tough question too. Most fun part is... You know, you touch many more facets of the business. So in engineering, it's all the tech and you're dealing only with engineers, mostly. Customers are one hop away, there's a product management layer between you and the customers. So you're very inwards focused. As a CEO you're dealing with marketing, finance, sales, these different functions. And then, externally with media, with stakeholders, a lot of customer calls. There are many, many more facets of the business that you're seeing. And it also gives you a perspective that you couldn't have before. You see how the pieces fit together so you actually can have a better perspective and see further out than you could before. Before, I was more in my own pick situation where I was seeing sort of just the things relating to engineering, so that's the best part. >> You're obviously working closely with customers. You introduced a few customers this morning up on stage. But after the keynote, did you hear any reactions from people? What are they saying? >> Yes, the keynote was just recently, so on my way here I've had multiple people sort of... A couple of people high-fived me just before I got up on stage here.
On serverless, people are really excited about that. Less devops, less configuration, let them focus on the innovation, they want that. So that's something that's celebrated. Yesterday-- >> Recap that real quickly for our audience here, what the serverless offering is. >> Absolutely, so it's very simple. We want lots of data scientists to be able to do machine learning without having to worry about the infrastructure underneath it. So we have something called serverless pools, and with serverless pools you can just have lots of data scientists use it. Under the hood, this pool of resources shrinks and expands automatically. It adds storage, if needed. And you don't have to worry about the configuration of it. And it also makes sure that it's isolating the different data scientists. So if one data scientist happened to run something that takes much more resources, it won't affect the other data scientists that are sharing that. So the short story of it is you cut costs significantly, you can now have 3000 people share the same resources, and it enables them to move faster because they don't have to worry about all the devops that they otherwise have to do. >> George, is that a really big deal? >> Well, we know whenever there's infrastructure that gets between a developer or data scientist and their outcomes, that's friction. I'd be curious to say, let's put that into a bigger perspective, which is: if you go back several years, what were the class of apps that Spark was being used for, and in conjunction with what other technologies? Then bring us forward to today and then maybe look out three years. >> Ali: Yeah, that's a great question. So from the very beginning, data is key for any of these predictive analytics that we are doing. So that was always a key thing. But back then we saw more Hadoop data lakes. There were more data lakes, data reservoirs, data marts that people were building out. We saw also a lot of traditional data warehousing.
These days, we see more and more things moving to the cloud. The Hadoop data lake we see, oftentimes at enterprises, being transformed into cloud blob storage. That's cheaper, it's geo-replicated, it's on many continents. That's something that we've seen happen. And we work across any of these, frankly. From the very beginning, one of Spark's strengths is that it integrates really well wherever your data is. And there's a huge community of developers around it, over 1000 people now that have contributed to it. Many of these people are in other organizations, they're employed by other companies, and their job is to make sure that Databricks or Spark works really, really well with, say, Cassandra or with S3. That's a shift that we're seeing. In terms of the applications people are building, it's moving more into production. Four years ago much more of it was interactive, exploratory. Now we're seeing production use cases. The fraud analytics use case that I mentioned, that's running continuously and the requirements there are different. You can't go down for ten minutes on a Saturday morning at 4 a.m. when you're doing credit card fraud, because that's a lot of fraud and that affects the business of, say, Capital One. So that's much more crucial for them. >> So what would be the surrounding infrastructure and applications to make that whole solution work? Would you plug into a traditional system of record at the sales order entry kind of process point? Are you working off sort of semi-real-time or near real-time data? And did you train the models on the data lake? How did the pieces fit together? >> Unfortunately the answer depends on the particular architecture that the customer has. Every enterprise is slightly different. But it's not uncommon that the data is coming in. They're using Spark Structured Streaming in Databricks to get it into S3, so that's one piece of the puzzle. Then when it ends up there, from then on it funnels out to many different use cases.
It could be a data warehousing use case, where they're just using interactive SQL on it. So that's the traditional interactive use case, but it could be a real-time use case, where it's actually taking the data that it's processed and it's detecting anomalies and putting triggers in other systems, and then those systems downstream will react to those triggers for anomalies. But it could also be that it's periodically training models and storing the models somewhere. Oftentimes it might be in a Cassandra, or in a Redis, or something of that sort. It will store the model there and then some web application can then take it from there, do point queries to it and say, okay, I have a particular user that came in here, George, now quickly look up what is his feature vector, figure out what product recommendations we should show to this person, and then it takes it from there. >> So in those cases, Cassandra or Redis, they're playing the serving layer. But generating the prediction model is coming from you and they're just doing the inferencing, the prediction itself. So if you look out several years, without asking you the roadmap, which you can feel free to answer, how do you see that scope of apps expanding or the share of an existing app like that? >> Yeah, I think there are two interesting trends that I believe in, and I'll be foolish enough to make predictions. One is that I think that data warehousing, as we know it today, will continue to exist. However, it will be transformed, and all the data warehousing solutions that we have today will add predictive capabilities or they will disappear. So let me motivate that. If you have a data warehouse with customer data in it and a fact table, you have all your transactions there, you have all your products there. Today, you can plug in BI tools and on top of that you can see what's my business health today and yesterday. But you can't ask it: tell me about tomorrow. Why not?
The data is there, so why can I not ask of this customer data, you tell me which of these customers are going to churn, or which one of them should I reach out to because I can possibly upsell them? Why wouldn't I want to do that? I think everyone would want to do that, and every data warehousing solution in ten years will have these capabilities. Now with Spark SQL you can do that, and the announcement yesterday showed you also how you can bake models, machine learning models, and export them so a SQL analyst can just access them directly with no machine learning experience. It's just a simple function call and it just works. So that's one prediction I'll make. The second prediction I'll make is that we're going to see lots of revolutions in different industries, beyond the traditional 'get people to click on ads' and understand social behavior. We're going to go beyond that. So for those use cases it will be closer to the things I mentioned, like Shell, and what you need to do there is involve these domain experts. The domain experts will come in, the doctors, or the machine specialists, you have to involve them in the loop. And they'll be able to transform maybe much less exotic applications. It's not the super high-tech Silicon Valley stuff, but it's nevertheless extremely important to every enterprise, to every vertical, on the planet. That's, I think, the exciting part of where predictions will go in the next decade or two. >> If I were to try and pick out the most man-bites-dog kind of observation in there, you know, it's supposed to be the unexpected thing, I would say it's where you said all data warehouses are going to become predictive services. Because what we've been hearing is sort of the other side of that coin, which is that all the operational databases will get all the predictive capabilities. But you said something very different.
I guess my question is, are you seeing the advanced analytics going to the data warehouse because the repository of data is going to be bigger there and so you can build better models, or because it's not burdened with transaction SLAs, so you can serve up predictions quicker? >> Data warehousing has been about basic statistics. It's been SQL; the language that is used is there to get descriptive statistics. Tables with averages and medians, that's statistics. Why wouldn't you want to have advanced statistics which now does predictions on it? It just so happens that SQL is not the right interface for that. So it's going to be very natural for people who have already been asking statistical questions for the last 30 years of their customer data, these massive troves of data that they have stored. Why wouldn't they want to also say, 'okay, now give me more advanced statistics'? I'm not an expert on advanced statistics, but you, the system, tell me what I should watch out for. Which of these customers do I talk to? Which of the products are in trouble? Which of the products are not, or which parts of my business are not doing well now? Predict the future for me. >> George: When you're doing that though, you're now doing it on data that has a fair amount of latency built into it. Because that's how it got into the data warehouse. Whereas if it's in the operational database, it's typically really low latency stuff. Where and why do you see that distinction? >> I do think also that we'll see more and more real-time engines take over. If you do things in real-time you can do it for a fraction of the cost. So we'll also see those capabilities come in. So you don't have to... Your question is, why would you want to batch everything into a central warehouse once a week, and I agree with that. It will be streaming in live and then on that you can do predictions, you can do basic analytics.
I think basically the lines will blur between all these technologies that we're seeing. In some sense, Spark actually was the precursor to all that. So Spark already was unifying machine learning, SQL, ETL, real-time, and you're going to see that appear everywhere. >> You mentioned Shell as an example, one of your customers. You also had HP, Capital One, and you developed this unified analytics platform that's solving some of their common problems. Now that you're in the mood to make predictions, what do you think are going to be the most compelling use cases or industries where you're going to see Databricks going in the future? >> That's a hard one. Right now, I think healthcare. There are a lot of data sets, there's a lot of gene sequencing data. They want to be able to use machine learning. In fact, I think those industries are being transformed slowly from using classical statistics into machine learning. We've actually helped some of these companies do that. We've set up workshops and they've gotten people trained. And now they're hiring machine learning experts that are coming in. So that's one, I think, the healthcare industry, whether it's for drug testing, clinical trials, even diagnosis. That's a big one. I do think industrial IT. These are big companies with lots of equipment, they have tons of sensor data, massive data sets. There are a lot of predictions that they can do on that. So that's a second one I would say. Financial industry, they've always been about predictions, so it makes a lot of sense that they continue doing that. Those are the biggest ones for Databricks. But I think now also, slowly, other verticals are moving into the cloud. We'll see more of other use cases as well. But those are the biggest ones I see right now. It's hard to say where it will be ten years from now, or 15. Things are going so fast that it's hard to even predict six months. >> David: Do you believe IoT is going to be a big business driver? >> Yes, absolutely.
>> I want to circle back to where you said that we've got different types of databases but we're going to unify the capabilities. Without saying, it's not like one wins, one loses. >> Ali: Yes, I didn't want to do that. >> So describe maybe the characteristics of what a database that complements Spark really well might look like. >> That's hard for me to say. The capabilities of Spark, I think, are here to stay. The ability to ETL a variety of data that doesn't have structure, so Structured Query Language, SQL, is not fit for it, that is really important, and it's going to become more important since data is the new oil, as they say. Well, then it's going to be very important to be able to work with all kinds of data and get that into the systems. There are more things being created every day. Devices, IoT, whatever it is, that are spewing out this data in different forms and shapes. So being able to work with that variety, that's going to be an important property. So they'll have to do that. That's the ETL portion or the ELT portion. The real-time portion, not having to do this in a batch manner once a week, because now time is a competitive advantage. So if I'm one week behind you, that means I'm going to lose out. So doing that in real-time, or near real-time or human real-time, that's going to be really important. So that's going to come as well, I think, and people will demand that. That's going to be a competitive advantage. Wherever you can add that secret sauce it's going to add value to the customers. And then finally the predictive stuff, adding the predictive stuff. But I think people will want to continue to also do all the old stuff they've been doing. I don't think that's going to go away. Those bring value to customers, they want to do all those traditional use cases as well. >> So what about now, where customers expect to have some, not clear how much, on-prem application platform like Spark.
Some in the cloud, now that you've totally reordered the TCO equation. But then also at the edge for IoT-type use cases, do you have to slim down Spark to work at the edge? If you have serverless working in the cloud, does that mean you have to change the management paradigm on-prem? What does that mix look like? How does someone, you know, how does a Fortune 200 company get their arms around that? >> Ali: Yeah, this is a surprising thing, the most surprising thing for me in the last year, is how many of those Fortune 200s that I was talking to three years ago were saying 'no way, we're not going into the cloud. You don't understand the regulations that we are facing or the amount of data that we have.' Or 'we can do it better,' or 'the security requirements that we have, no one can match that.' To now, those very same companies are saying 'absolutely, we're going.' It's not about if, it's about when. Now I would be hard-pressed to find any enterprise that says 'no, we're not going to go, ever.' And some companies we've even seen go from the cloud to on-prem, and then now back. Because the prices are getting more competitive in the cloud. Because now there are three, at least, major players that are competing and they're well-funded companies. In some sense, you have ad money and office money and retail money being thrown at this problem. Prices are getting competitive. Very soon, most IT folks will realize, there's no way we can do this faster, or better, or more reliably and securely ourselves. >> David: We've got just a minute to go here before the break so we're going to kind of wrap it up here. And we've got over 3000 people here at Spark Summit, so it's the Spark community. I want you to talk to them for a moment. What problems do you want them to work on the most? And what are we going to be talking about a year from now at this table? >> The second one is harder. So I think the Spark community is doing a phenomenal job. I'm not going to tell them what to do.
They should continue doing what they are doing already, which is integrating Spark in the ecosystem, adding more and more integrations with the greatest technologies that are happening out there. Continue the innovation, and we're super happy to have them here. We'll continue it as well, we'll continue to host this event and look forward to also having a Spark Summit in Europe, and also on the East Coast soon. >> David: Okay, so I'm not going to ask you to make any more predictions. >> Alright, excellent. >> David: Ali, this is great stuff today. Thank you so much for taking some time and giving us more insight after the keynote this morning. Good luck with the rest of the show. >> Thank you. >> Thanks, Ali. And thank you for watching. That's Ali Ghodsi, CEO of Databricks. We are at Spark Summit 2017 here, on theCUBE. Thanks for watching, stay with us. (upbeat music)
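
Earlier in this conversation, Ali describes exporting a model so that a SQL analyst can use it through "a simple function call." The sketch below illustrates that general idea only. It uses Python's built-in SQLite module rather than Spark SQL, and the table, the `churn_score` function, and its coefficients are all hypothetical stand-ins for a trained model, not anything Databricks announced.

```python
import math
import sqlite3

# Hypothetical, hand-picked coefficients standing in for a trained churn model.
WEIGHTS = {"intercept": -2.0, "support_tickets": 0.8, "months_inactive": 0.5}

def churn_score(support_tickets, months_inactive):
    """Logistic score between 0 and 1; higher means more likely to churn."""
    z = (WEIGHTS["intercept"]
         + WEIGHTS["support_tickets"] * support_tickets
         + WEIGHTS["months_inactive"] * months_inactive)
    return 1.0 / (1.0 + math.exp(-z))

conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL queries can call it by name.
conn.create_function("churn_score", 2, churn_score)
conn.execute(
    "CREATE TABLE customers (name TEXT, support_tickets INTEGER, months_inactive INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("acme", 0, 0), ("globex", 5, 4), ("initech", 1, 2)],
)

# The analyst writes plain SQL; the model is just another function in the query.
at_risk = conn.execute(
    "SELECT name, churn_score(support_tickets, months_inactive) AS score "
    "FROM customers ORDER BY score DESC"
).fetchall()
for name, score in at_risk:
    print(f"{name}: {score:.2f}")
```

The point of the design is that the person writing the query never touches the model code: they only need to know the function name and its arguments.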

Published Date : Jun 8 2017


ENTITIES

Entity	Category	Confidence
George	PERSON	0.99+
David	PERSON	0.99+
HP	ORGANIZATION	0.99+
Ali Ghodsi	PERSON	0.99+
Europe	LOCATION	0.99+
Ali	PERSON	0.99+
Databricks	ORGANIZATION	0.99+
San Francisco	LOCATION	0.99+
Capital One	ORGANIZATION	0.99+
three	QUANTITY	0.99+
Today	DATE	0.99+
one week	QUANTITY	0.99+
tomorrow	DATE	0.99+
last year	DATE	0.99+
ten years	QUANTITY	0.99+
yesterday	DATE	0.99+
three years	QUANTITY	0.99+
3000 people	QUANTITY	0.99+
One	QUANTITY	0.99+
ten minutes	QUANTITY	0.99+
Four years ago	DATE	0.99+
three years ago	DATE	0.99+
next decade	DATE	0.99+
six months	QUANTITY	0.99+
Yesterday	DATE	0.98+
over 1000 people	QUANTITY	0.98+
East Coast	LOCATION	0.98+
today	DATE	0.98+
one	QUANTITY	0.98+
one prediction	QUANTITY	0.98+
second prediction	QUANTITY	0.98+
Silicon Valley	LOCATION	0.97+
Spark Summit 2017	EVENT	0.97+
Spark	TITLE	0.97+
once a week	QUANTITY	0.97+
Sparks Summit	EVENT	0.97+
Fortune 200	ORGANIZATION	0.96+
over 3000 people	QUANTITY	0.96+
about a year and a half	QUANTITY	0.95+
Shell	ORGANIZATION	0.95+
Spark	ORGANIZATION	0.95+
Sparks	TITLE	0.94+
IOT	ORGANIZATION	0.94+
day two	QUANTITY	0.94+
Sparks Summit 2017	EVENT	0.94+
this morning	DATE	0.93+
second one	QUANTITY	0.93+
S3	TITLE	0.85+
one data scientist	QUANTITY	0.85+
15	QUANTITY	0.85+
Saturday morning at	DATE	0.84+
tons	QUANTITY	0.83+
S3	ORGANIZATION	0.8+
one piece of the puzzle	QUANTITY	0.79+
couple people	QUANTITY	0.77+
Prim	ORGANIZATION	0.76+
several years	QUANTITY	0.75+

Reynold Xin, Databricks - #Spark Summit - #theCUBE

>> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks. >> Welcome back, we're here at theCUBE at Spark Summit 2017. I'm David Goad here with George Gilbert. George. >> Good to be here. >> Thanks for hanging with us. Well, here's the other man of the hour. We just talked with Ali, the CEO at Databricks, and now we have the Chief Architect and co-founder at Databricks, Reynold Xin. Reynold, how are you? >> I'm good. How are you doing? >> David: Awesome. Enjoying yourself here at the show? >> Absolutely, it's fantastic. It's the largest Summit. There are a lot of interesting things, a lot of interesting people that I meet. >> Well, I know you're a really humble guy, but I had to ask Ali what I should ask Reynold when he gets up here. Reynold is one of the biggest contributors to Spark. And you've been with us for a long time, right? >> Yes, I've been contributing to Spark for about five or six years, and I probably have the most commits to the project, and lately I'm working more with other people to help design the roadmap for both Spark and Databricks with them. >> Well, let's get started talking about some of the new developments that maybe our audience at theCUBE hasn't heard here in the keynote this morning. What are some of the most exciting new developments? >> So, I think in general if we look at Spark, there are three directions I would say we're doubling down on. The first direction is deep learning. Deep learning is extremely hot and it's very capable, but as we alluded to earlier in a blog post, deep learning has reached sort of a mass produced point in which it shows tremendous potential but the tools are very difficult to use. And we are hoping to democratize deep learning and do what Spark did to big data, to deep learning, with this new library called Deep Learning Pipelines.
What it does is integrate different deep learning libraries directly in Spark, and it can actually expose models in SQL. So even the business analysts are capable of leveraging that. So, that's one area, deep learning. The second area is streaming. Streaming, again, I think that a lot of customers have aspirations to actually shorten the latency and increase the throughput in streaming. So, the Structured Streaming effort is going to be generally available, and last month alone on the Databricks platform, I think our customers processed three trillion records using Structured Streaming. And we also have a new effort to actually push down the latency all the way to the sub-millisecond range. So, you can really do blazingly fast streaming analytics. And last but not least is the SQL data warehousing area. Data warehousing, I think, is a very mature area from the outside-of-big-data point of view, but from a big data one it's still pretty new and there are a lot of use cases popping up there. And in Spark, with approaches like the CBO, and also here in the Databricks Runtime with DBIO, we're actually substantially improving the performance and the capabilities of data warehousing features. >> We're going to dig in to some of those technologies here in just a second with George. But have you heard anything here so far from anyone that's changed your mind maybe about what to focus on next? >> So, one thing I've heard from a few customers is actually visibility and debuggability of the big data jobs. Many of them are fairly technical engineers and some of them are less sophisticated engineers, and they have written jobs and sometimes the job runs slow. And so the performance engineer in me would think, so how do I make the job run fast? The different way to actually solve that problem is how can we expose the right information so the customer can actually understand and figure it out themselves.
This is why my job is slow and this is how I can tweak it to make it faster. Rather than giving people the fish, you actually give them the tools to fish. >> If you can call that bugability. >> Reynold: Yeah, debuggability. >> Debuggability. >> Reynold: And visibility, yeah. >> Alright, awesome. George. >> So, let's go back and unpack some of those kind of juicy areas that you identified. On deep learning you were able to distribute, if I understand things right, the predictions. You could put models out on a cluster, but the really hard part, the compute intensive stuff, was training across a cluster. And so Deeplearning4j and I think Intel's BigDL, they were written for Spark to do that. But with all the excitement over some of the new frameworks, are they now at the point where they are as good citizens on Spark as they are in their native environments? >> Yeah so, this is a very interesting question. Obviously a lot of other frameworks are becoming more and more popular, such as TensorFlow, MXNet, Theano, Keras and Caffe. What the Deep Learning Pipelines library does is actually expose all these single-node deep learning tools, highly optimized for, say, GPUs or CPUs, to be available as an estimator or like a module in a pipeline of the machine learning pipeline library in Spark. So now users can actually leverage Spark's capability to, for example, do hyperparameter tuning. So when you're building a machine learning model, it's fairly rare that you just run something once and you're good with it. Usually you have to fiddle with a lot of the parameters. For example, you might run over a hundred experiments to actually figure out what is the best model I can get. This is where Spark really shines. When you combine Spark with some deep learning library, be it BigDL, be it MXNet, be it TensorFlow, you could be using Spark to distribute that training and then do cross validation on it. So you can actually find the best model very quickly.
And Spark takes care of all the job scheduling, all the fault tolerance properties and how you read data in from different data sources. >> And without my dropping too much into the weeds, there was a version of that where Spark wouldn't take care of all the communications. It would maybe distribute the models and then do some of the averaging of what was done out on the cluster. Are you saying that all that now can be managed by Spark? >> In that library, Spark will be able to actually take care of picking the best model out of it. And there are different ways you can define the best. The best could be some average of some different models. The best could be just pick one out of this. The best could be maybe there's a tree of models that you classify it on. >> George: And that's a hyperparameter configuration choice? >> So that is actually built-in functionality in Spark's machine learning pipeline. And now what we're doing is you can actually plug all those deep learning libraries directly into that as part of the pipeline to be used. Another, maybe just to add, >> Yeah, yeah, >> Another really cool functionality of Deep Learning Pipelines is transfer learning. So as you said, deep learning takes a very long time, it's very computationally demanding. And it takes a lot of resources and expertise to train. But with transfer learning, what we allow the customers to do is take an existing deep learning model that was well trained in a different domain, retrain it on a very small amount of data very quickly, and adapt it to a different domain. That's sort of how the demo on the James Bond car works. So there is a general image classifier that we retrain on probably just a few thousand images. And now we can actually detect whether a car is James Bond's car or not. >> Oh, and the implications there are huge, which is you don't have to have huge training data sets for modifying a model for a similar situation.
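
The tuning loop Reynold describes above, many trials over a set of parameter values, each scored by cross validation, with the best one picked at the end, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Spark or Deep Learning Pipelines code: the toy data, the one-parameter ridge model, the grid of values, and the thread pool standing in for a cluster are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data: y is roughly 2 * x, with small fixed perturbations.
XS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
YS = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

def fit_ridge(xs, ys, lam):
    """One-parameter ridge regression through the origin: w = sum(xy) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(lam, k=3):
    """Mean squared validation error for `lam`, averaged over k folds."""
    fold_errors = []
    for fold in range(k):
        pairs = list(zip(XS, YS))
        train = [p for i, p in enumerate(pairs) if i % k != fold]
        held_out = [p for i, p in enumerate(pairs) if i % k == fold]
        w = fit_ridge([x for x, _ in train], [y for _, y in train], lam)
        fold_errors.append(sum((y - w * x) ** 2 for x, y in held_out) / len(held_out))
    return sum(fold_errors) / k

# Hypothetical grid of regularization strengths to try.
GRID = [0.0, 0.1, 1.0, 10.0, 100.0]

# Each trial is independent, so the trials can run in parallel. A thread pool
# stands in here for the cluster that Spark would schedule the trials on.
with ThreadPoolExecutor() as pool:
    errors = list(pool.map(cv_error, GRID))

best_lam, best_err = min(zip(GRID, errors), key=lambda pair: pair[1])
print(f"best lambda = {best_lam}, cv error = {best_err:.4f}")
```

Because every trial is independent of the others, this kind of search parallelizes naturally, which is the property Reynold is pointing at when he says this is where Spark shines.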
I want to, in the time we have, there's always been this debate about whether Spark should manage state, whether it's a database or a key-value store. Tell us how the thinking about that has evolved and then how the integration interfaces for achieving that have evolved. >> One of the, I would say, advantages of Spark is that it's unbiased and works with a variety of storage systems, be it Cassandra, be it HBase, be it HDFS, be it S3. There is a metadata management functionality in Spark, which is the catalog of tables that customers can define. But the actual storage sits somewhere else. And I don't think that will change in the near future, because we do see that the storage systems have matured significantly in the last few years, and I just wrote a blog post last week about the advantage of S3 over HDFS, for example. The storage price is being driven down by almost a factor of 10X when you go to the cloud. I just don't think it makes sense at this point to be building storage systems for analytics. That said, I think there's a lot to build on top of existing storage systems. There are actually a lot of opportunities for optimization on how you can leverage the specific properties of the underlying storage system to get to maximum performance. For example, how are you doing intelligent caching, how do you start thinking about building indexes against the data that's stored for scan workloads. >> With Tungsten, you take advantage of the latest hardware, where we get more memory-intensive systems, and now the Catalyst Optimizer has a cost-based optimizer, or will have, and large memory. Can you change how you go about knowing what data you're managing in the underlying system and therefore achieve a tremendous acceleration in performance?
>> This is actually one area we invested in with the DBIO module as part of Databricks Runtime. What DBIO does, and a lot of this is still in progress, but for example, we're adding some form of indexing capability to the system so we can quickly skip and prune out all the irrelevant data when the user is doing simple point look-ups. Or if the user is doing a scan-heavy workload with some predicates. That actually has to do with how we think about the underlying data structure. The storage system is still the same storage system, like S3, but we're actually adding indexing functionality on top of it as part of DBIO. >> And so what would be the application profiles? Is it just for the analytic queries, or can you do the point look-ups and updates in that sort of scenario too? >> So it's interesting you're talking about updates. Updates are another thing that we've gotten a lot of feature requests on. We're actively thinking about how we will support update workloads. Now, that said, I just want to emphasize that for both use cases, doing point look-ups and updates, we're still talking about the context of an analytic environment. So we would be talking about, for example, maybe bulk updates or low-throughput updates rather than doing transactional updates in which every time you swipe a credit card, some record gets updated. That probably belongs more on the transactional databases like Oracle or even MySQL. >> What about when you think about people who started out with Spark on-prem, they realize they're going to put much more of their resources in the cloud, but with IIoT, industrial IoT type applications, they're going to have Spark maybe in a gateway server on the edge? What do you think that configuration looks like? >> Really interesting, it's kind of two questions maybe. The first is the hybrid on-prem, cloud solution. Again, one of the nice advantages of Spark is the decoupling of storage and compute.
So when you want to move, for example, workloads from on-prem to the cloud, the one you care the most about is probably actually the data, 'cause with the compute it doesn't really matter that much where you run it, but data's the one that's hard to move. We do have customers that are leveraging Databricks in the cloud but actually reading data directly from on-prem, relying on the caching solution we have that minimizes the data transfer over time. And that is one route that I would say is pretty popular. Another one is, with Amazon you can literally use their Snowball functionality. You give them hard drives, and trucks will ship your data and put it directly in S3. With IoT, a common pattern we see is a lot of the edge devices would be actually pushing the data directly into some firehose like Kinesis or Kafka or, I'm sure Google and Microsoft both have their own variants of that. And then you use Spark to directly subscribe to those topics and process them in real time with Structured Streaming. >> And so would Spark be down, let's say, at the site level, if it's not on the device itself? >> It's an interesting thought, and maybe one thing we should actually consider more in the future is how do we push Spark to the edges. Right now it's more of a centralized model in which the devices push data into Spark, which is centralized somewhere. I've seen, for example, I don't remember exactly the use case, but it has to do with some scientific experiment at the North Pole. And of course there you don't have a great uplink for transferring all the data back to some national lab, and rather they would do smart parsing there and then ship the aggregated result back. That's another one, but it's less common. >> Alright, well just one minute now before the break, so I'm going to give you a chance to address the Spark community. What's the next big technical challenge you hope people will work on for the benefit of everybody?
>> In general Spark came along with two focuses. One is performance, the other one's ease of use. And I still think big data tools are too difficult to use. Deep learning tools, even harder. The barrier to entry is very high for all of these tools. I would say we might have already addressed performance to a degree that I think it's actually pretty usable. The systems are fast enough. Now we should work on actually making (mumbles) even easier to use. It's also what we focus a lot on at Databricks here. >> David: Democratizing access, right? >> Absolutely. >> Alright, well Reynold, I wish we could talk to you all day. This is great. We are out of time now. We appreciate you coming by theCUBE and sharing your insights, and good luck with the rest of the show. >> Thank you very much, David and George. >> Thank you all for watching. We're here at theCUBE at Spark Summit 2017. Stay tuned, lots of other great guests coming up today. We'll see you in a few minutes.
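
In this conversation Reynold mentions adding "some form of indexing capability" so the system can skip irrelevant data during point look-ups. He does not say how DBIO does this, so the sketch below shows only one common, generic way to get that effect: keep the minimum and maximum value for each block of stored rows and read a block only when the target could fall inside its range. The `Block` class, the helper functions, and the sample data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """A chunk of stored values plus the min and max of those values."""
    rows: list
    lo: int
    hi: int

def build_blocks(values, block_size):
    """Split values into fixed-size blocks and record per-block min/max."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(rows=chunk, lo=min(chunk), hi=max(chunk)))
    return blocks

def scan_equal(blocks, target):
    """Return the matching values and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if target < block.lo or target > block.hi:
            continue  # the statistics prove this block cannot hold the target
        blocks_read += 1
        matches.extend(v for v in block.rows if v == target)
    return matches, blocks_read

values = list(range(100))          # sorted data keeps each block's range tight
blocks = build_blocks(values, 10)  # 10 blocks of 10 values each
matches, blocks_read = scan_equal(blocks, 42)
print(matches, blocks_read)  # prints: [42] 1
```

The underlying storage is untouched in this scheme: the statistics sit beside the data, which fits Reynold's description of the storage system staying the same while indexing is added on top of it.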

Published Date : Jun 7 2017


ENTITIES

Entity	Category	Confidence
George Gilbert	PERSON	0.99+
Reynold	PERSON	0.99+
Ali	PERSON	0.99+
David	PERSON	0.99+
George	PERSON	0.99+
Microsoft	ORGANIZATION	0.99+
Amazon	ORGANIZATION	0.99+
David Goad	PERSON	0.99+
Databricks	ORGANIZATION	0.99+
Google	ORGANIZATION	0.99+
North Pole	LOCATION	0.99+
San Francisco	LOCATION	0.99+
Reynold Xin	PERSON	0.99+
last month	DATE	0.99+
10X	QUANTITY	0.99+
two questions	QUANTITY	0.99+
three trillion records	QUANTITY	0.99+
second area	QUANTITY	0.99+
today	DATE	0.99+
last week	DATE	0.99+
Spark	TITLE	0.99+
Spark Summit 2017	EVENT	0.99+
first direction	QUANTITY	0.99+
One	QUANTITY	0.99+
James Bond	PERSON	0.98+
Spark	ORGANIZATION	0.98+
both	QUANTITY	0.98+
first	QUANTITY	0.98+
one	QUANTITY	0.98+
Tungsten	ORGANIZATION	0.98+
two focuses	QUANTITY	0.97+
three directions	QUANTITY	0.97+
one minute	QUANTITY	0.97+
one area	QUANTITY	0.96+
three	QUANTITY	0.96+
about five	QUANTITY	0.96+
DBIO	ORGANIZATION	0.96+
six years	QUANTITY	0.95+
one thing	QUANTITY	0.94+
over a hundred experiments	QUANTITY	0.94+
Oracle	ORGANIZATION	0.92+
Theano	TITLE	0.92+
single note	QUANTITY	0.91+
Intel	ORGANIZATION	0.91+
one route	QUANTITY	0.89+
theCUBE	ORGANIZATION	0.88+
Office	TITLE	0.87+
TensorFlow	TITLE	0.87+
S3	TITLE	0.87+
MXNet	TITLE	0.85+

Matei Zaharia, Databricks - #SparkSummit - #theCUBE

>> Narrator: Live from San Francisco, it's theCUBE. Covering Spark Summit 2017, brought to you by Databricks. (upbeat music) >> Welcome back to Spark Summit 2017, you're watching theCUBE and we have an honored guest here today. His name is Matei Zaharia, and Matei is the creator of Spark, Chief Technologist, and Co-Founder of Databricks. Did I get all that right? >> Yeah, thanks a lot for having me again. Excited to be here. >> Yeah Matei, we were watching your keynote this morning and we're all excited to hear about better support for deep learning, about some of the structured streaming apps now being in production. I want to ask you what happened after the keynote? What kind of feedback have you heard from people in the hallways? >> Yeah definitely, so the feedback has definitely been super positive. I think people really like the direction that we're moving in with Apache Spark and with these libraries, such as the Deep Learning Pipelines one. So we've gotten a lot of questions about the deep learning library, when will it support more types and so on. It's really good at supporting images right now. And also with streaming, I think people are just excited to try out the low latency streaming. >> Any other priorities people asked you about that maybe you haven't focused on yet? >> That I haven't focused on in the keynote, so I think that's a good question. I think overall some of the things we keep seeing are people just want to make it easier to operate Spark at scale and simplify things like monitoring and debugging and so on, so that's a constant theme that we're seeing. And then another thing that's generally been going on, I didn't focus on it this time, is increasing usage by Python and R users. So there's a lot of work in the latest release to continue improving that, to make it easier to use in those languages.
>> Okay, we were watching the demo, the impressive demos, this morning. In fact George was watching the keynote, he saw the one millisecond latency, he said wow. George, you want to ask a little more about that? >> So yeah, let's talk about, 'cause there's this rise of continuous apps, which I think you guys named. >> Matei: Yeah. >> And it resonates with everyone, to go along with batch and request-response. And in the past, people were saying, well, Spark was doing many micro-batches, latency was a couple hundred milliseconds. So now that you're down at one millisecond, what does that change in terms of the class of apps that you're appropriate for? Or, you know, some people have talked about complex event processing. Where is Spark on that now? >> Yeah definitely, so the goal of this is exactly to support the full range of latency possible, all the way down to sub-millisecond latency. And give users the same programming model for them, so they don't have to use a different system or a lower level programming model to get that low latency. And so basically since we began Structured Streaming, we tried to make sure the API is not tied in with micro-batching in any way. And so this is the next step, to actually eliminate that from the engine and be able to execute these computations. And what are the new applications? So I think this really enables two types of things we've seen. One is kind of automated decision making systems, so this would be something, it could be even on, say, a website or, you know, say when someone's applying for a loan or something like that, could be making decisions, but it could even be at an even lower latency, like say stock market style of place or internet of things, or like industrial monitoring, and making decisions there. That's one thing.
And then the other thing we see people doing is a lot of kind of stream-to-stream ETL, which is a bit more boring in some way, but as you set that up, it's nice to have these very low latency transformations that can produce new streams from an existing one, because then nothing downstream from them is affected in terms of latency. >> So in this last example, it's sort of to help build microservice type applications. >> Yeah, exactly, yeah. Well in general, there's this whole architecture of saying all my data will be streamed and then I'll have some applications that just produce a new stream. And then later that stuff can go into a data lake or into a real time system or whatever. So it's basically keeping it low latency while it remains in stream form. >> So we were talking earlier and we've been talking to the SnappyData folks and the Splice Machine folks. And they built Spark into a DBMS. So that like it's immutable. I'm sorry, mutable. >> Matei: Mutable, yeah. >> Like a data frame is updateable. So what does that make possible, even if you can do the same things with Spark without it? What does it make easier? >> So that's also in the same spirit of continuous applications. It's saying you should have a single programming model and interface for doing both your transactional work and your analytics after, and then maybe serving the results of the analytics. So that makes a lot of sense, and an example of that would be, you know, I keep going back to say the financial or credit card type of use cases, but it would be something where users are conducting transactions and maybe you learn stuff about them from that. You say okay, here's where they're located, now here's what they're purchasing, whatever. And then you also want to know, I'll have to make a decision. For example, do I allow them to go past the limit on their credit card or something like that. Or is this a normal use of it or is this a fraudulent one?
So that's where it helps to integrate these and you can do these things. So there are products like Snappy Data that integrate a specific database with Spark. And we're also trying to make sure in Spark, the API, so that people can integrate their own system, whatever database or key value store they want. >> So would you have to jump through hoops if you didn't want to integrate any other store other than talking to a file system, or? >> Yeah if you want to do these transactions on a file system, there will be basically some performance constraints to doing that. It depends on the rate; it's definitely the simplest thing, and if you have a low enough rate of updates it could actually be fine. But if you want more fine grained ones, then it becomes a problem. >> It would seem like if you tack on a product for ingest, not that you really want to get into that, think Kafka, which could also stretch into the transforms and some basic analytics. And you mentioned, I think on the Spark East keynote, Redis for serving, you've got like now a multi sort of vendor product stack. And so there's complexity to that. >> Matei: Yeah definitely yeah. >> Do you foresee a scenario where you could see that as a high volume solution and it's something that you would take ownership of? >> I see, so well, do you mean from the Apache Spark side or from the Databricks side? >> George: Actually either. >> Yeah so I think from the Spark side, basically so far the project doesn't provide storage, it just provides computation and it plugs into different storage engines. And so it would be kind of a big shift, it might be possible, but it would be kind of a big shift to say, okay, we'll also provide persistent storage. I think the more likely thing that will happen is better and better integrations with the most widely used open source storage systems. So Redis is one. Apache Kafka, there's a lot of work on integrating that better and so on. 
From the Databricks side, that is different because that is a fully managed cloud service and it definitely makes sense there that you'd have a turnkey solution for that. Right now, for people who want that, we can actually build it, sometimes with other vendors or with just services built into Amazon, but that makes a lot of sense. >> And Matei, something I read a press release on, but I didn't hear it in the keynote this morning. I hate to steal thunder from tomorrow, but can you give us a sneak preview on serverless apps? What's that about? >> Yeah, so this is actually we put out a press release today and we'll actually put out, well we'll have a full keynote tomorrow morning and also a lot more details on our website. So this is Databricks Serverless. It's basically a serverless platform for running Apache Spark and data science. So not to steal away too much thunder, but you know serverless computing is this idea of users can just submit a query or computation, they don't have to configure the hardware at all and they just get high performance and they get results. And so far it's been very successful with stateless workloads such as SQL or Amazon Lambda, which is you know just functions serving a webpage or something like that. So this is going to be the first offering that actually extends that model to data science and in general to Spark workloads. So you can have machine learning users, you can have these streaming applications, all these things, on that kind of environment. So yeah, we'll have a lot more detail on that tomorrow, it's something that we're excited about. >> I want to circle back to IoT apps. You know there's sort of, beyond an emerging consensus, that we're going to do a lot of training in the cloud 'cause we have access to big compute and lots of data. 
But then the issue on the edge, in the near to medium term, the footprint, like a lot of people are telling us high volume devices will have 3 megs of memory and a gateway server would have like two gigs and two cores. So can you carve Spark up into fitting on one of the... >> That's a good question, I think for that, it's again, the most likely way that would happen is through data sources. For example, there are these projects like Apache NiFi and other projects as well that let you build up a data pipeline from IoT devices all the way to the cloud. And you can imagine some computation through those. So I think, yeah I don't have a very concrete answer, I think here it is something that's coming up a bunch though, so we do want to support this type of like splitting the computation. >> But in terms of splitting the computation, you could take a trained model, model training is fat compute and then the trained model-- >> You can definitely push the model and do inference. >> Would that inference thing have to happen in a Spark run time or could it be somewhere? >> I think it could happen anywhere else also. And actually like we do see a lot of people wanting to export basically machine learning pipelines or models from Spark into another environment. So it can happen somewhere else too. Yeah and then the other aspect of it is also data collection. So if you can push something that says here is when the data is exciting, like when the data is interesting you should remember these and send them on. That would also help, because otherwise you know, say it's like a video camera or something, most of the time it's looking at nothing. I mean you don't want to send all that back. >> That's actually a key point, which is some folks like especially in the IT ops area where you know, training wheels for IoT 'cause they're doing machine learning on infrastructure. >> Matei: Yeah which is there. 
>> Yeah, they say oh, anything outside two standard deviations of the band of expectations, but there's more of an answer to that, I gather, from what you're saying. >> Yeah I mean I think you can create, for example, you can create a small machine learning model that decides whether what it's seeing is unusual and sends it back or you can even make it query specific, like you can count, like I want to find this type of object that's going by the camera. And try to find that. So I think there's a lot of room to improve that. >> Okay, well we have just a couple of minutes left here, want to draw into the future a little bit. And there's been some great progress since the summit last year to this one. What would you say is the next boundary that needs to be pushed to get Spark to the next level, whatever that may be? >> Yeah definitely yeah, well okay so again on the, so first of all in terms of the project today I think the big workloads that we are seeing come up all the time are deep learning and stream processing. These are the big emerging ones. I mean there's still a lot of data warehousing, ETL and so on, that's still there. But these are the new ones, so that's what we're focusing on on our team at least. And we'll continue building out the stuff that you saw announced today. I think beyond that, I do think that part of the problem and this is more on the Databricks side, part of the problem is also just making it much easier for teams or businesses to begin using these technologies at all. And that's where we think cloud computing or software as a service is the way because you just turn it on and you can immediately start doing things. But that's basically, the way that I view that, is right now the barrier to do any project with data science or machine learning, or even like simple kind of analytics on unstructured data, the barrier is really high. So companies can only do it on a few projects. 
There might be like 100 things they could be trying, but they can only afford to spend on two or three of them. So if you lower that barrier, there'll be a lot more of them and everyone will be able to quickly try one of these applications and see whether it actually works. >> And this ties into some of your graduate studies, like with model management and things like that? >> Yeah, so on the research side. So I'm also you know, doing research at Stanford and on that side we have this lab called DAWN, which is about usable machine learning. It's exactly these things. Like how do you enable an order of magnitude more people to try to do things with machine learning. So actually we're also doing the video push down thing I mentioned, that's one thing we're looking at. A bunch of other stuff as well. >> Matei we could talk to you all day, but we don't have all day. We're up against the break here, but I wanted to thank you very much for coming and sharing a few moments here and look forward to seeing you in the hallways here at Spark right? >> Yeah thanks again for having me. >> Thanks for joining us and thank you all for watching, here we are at theCUBE at Spark 2017, thanks for watching. (upbeat music)
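
The edge-side data collection idea from the conversation above (push something small to the device that decides when the data is interesting, and only send those readings on) can be sketched roughly as follows. This is a minimal illustration in plain Python using the "two standard deviations" rule of thumb George mentions; it is not Spark code, and the names `EdgeFilter` and `filter_stream`, the window size, the minimum sample count, and the threshold are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, stdev


class EdgeFilter:
    """Hypothetical edge-side filter (illustration only, not a Spark API).

    Keeps a rolling window of recent readings and flags a new reading as
    interesting when it falls more than `threshold` standard deviations
    away from the rolling mean.
    """

    def __init__(self, window=20, min_samples=5, threshold=2.0):
        self.history = deque(maxlen=window)  # most recent readings only
        self.min_samples = min_samples       # wait for a baseline first
        self.threshold = threshold           # e.g. two standard deviations

    def is_interesting(self, value):
        interesting = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            # sigma == 0 means every recent reading was identical; skip the
            # check rather than flag on a degenerate baseline.
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                interesting = True
        # Every reading, unusual or not, becomes part of the baseline.
        self.history.append(value)
        return interesting


def filter_stream(readings, window=20, min_samples=5, threshold=2.0):
    """Return only the readings an edge device would forward upstream."""
    edge_filter = EdgeFilter(window, min_samples, threshold)
    return [r for r in readings if edge_filter.is_interesting(r)]
```

A small learned model, as Matei suggests, could replace the fixed statistical rule inside `is_interesting` without changing the surrounding loop; the fixed rule is used here only because it is the simplest thing that shows the shape of the idea.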

Published Date : Jun 6 2017


ENTITIES

Entity	Category	Confidence
George	PERSON	0.99+
Matei	PERSON	0.99+
Matei Zaharia	PERSON	0.99+
one millisecond	QUANTITY	0.99+
two gigs	QUANTITY	0.99+
Databricks	ORGANIZATION	0.99+
3 megs	QUANTITY	0.99+
two cores	QUANTITY	0.99+
tomorrow morning	DATE	0.99+
three	QUANTITY	0.99+
today	DATE	0.99+
tomorrow	DATE	0.99+
100 things	QUANTITY	0.99+
Amazon	ORGANIZATION	0.99+
Python	TITLE	0.99+
Spark	TITLE	0.99+
last year	DATE	0.99+
two	QUANTITY	0.98+
San Francisco	LOCATION	0.98+
Spark Summit 2017	EVENT	0.98+
two types	QUANTITY	0.98+
Spark	ORGANIZATION	0.98+
One	QUANTITY	0.98+
both	QUANTITY	0.98+
Apache	ORGANIZATION	0.97+
Stanford	ORGANIZATION	0.97+
first offering	QUANTITY	0.97+
one thing	QUANTITY	0.96+
this morning	DATE	0.96+
couple hundred milliseconds	QUANTITY	0.95+
Lambda	TITLE	0.94+
Spark Summit2017	EVENT	0.93+
one	QUANTITY	0.89+
two standard	QUANTITY	0.87+
#theCUBE	ORGANIZATION	0.81+
single programming model	QUANTITY	0.8+
Databricks	PERSON	0.78+
R	TITLE	0.78+
Snappy Data	ORGANIZATION	0.77+
of minutes	QUANTITY	0.67+
first	QUANTITY	0.66+
Spark East	ORGANIZATION	0.63+
Kafka	TITLE	0.62+
Apache Spark	TITLE	0.61+
Sequel	TITLE	0.6+
Spark 2017	EVENT	0.58+
Narrator:	TITLE	0.57+
theCUBE	ORGANIZATION	0.56+
Redis	TITLE	0.55+
Redis	ORGANIZATION	0.5+
theCUBE	TITLE	0.46+
#SparkSummit	TITLE	0.35+

Ion Stoica, Databricks - Spark Summit East 2017 - #sparksummit - #theCUBE

>> [Announcer] Live from Boston, Massachusetts. This is theCUBE. Covering Spark Summit East 2017. Brought to you by Databricks. Now here are your hosts, Dave Vellante and George Gilbert. >> [Dave] Welcome back to Boston everybody, this is Spark Summit East #SparkSummit And this is theCUBE. Ion Stoica is here. He's Executive Chairman of Databricks and Professor of Computer Science at UC Berkeley. The smarts is rubbing off on me. I always feel smart when I co-host with George. And now having you on is just a pleasure, so thanks very much for taking the time. >> [Ion] Thank you for having me. >> So loved the talk this morning, we learned about RISELab, we're going to talk about that. Which is the son of AMP. You may be the father of those two, so. Again welcome. Give us the update, great keynote this morning. How's the vibe, how are you feeling? >> [Ion] I think it's great, you know, thank you and thank everyone for attending the summit. It's a lot of energy, a lot of interesting discussions, and a lot of ideas around. So I'm very happy about how things are going. >> [Dave] So let's start with RISELab. Maybe take us back, to those who don't understand, so the birth of AMP and what you were trying to achieve there and what's next. >> Yeah, so the AMP was a six-year project at Berkeley, and it involved around eight faculty and over the duration of the lab around 60 students and postdocs, And the mission of the AMPLab was to make sense of big data. AMPLab started in 2009, at the end of 2009, and the premise is that in order to make sense of this big data, we need a holistic approach, which involves algorithms, in particular machine-learning algorithms, machines, means systems, large-scale systems, and people, crowd sourcing. And more precisely the goal was to build a stack, a data analytic stack for interactive analytics, to be used across industry and academia. And, of course, being at Berkeley, it has to be open source. 
(laugh) So that's basically what was AMPLab and it was a birthplace for Apache Spark that's why you are all here today. And a few other open-source systems like Mesos, Apache Mesos, and Alluxio which was previously called Tachyon. And so AMPLab ended in December last year and in January, this January, we started a new lab which is called RISE. RISE stands for Real-time Intelligent Secure Execution. And the premise of the new lab is that actually the real value in the data is the decision you can make on the data. And you can see this more and more at almost every organization. They want to use their data to make some decision to improve their business processes, applications, services, or come up with new applications and services. But then if you think about that, what does it mean that the emphasis is on the decision? Then it means that you want the decision to be fast, because fast decisions are better than slower decisions. You want decisions to be on fresh data, on live data, because decisions on the data I have right now are better than decisions on the data from yesterday, or last week. And then you also want to make targeted, personalized decisions. Because the decisions on personal information are better than aggregate information. So that's the fundamental premise. So therefore you want to build platforms, tools and algorithms to enable intelligent real-time decisions on live data with strong security. And the security is a big emphasis of the lab because it means to provide privacy, confidentiality and integrity, and as you hear about data breaches or things like that every day. So for an organization, it is extremely important to provide privacy and confidentiality to their users and it's not only because the users want that, but it also indirectly can help them to improve their service. Because if I guarantee your data is confidential with me, you are probably much more willing to share some of your data with me. 
And if you share some of the data with me, I can build and provide better services. So that's basically in a nutshell what the lab is and what the focus is. >> [Dave] Okay, so you said three things: fast, live and targeted. So fast means you can affect the outcome. >> Yes. Live data means it's better quality. And then targeted means it's relevant. >> Yes. >> Okay, and then my question on security, I felt like when cloud and Big Data came to fore, security became a do-over. (laughter) Is that a fair assessment? Are you doing it over? >> [George] Or as Bill Clinton would call it, a Mulligan. >> Yeah, if you get a Mulligan on security. >> I think security is, it's always a difficult topic because it means so many things for so many people. >> Hmm-mmm. >> So there are instances and actually cloud is quite secure. Actually, cloud can be more secure than some on-prem deployments. In fact, if you hear about these data leaks or security breaches, you don't hear them happening in the cloud. And there is some reason for that, right? It is because they have trained people, you know, they are paranoid about this, they do a specification maybe much more often and things like that. But still, you know, the state of security is not that great. Right? For instance, if I compromise your operating system, whether it's in the cloud or not in the cloud, I can do anything. Right? Or your VM, right? On all this cloud you run on a VM. And now you are going to run on some containers. Right? So it's a lot of attacks, or there are attacks, sophisticated attacks, which means your data is encrypted, but if I can look at the access patterns, how much data you transferred, or how much data you access from memory, then I can infer something about what you are doing about your queries, right? If it's more data, maybe it's a query on New York. If it's less data it's probably maybe something smaller, like maybe something at Berkeley. 
So you can infer from multiple queries just looking at the access. So it's a difficult problem. But fortunately again, there are some new technologies which are developed and some new algorithms which give us some hope. One of the most interesting technologies which is happening today is hardware enclaves. So with hardware enclaves you can execute the code within this enclave which is hardware protected. And even if your operating system or VM is compromised, the code which runs in this enclave cannot be accessed. And Intel has Intel SGX and we are working and collaborating with them actively. ARM has TrustZone and AMD also announced they are going to have a similar technology in their chips. So that's kind of a very interesting and very promising development. I think the other aspect, it's a focus of the lab, is that even if you have the enclaves, it doesn't automatically solve the problem. Because the code itself may have a vulnerability. Yes, I can run the code in hardware enclave, but the code can send out >> Right. >> data outside. >> Right, the enclave is a more granular perimeter. Right? >> Yeah. So yeah, so we are looking, and the security experts in our lab are looking at this, maybe how to split the application so you run only a small part in the enclave, which is a critical part, and you can make sure that also the code is secure, and the rest of the code you run outside. But the rest of the code, it's only going to work on data which is encrypted. Right? So there is a lot of interesting research but that's good. >> And does Blockchain fit in there as well? >> Yeah, I think Blockchain it's a very interesting technology. And again it's real-time and the area is also very interesting directions. >> Yeah, right. >> Absolutely. >> So you guys, I want George, you've shared with me sort of what you were calling a new workload. So you had batch and you have interactive and now you've got continuous- >> Continuous, yes. 
>> And I know that's a topic that you want to discuss and I'd love to hear more about that. But George, tee it up. >> Well, okay. So we were talking earlier and the objective of RISE is fast and continuous-type decisions. And this is different from the traditional, you either do it batch or you do it interactive. So maybe tell us about some applications where that is one workload among the other traditional workloads. And then let's unpack that a little more. >> Yeah, so I'll give you a few applications. So it's more than continuously interacting with the environment; you also learn continuously. I'll give you some examples. So for instance in one example, think about you want to detect a network security attack, and respond and diagnose and defend in real time. So what this means is that you need to continuously get logs from the network and from the more endpoints you can get the better. Right? Because more data will help you to detect things faster. But then you need to detect the new pattern and you need to learn the new patterns. Because new security attacks, which are the ones that are effective, are slightly different from the past one because you hope that you already have the defense in place for the past ones. So now you are going to learn that and then you are going to react. You may push patches in real time. You may push filters, installing new filters to firewalls. So that's kind of one application that's going in real time. Another application can be about self driving. Now self driving has made tremendous strides. And a lot of algorithms you know, very smart algorithms now they are implemented on the cars. Right? All the system is on the cars. But imagine now that you want to continuously get the information from this car, aggregate and learn and then send back the information you learned to the cars. 
Like for instance if it's an accident or a roadblock, an object which is dropped on the highway, so you can learn from the other cars what they've done in that situation. It may mean in some cases the driver took an evasive action, right? Maybe you can monitor also the cars which are not self-driving, but driven by the humans. And then you learn that in real time and then the other cars which follow through the same, confronted with the same situation, they now know what to do. Right? So this is again, I want to emphasize this. Not only continuously sensing the environment, and making the decisions, but a very important component is learning. >> Let me take you back to the security example as I sort of process the auto one. >> Yeah, yeah. >> So in the security example, it doesn't sound like, I mean if you have a vast network, you know, end points, software, infrastructure, you're not going to have one God model looking out at everything. >> Yes. >> So I assume that means there are models distributed everywhere and they don't know what a new, necessarily but an entirely new attack pattern looks like. So in other words, for that isolated model, it doesn't know what it doesn't know. I don't know if that's what Rumsfeld called it. >> Yes (laughs). >> How does it know what to pass back for retraining? >> Yes. Yes. Yes. So there are many aspects and there are many things you can look at. And it's again, it's a research problem, so I cannot give you the solution now, I can hypothesize and give you some examples. But for instance, you can look about, and you correlate by observing the effect. Some of the effects of the attack are visible. In some cases, denial of service attack. That's pretty clear. Even the... and so forth, they maybe cause computers to crash, right? So once you see some of this kind of anomaly, right, anomalies on the end devices, end host and things like that. Maybe reported by humans, right? 
Then you can try to correlate with what kind of traffic you've got. Right? And from there, from that correlation, probably you can, and hopefully, you can develop some models to identify what kind of traffic. Where it comes from. What is the content, and so forth, which causes behavior, anomalous behavior. >> And where is that correlation happening? >> I think it will happen everywhere, right? Because- >> At the edge and at the center. >> Absolutely. >> And then I assume that it sounds like the models both at the edge and at the center are ensemble models. >> Yes. >> Because you're tracking different behavior. >> Yes. You are going to track different behavior and you are going to, I think that's a good hypothesis. And then you are going to assemble them, assemble to come up with the best decision. >> Okay, so now let's wind forward to the car example. >> Yeah. >> So it sounds like there's a mesh network, at least, Peter Levine's sort of talk was there's near-local compute resources and you can use bitcoin to pay for it or Blockchain or however it works. But that sort of topology, we haven't really encountered before in computing, have we? And how imminent is that sort of ... >> I think that some of the stuff you can do today in the cloud. I think if you want super-low latency probably you need to have more computation towards the edges, but if I'm thinking that I want kind of reactions on tens, hundreds of milliseconds, in theory you can do it today with the cloud infrastructure we have. And if you think about in many cases, if you can do it within a few hundreds of milliseconds, it's still super useful. Right? To avoid this object which has dropped on the highway. You know, if I have a few hundred milliseconds, many cars will effectively avoid that having that information. >> Let's have that conversation about the edge a little further. The one we were having off camera. 
So there's a debate in our community about how much data will stay at the edge, how much will go into the cloud, David Floyer said 90% of it will stay at the edge. Your comment was, it depends on the value. What do you mean by that? >> I think that that depends on who I am and how I perceive the value of the data. And, you know, what can be the value of the data? This is what I was saying. I think that value of the data is fundamentally what kind of decisions, what kind of actions it will enable me to take. Right? So here I'm not just talking about you know, credit card information or things like that, even exactly there is an action somebody's going to take on that. So if I do believe that the data can provide me with ability to take better actions or make better decisions I think that I want to keep it. And it's not, because why I want to keep it, because also it's not only the decision it enables me now, but everyone is going to continuously improve their algorithms. Develop new algorithms. And when you do that, how do you test them? You test on the old data. Right? So I think that for all these reasons, a lot of data, valuable data in this sense, is going to go to the cloud. Now, is there a lot of data that should remain on the edges? And I think that's fair. But it's, again, if a cloud provider, or someone who provides a service in the cloud, believes that the data is valuable. I do believe that eventually it is going to get to the cloud. >> So if it's valuable, it will be persisted and will eventually get to the cloud? And we talked about latency, but latency, the example of evasive action. You can't send the data back to the cloud and make the decision, you have to make it real time. But eventually that data, if it's important, will go back to the cloud. The other question of all this data that we are now processing on a continuous basis, how much actually will get persisted, most of it, much of it probably does not get persisted. Right? Is that a fair assumption? 
>> Yeah, I think so. And probably all the data is not equal. All right? It's like you want to maybe, even if you take a continuous video, all right? On the cars, they continuously have videos from multiple cameras and radar and lidar, all of this stuff. This is continuous. And if you think about this one, I would assume that you don't want to send all the data to the cloud. But the data around the interesting events, you may want to do, right? So before and after the car has a near-accident, or took an evasive action, or the human had to intervene. So in all these cases, probably I want to send the data to the cloud. But for the most cases, probably not. >> That's good. We have to leave it there, but I'll give you the last word on things that are exciting you, things you're working on, interesting projects. >> Yeah, so I think this is what really excites me is about how we are going to have this continuous application, you are going to continuously interact with the environment. You are going to continuously learn and improve. And here there are many challenges. And I just want to say a few more there, and which we haven't discussed. One, in general it's about explainability. Right? If these systems augment the human decision process, if these systems are going to make decisions which impact you as a human, you want to know why. Right? Like I gave this example, assuming you have machine-learning algorithms, you're making a diagnosis on your MRI, or x-ray. You want to know why. What is in this x-ray causes that decision? If you go to the doctor, they are going to point and show you. Okay, this is why you have this condition. So I think this is very important. Because as a human you want to understand. And you want to understand not only why the decision happens, but you want also to understand what you have to do, you want to understand what you need to do to do better in the future, right? Like if your mortgage application is turned down, I want to know why is that? 
Because next time when I apply for the mortgage, I want to have a higher chance to get it through. So I think that's a very important aspect. And the last thing I will say, and this is super important, is about having algorithms which can say I don't know. Right? It's like, okay I never have seen this situation in the past. So I don't know what to do. This is much better than giving you just the wrong decision. Right? >> Right, or a low probability that you don't know what to do with. (laughs) >> Yeah. >> Excellent. Ion, thanks again for coming on theCUBE. It was really a pleasure having you. >> Thanks for having me. >> You're welcome. All right, keep it right there everybody. George and I will be back to do our wrap right after this short break. This is theCUBE. We're live from Spark Summit East. Right back. (techno music)
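
Ion's closing point about algorithms that can say "I don't know", and Dave's follow-up about a low probability you cannot act on, can be illustrated with a very small sketch. This is only one simple reading of the idea: it abstains when the top class probability is below a cutoff. It does not address the harder case Ion describes, where the model has never seen the situation at all and may still report a confident score. The function name `predict_or_abstain`, the `UNKNOWN` marker, and the 0.8 cutoff are assumptions made for the example.

```python
UNKNOWN = "I don't know"  # hypothetical marker for an abstained decision


def predict_or_abstain(probabilities, min_confidence=0.8):
    """Return the most likely label, or UNKNOWN when confidence is too low.

    `probabilities` maps each candidate label to the model's estimated
    probability for it. Illustrative helper only; not part of any library.
    """
    if not probabilities:
        # No scores at all: there is nothing to base a decision on.
        return UNKNOWN
    # Pick the label with the highest estimated probability.
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < min_confidence:
        # Abstain instead of returning a decision the model barely supports.
        return UNKNOWN
    return label
```

For the mortgage example in the conversation, `predict_or_abstain({"approve": 0.55, "deny": 0.45})` abstains rather than denying on a near coin flip, which leaves room to route the case to a human reviewer.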

Published Date : Feb 8 2017


ENTITIES

Entity	Category	Confidence
David Floyer	PERSON	0.99+
George	PERSON	0.99+
George Gilbert	PERSON	0.99+
Dave Vellante	PERSON	0.99+
2009	DATE	0.99+
Peter Levine	PERSON	0.99+
Bill Clinton	PERSON	0.99+
New York	LOCATION	0.99+
90%	QUANTITY	0.99+
January	DATE	0.99+
AMD	ORGANIZATION	0.99+
last week	DATE	0.99+
Dave	PERSON	0.99+
yesterday	DATE	0.99+
Ion	PERSON	0.99+
ARM	ORGANIZATION	0.99+
Boston	LOCATION	0.99+
six-year	QUANTITY	0.99+
December last year	DATE	0.99+
Databricks	ORGANIZATION	0.99+
three things	QUANTITY	0.99+
Boston Massachusetts	LOCATION	0.99+
one example	QUANTITY	0.99+
two	QUANTITY	0.98+
UCal Berkeley	ORGANIZATION	0.98+
Berkeley	LOCATION	0.98+
AMPLab	ORGANIZATION	0.98+
Ion Stoica	PERSON	0.98+
tens, hundreds of milliseconds	QUANTITY	0.98+
today	DATE	0.97+
end of 2009	DATE	0.96+
Rumsfeld	PERSON	0.96+
Intel	ORGANIZATION	0.96+
Intell	ORGANIZATION	0.95+
both	QUANTITY	0.95+
One	QUANTITY	0.95+
AMP	ORGANIZATION	0.94+
TrustZone	ORGANIZATION	0.94+
Spark Summit East 2017	EVENT	0.93+
around 60 students	QUANTITY	0.93+
RISE	ORGANIZATION	0.93+
Sparks Summit East 2017	EVENT	0.92+
one	QUANTITY	0.89+
one workload	QUANTITY	0.88+
Spark Summit East	EVENT	0.87+
Apache Spark	ORGANIZATION	0.87+
around eight faculties	QUANTITY	0.86+
this January	DATE	0.86+
this morning	DATE	0.84+
Mulligan	ORGANIZATION	0.78+
few hundredths of milliseconds	QUANTITY	0.77+
Professor	PERSON	0.74+
God	PERSON	0.72+
theCUBE	ORGANIZATION	0.7+
few hundred milliseconds	QUANTITY	0.67+
SGX	COMMERCIAL_ITEM	0.64+
Mesos	ORGANIZATION	0.63+
one application	QUANTITY	0.63+
Apache Mesos	ORGANIZATION	0.62+
Alluxio	ORGANIZATION	0.62+
AMPLab	EVENT	0.59+
Tachyon	ORGANIZATION	0.59+
#SparkSummit	EVENT	0.57+

Robert Nishihara, Anyscale | AWS Startup Showcase S3 E1

(upbeat music) >> Hello everyone. Welcome to theCube's presentation of the "AWS Startup Showcase." The topic this episode is AI and machine learning, top startups building foundational model infrastructure. This is season three, episode one of the ongoing series covering exciting startups from the AWS ecosystem. And this time we're talking about AI and machine learning. I'm your host, John Furrier. I'm excited I'm joined today by Robert Nishihara, who's the co-founder and CEO of a hot startup called Anyscale. He's here to talk about Ray, the open source project, and Anyscale's infrastructure for foundation models as well. Robert, thank you for joining us today. >> Yeah, thanks so much as well. >> I've been following your company since the founding pre pandemic and you guys really had a great vision, scaled up, and are in a perfect position for this big wave that we all see with ChatGPT and OpenAI that's gone mainstream. Finally, AI has broken out through the ropes and now gone mainstream, so I think you guys are really well positioned. I'm looking forward to talking with you today. But before we get into it, introduce the core mission for Anyscale. Why do you guys exist? What is the North Star for Anyscale? >> Yeah, like you mentioned, there's a tremendous amount of excitement about AI right now. You know, I think a lot of us believe that AI can transform just every different industry. So one of the things that was clear to us when we started this company was that the amount of compute needed to do AI was just exploding. Like to actually succeed with AI, companies like OpenAI or Google or you know, these companies getting a lot of value from AI, were not just running these machine learning models on their laptops or on a single machine. They were scaling these applications across hundreds or thousands or more machines and GPUs and other resources in the Cloud. 
And so to actually succeed with AI, and this has been one of the biggest trends in computing, maybe the biggest trend in computing in, you know, in recent history, the amount of compute has been exploding. And so to actually succeed with that AI, to actually build these scalable applications and scale the AI applications, there's a tremendous software engineering lift to build the infrastructure to actually run these scalable applications. And that's very hard to do. So one of the reasons many AI projects and initiatives fail is that, or don't make it to production, is the need for this scale, the infrastructure lift, to actually make it happen. So our goal here with Anyscale and Ray, is to make that easy, is to make scalable computing easy. So that as a developer or as a business, if you want to do AI, if you want to get value out of AI, all you need to know is how to program on your laptop. Like, all you need to know is how to program in Python. And if you can do that, then you're good to go. Then you can do what companies like OpenAI or Google do and get value out of machine learning. >> That programming example of how easy it is with Python reminds me of the early days of Cloud, when infrastructure as code was talked about was, it was just code the infrastructure programmable. That's super important. That's what AI people wanted, first program AI. That's the new trend. And I want to understand, if you don't mind explaining, the relationship that Anyscale has to these foundational models and particular the large language models, also called LLMs, was seen with like OpenAI and ChatGPT. Before you get into the relationship that you have with them, can you explain why the hype around foundational models? Why are people going crazy over foundational models? What is it and why is it so important? 
>> Yeah, so foundational models and foundation models are incredibly important because they enable businesses and developers to get value out of machine learning, to use machine learning off the shelf with these large models that have been trained on tons of data and that are useful out of the box. And then, of course, you know, as a business or as a developer, you can take those foundational models and repurpose them or fine tune them or adapt them to your specific use case and what you want to achieve. But it's much easier to do that than to train them from scratch. And I think there are three, for people to actually use foundation models, there are three main types of workloads or problems that need to be solved. One is training these foundation models in the first place, like actually creating them. The second is fine tuning them and adapting them to your use case. And the third is serving them and actually deploying them. Okay, so Ray and Anyscale are used for all of these three different workloads. Companies like OpenAI or Cohere that train large language models, or open source versions like GPT-J, those are done on top of Ray. There are many startups and other businesses that fine tune, that, you know, don't want to train the large underlying foundation models, but that do want to fine tune them, do want to adapt them to their purposes, and build products around them and serve them, those are also using Ray and Anyscale for that fine tuning and that serving. And so the reason that Ray and Anyscale are important here is that, you know, building and using foundation models requires a huge scale. It requires a lot of data. It requires a lot of compute, GPUs, TPUs, other resources. And to actually take advantage of that and actually build these scalable applications, there's a lot of infrastructure that needs to happen under the hood. And so you can either use Ray and Anyscale to take care of that and manage the infrastructure and solve those infrastructure problems. 
Or you can build the infrastructure and manage the infrastructure yourself, which you can do, but it's going to slow your team down. It's going to, you know, many of the businesses we work with simply don't want to be in the business of managing infrastructure and building infrastructure. They want to focus on product development and move faster. >> I know you got a keynote presentation we're going to go to in a second, but I think you hit on something I think is the real tipping point, doing it yourself, hard to do. These are things where opportunities are and the Cloud did that with data centers. Turned a data center and made it an API. The heavy lifting went away and went to the Cloud so people could be more creative and build their product. In this case, build their creativity. Is that kind of what's the big deal? Is that kind of a big deal happening that you guys are taking the learnings and making that available so people don't have to do that? >> That's exactly right. So today, if you want to succeed with AI, if you want to use AI in your business, infrastructure work is on the critical path for doing that. To do AI, you have to build infrastructure. You have to figure out how to scale your applications. That's going to change. We're going to get to the point, and you know, with Ray and Anyscale, we're going to remove the infrastructure from the critical path so that as a developer or as a business, all you need to focus on is your application logic, what you want the program to do, what you want your application to do, how you want the AI to actually interface with the rest of your product. Now the way that will happen is that Ray and Anyscale will still, the infrastructure work will still happen. It'll just be under the hood and taken care of by Ray and Anyscale. And so I think something like this is really necessary for AI to reach its potential, for AI to have the impact and the reach that we think it will, you have to make it easier to do. 
>> And just for clarification to point out, if you don't mind explaining the relationship of Ray and Anyscale real quick just before we get into the presentation. >> So Ray is an open source project. We created it. We were at Berkeley doing machine learning. We started Ray in order to provide an easy, a simple open source tool for building and running scalable applications. And Anyscale is the managed version of Ray, basically we will run Ray for you in the Cloud, provide a lot of tools around the developer experience and managing the infrastructure and providing more performance and superior infrastructure. >> Awesome. I know you got a presentation on Ray and Anyscale and you guys are positioning as the infrastructure for foundational models. So I'll let you take it away and then when you're done presenting, we'll come back, I'll probably grill you with a few questions and then we'll close it out so take it away. >> Robert: Sounds great. So I'll say a little bit about how companies are using Ray and Anyscale for foundation models. The first thing I want to mention is just why we're doing this in the first place. And the underlying observation, the underlying trend here, and this is a plot from OpenAI, is that the amount of compute needed to do machine learning has been exploding. It's been growing at something like 35 times every 18 months. This is absolutely enormous. And other people have written papers measuring this trend and you get different numbers. But the point is, no matter how you slice and dice it, it's an astronomical rate. Now if you compare that to something we're all familiar with, like Moore's Law, which says that, you know, the processor performance doubles every roughly 18 months, you can see that there's just a tremendous gap between the needs, the compute needs of machine learning applications, and what you can do with a single chip, right. 
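The gap described above can be worked through with a few lines of arithmetic. This is only a sketch that takes the two rates quoted in the talk at face value (compute demand growing about 35 times every 18 months, single-chip performance doubling every 18 months); the function names are made up for illustration.

```python
# Rates as quoted in the talk; both are rough, spoken figures.
ML_DEMAND_MULTIPLE = 35.0   # growth in ML compute demand per 18-month period
MOORE_MULTIPLE = 2.0        # growth in single-chip performance per 18-month period


def growth(multiple_per_period: float, periods: int) -> float:
    """Compound growth after a whole number of 18-month periods."""
    return multiple_per_period ** periods


def demand_vs_chip_gap(periods: int) -> float:
    """How many times faster demand has grown than a single chip."""
    return growth(ML_DEMAND_MULTIPLE, periods) / growth(MOORE_MULTIPLE, periods)


if __name__ == "__main__":
    for periods in (1, 2, 4):
        months = periods * 18
        print(f"after {months} months the gap is {demand_vs_chip_gap(periods):,.2f}x")
```

Under those assumed rates, the shortfall compounds every period, which is the argument for spreading the work across many machines rather than waiting for faster chips.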
So even if Moore's Law were continuing strong and you know, doing what it used to be doing, even if that were the case, there would still be a tremendous gap between what you can do with the chip and what you need in order to do machine learning. And so given this graph, what we've seen, and what has been clear to us since we started this company, is that doing AI requires scaling. There's no way around it. It's not a nice to have, it's really a requirement. And so that led us to start Ray, which is the open source project that we started to make it easy to build these scalable Python applications and scalable machine learning applications. And since we started the project, it's been adopted by a tremendous number of companies. Companies like OpenAI, which use Ray to train their large models like ChatGPT, companies like Uber, which run all of their deep learning and classical machine learning on top of Ray, companies like Shopify or Spotify or Instacart or Lyft or Netflix, ByteDance, which use Ray for their machine learning infrastructure. Companies like Ant Group, which makes Alipay, you know, they use Ray across the board for fraud detection, for online learning, for detecting money laundering, you know, for graph processing, stream processing. Companies like Amazon, you know, run Ray at a tremendous scale and just petabytes of data every single day. And so the project has seen just enormous adoption since, over the past few years. And one of the most exciting use cases is really providing the infrastructure for building training, fine tuning, and serving foundation models. So I'll say a little bit about, you know, here are some examples of companies using Ray for foundation models. Cohere trains large language models. OpenAI also trains large language models. You can think about the workloads required there are things like supervised pre-training, also reinforcement learning from human feedback. 
So this is not only the regular supervised learning, but actually more complex reinforcement learning workloads that take human input about what response to a particular question, you know, is better than a certain other response. And incorporating that into the learning. There's open source versions as well, like GPT-J, also built on top of Ray, as well as projects like Alpa coming out of UC Berkeley. So these are some of the examples of exciting projects and organizations training and creating these large language models and serving them using Ray. Okay, so what actually is Ray? Well, there are two layers to Ray. At the lowest level, there's the core Ray system. This is essentially low level primitives for building scalable Python applications. Things like taking a Python function or a Python class and executing them in the cluster setting. So Ray core is extremely flexible and you can build arbitrary scalable applications on top of Ray. So on top of Ray, on top of the core system, what really gives Ray a lot of its power is this ecosystem of scalable libraries. So on top of the core system you have libraries, scalable libraries for ingesting and pre-processing data, for training your models, for fine tuning those models, for hyper parameter tuning, for doing batch processing and batch inference, for doing model serving and deployment, right. And a lot of the Ray users, the reason they like Ray is that they want to run multiple workloads. They want to train and serve their models, right. They want to load their data and feed that into training. And Ray provides common infrastructure for all of these different workloads. So this is a little overview of the different components of Ray. So why do people choose to go with Ray? I think there are three main reasons. The first is the unified nature. The fact that it is common infrastructure for scaling arbitrary workloads, from data ingest to pre-processing to training to inference and serving, right. 
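The core idea mentioned above, taking an ordinary Python function and running many calls to it in parallel, can be sketched with the standard library alone. To be clear about the assumptions: this is not Ray's API and it runs on one machine with threads; it only illustrates the submit-then-collect pattern. In Ray itself the analogous pieces are the `@ray.remote` decorator and `ray.get`, which schedule the calls across a cluster instead. The helper name `run_parallel` is invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def square(x: int) -> int:
    """An ordinary Python function; nothing about it is cluster-specific."""
    return x * x


def run_parallel(func: Callable[[int], int], inputs: Iterable[int],
                 max_workers: int = 4) -> List[int]:
    """Submit one task per input, then collect results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, x) for x in inputs]   # schedule the work
        return [future.result() for future in futures]     # wait and gather


results = run_parallel(square, range(5))

if __name__ == "__main__":
    print(results)
```

The point of the pattern is that the function body stays the same whether it runs once locally or thousands of times elsewhere; only the way it is invoked changes.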
This also includes the fact that it's future proof. AI is incredibly fast moving. And so many people, many companies that have built their own machine learning infrastructure and standardized on particular workflows for doing machine learning have found that their workflows are too rigid to enable new capabilities. If they want to do reinforcement learning, if they want to use graph neural networks, they don't have a way of doing that with their standard tooling. And so Ray, being future proof and being flexible and general gives them that ability. Another reason people choose Ray and Anyscale is the scalability. This is really our bread and butter. This is the reason, the whole point of Ray, you know, making it easy to go from your laptop to running on thousands of GPUs, making it easy to scale your development workloads and run them in production, making it easy to scale, you know, training to scale data ingest, pre-processing and so on. So scalability and performance, you know, are critical for doing machine learning and that is something that Ray provides out of the box. And lastly, Ray is an open ecosystem. You can run it anywhere. You can run it on any Cloud provider. Google, you know, Google Cloud, AWS, Azure. You can run it on your Kubernetes cluster. You can run it on your laptop. It's extremely portable. And not only that, it's framework agnostic. You can use Ray to scale arbitrary Python workloads. You can use it to scale and it integrates with libraries like TensorFlow or PyTorch or JAX or XGBoost or Hugging Face or PyTorch Lightning, right, or Scikit-learn or just your own arbitrary Python code. It's open source. And in addition to integrating with the rest of the machine learning ecosystem and these machine learning frameworks, you can use Ray along with all of the other tooling in the machine learning ecosystem. That's things like Weights & Biases or MLflow, right. 
Or you know, different data platforms like Databricks, you know, Delta Lake or Snowflake or tools for model monitoring, for feature stores, all of these integrate with Ray. And that's, you know, Ray provides that kind of flexibility so that you can integrate it into the rest of your workflow. And then Anyscale is the scalable compute platform that's built on top, you know, that provides Ray. So Anyscale is a managed Ray service that runs in the Cloud. And what Anyscale does is it offers the best way to run Ray. And if you think about what you get with Anyscale, there are fundamentally two things. One is about moving faster, accelerating the time to market. And you get that by having the managed service so that as a developer you don't have to worry about managing infrastructure, you don't have to worry about configuring infrastructure. You also, it provides, you know, optimized developer workflows. Things like easily moving from development to production, things like having the observability tooling, the debuggability to actually easily diagnose what's going wrong in a distributed application. So things like the dashboards and the other kinds of tooling for collaboration, for monitoring and so on. And then on top of that, so that's the first bucket, developer productivity, moving faster, faster experimentation and iteration. The second reason that people choose Anyscale is superior infrastructure. So this is things like, you know, cost efficiency, being able to easily take advantage of spot instances, being able to get higher GPU utilization, things like faster cluster startup times and auto scaling. Things like just overall better performance and faster scheduling. And so these are the kinds of things that Anyscale provides on top of Ray. It's the managed infrastructure. It's fast, it's like the developer productivity and velocity as well as performance. So this is what I wanted to share about Ray and Anyscale. >> John: Awesome. >> Provide that context. 
But John, I'm curious what you think. >> I love it. I love the, so first of all, it's a platform because that's the platform architecture right there. So just to clarify, this is an Anyscale platform, not- >> That's right. >> Tools. So you got tools in the platform. Okay, that's key. Love that managed service. Just curious, you mentioned Python multiple times, is that because of PyTorch and TensorFlow or Python's the most friendly with machine learning or it's because it's very common amongst all developers? >> That's a great question. Python is the language that people are using to do machine learning. So it's the natural starting point. Now, of course, Ray is actually designed in a language agnostic way and there are companies out there that use Ray to build scalable Java applications. But for the most part right now we're focused on Python and being the best way to build these scalable Python and machine learning applications. But, of course, down the road there always is that potential. >> So if you're slinging Python code out there and you're watching this video, get on the Anyscale bus quickly. Also, I just, while you were giving the presentation, I couldn't help, since you mentioned OpenAI, which by the way, congratulations 'cause they've had great scale, I've noticed in their rapid growth 'cause they were the fastest company to the number of users than anyone in the history of the computer industry, so major success for OpenAI and ChatGPT, huge fan. I'm not a skeptic at all. I think it's just the beginning, so congratulations. But I actually typed into ChatGPT, what are the top three benefits of Anyscale and came up with scalability, flexibility, and ease of use. Obviously, scalability is what you guys are called. >> That's pretty good. >> So that's what they came up with. So they nailed it. Did you have an inside prompt training, buy it there? Only kidding. (Robert laughs) >> Yeah, we hard coded that one. 
>> But that's the kind of thing that came up really, really quickly if I asked it to write a sales document, it probably will, but this is the future interface. This is why people are getting excited about the foundational models and the large language models because it's allowing the interface with the user, the consumer, to be more human, more natural. And this is clearly will be in every application in the future. >> Absolutely. This is how people are going to interface with software, how they're going to interface with products in the future. It's not just something, you know, not just a chat bot that you talk to. This is going to be how you get things done, right. How you use your web browser or how you use, you know, how you use Photoshop or how you use other products. Like you're not going to spend hours learning all the APIs and how to use them. You're going to talk to it and tell it what you want it to do. And of course, you know, if it doesn't understand it, it's going to ask clarifying questions. You're going to have a conversation and then it'll figure it out. >> This is going to be one of those things, we're going to look back at this time Robert and saying, "Yeah, from that company, that was the beginning of that wave." And just like AWS and Cloud Computing, the folks who got in early really were in position when say the pandemic came. So getting in early is a good thing and that's what everyone's talking about is getting in early and playing around, maybe replatforming or even picking one or few apps to refactor with some staff and managed services. So people are definitely jumping in. So I have to ask you the ROI cost question. You mentioned some of those, Moore's Law versus what's going on in the industry. When you look at that kind of scale, the first thing that jumps out at people is, "Okay, I love it. Let's go play around." But what's it going to cost me? Am I going to be tied to certain GPUs? 
What's the landscape look like from an operational standpoint, from the customer? Are they locked in and the benefit was flexibility, are you flexible to handle any Cloud? What is the customers, what are they looking at? Basically, that's my question. What's the customer looking at? >> Cost is super important here and many of the companies, I mean, companies are spending a huge amount on their Cloud computing, on AWS, and on doing AI, right. And I think a lot of the advantage of Anyscale, what we can provide here is not only better performance, but cost efficiency. Because if we can run something faster and more efficiently, it can also use less resources and you can lower your Cloud spending, right. We've seen companies go from, you know, 20% GPU utilization with their current setup and the current tools they're using to running on Anyscale and getting more like 95, you know, 100% GPU utilization. That's something like a five x improvement right there. So depending on the kind of application you're running, you know, it's a significant cost savings. We've seen companies that have, you know, processing petabytes of data every single day with Ray going from, you know, getting order of magnitude cost savings by switching from what they were previously doing to running their application on Ray. And when you have applications that are spending, you know, potentially $100 million a year and getting a 10 X cost savings is just absolutely enormous. So these are some of the kinds of- >> Data infrastructure is super important. Again, if the customer, if you're a prospect to this and thinking about going in here, just like the Cloud, you got infrastructure, you got the platform, you got SaaS, same kind of thing's going to go on in AI. So I want to get into that, you know, ROI discussion and some of the impact with your customers that are leveraging the platform. But first I hear you got a demo. >> Robert: Yeah, so let me show you, let me give you a quick run through here. 
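The utilization figures quoted above (roughly 20% GPU utilization going to 95 or 100%) translate into cost per unit of useful work with simple division. The sketch below assumes a flat hourly GPU price, and the $3.00 figure in it is a made-up placeholder, not a number from the talk or any price list.

```python
def cost_per_useful_gpu_hour(hourly_price: float, utilization: float) -> float:
    """Price paid for one hour of actually-used GPU time."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / utilization


def improvement(before: float, after: float) -> float:
    """How many times more useful work each paid GPU hour delivers."""
    return after / before


HYPOTHETICAL_PRICE = 3.00  # dollars per GPU hour, illustrative only

if __name__ == "__main__":
    for utilization in (0.20, 0.95, 1.00):
        cost = cost_per_useful_gpu_hour(HYPOTHETICAL_PRICE, utilization)
        print(f"{utilization:.0%} utilization -> ${cost:.2f} per useful GPU hour")
    print(f"20% -> 100% is a {improvement(0.20, 1.00):.1f}x improvement")
```

Going from 20% to full utilization is where the "five x" in the conversation comes from; at 95% the same calculation gives a little under five.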
So what I have open here is the Anyscale UI. I've started a little Anyscale Workspace. So Workspaces are the Anyscale concept for interactive development, right. So here, imagine I'm just, you want to have a familiar experience like you're developing on your laptop. And here I have a terminal. It's not on my laptop. It's actually in the cloud running on Anyscale. And I'm just going to kick this off. This is going to train a large language model, so OPT. And it's doing this on 32 GPUs. We've got a cluster here with a bunch of CPU cores, bunch of memory. And as that's running, and by the way, if I wanted to run this on instead of 32 GPUs, 64, 128, this is just a one line change when I launch the Workspace. And what I can do is I can pull up VS Code, right. Remember this is the interactive development experience. I can look at the actual code. Here it's using Ray Train to train the torch model. We've got the training loop and we're saying that each worker gets access to one GPU and four CPU cores. And, of course, as I make the model larger, this is using DeepSpeed, as I make the model larger, I could increase the number of GPUs that each worker gets access to, right. And how that is distributed across the cluster. And if I wanted to run on CPUs instead of GPUs or a different, you know, accelerator type, again, this is just a one line change. And here we're using Ray Train to train the models, just taking my vanilla PyTorch model using Hugging Face and then scaling that across a bunch of GPUs. And, of course, if I want to look at the dashboard, I can go to the Ray dashboard. There are a bunch of different visualizations I can look at. I can look at the GPU utilization. I can look at, you know, the CPU utilization here where I think we're currently loading the model and running that actual application to start the training. And some of the things that are really convenient here about Anyscale, both I can get that interactive development experience with VS Code. 
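The per-worker resource description in the demo (one GPU and four CPU cores per worker, 32 GPUs in total) and the "one line change" to scale up can be pictured with a small data class. This is a hypothetical stand-in written for illustration; `WorkerLayout` is not Ray's or Anyscale's API (Ray Train has its own scaling configuration object), and the count of 32 workers is inferred from 32 GPUs at one GPU each.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorkerLayout:
    """Hypothetical description of how a training job is spread out."""
    num_workers: int
    gpus_per_worker: int = 1   # from the demo: one GPU per worker
    cpus_per_worker: int = 4   # from the demo: four CPU cores per worker

    def total_gpus(self) -> int:
        return self.num_workers * self.gpus_per_worker

    def total_cpus(self) -> int:
        return self.num_workers * self.cpus_per_worker


demo = WorkerLayout(num_workers=32)        # assumed: 32 workers -> 32 GPUs
bigger = replace(demo, num_workers=64)     # the "one line change"

if __name__ == "__main__":
    for layout in (demo, bigger):
        print(layout, "->", layout.total_gpus(), "GPUs,", layout.total_cpus(), "CPU cores")
```

Keeping the layout in one declarative object, separate from the training loop, is what lets the scale change without touching the model code.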
You know, I can look at the dashboards. I can monitor what's going on. It feels, I have a terminal, it feels like my laptop, but it's actually running on a large cluster. And I can, with however many GPUs or other resources that I want. And so it's really trying to combine the best of having the familiar experience of programming on your laptop, but with the benefits, you know, being able to take advantage of all the resources in the Cloud to scale. And it's like when, you know, you're talking about cost efficiency. One of the biggest reasons that people waste money, one of the silly reasons for wasting money is just forgetting to turn off your GPUs. And what you can do here is, of course, things will auto terminate if they're idle. But imagine you go to sleep, I have this big cluster. You can turn it off, shut off the cluster, come back tomorrow, restart the Workspace, and you know, your big cluster is back up and all of your code changes are still there. All of your local file edits. It's like you just closed your laptop and came back and opened it up again. And so this is the kind of experience we want to provide for our users. So that's what I wanted to share with you. >> Well, I think that whole, couple of things, lines of code change, single line of code change, that's game changing. And then the cost thing, I mean human error is a big deal. People pass out at their computer. They've been coding all night or they just forget about it. I mean, and then it's just like leaving the lights on or your water running in your house. It's just, at the scale that it is, the numbers will add up. That's a huge deal. So I think, you know, compute back in the old days, there's no compute. Okay, it's just compute sitting there idle. But you know, data cranking the models is doing, that's a big point. 
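The idle-cluster point above comes down to two small calculations: deciding when a cluster has been idle long enough to shut off, and what it costs if nobody does. The sketch below is an assumption-laden illustration, not Anyscale's actual auto-termination logic; the timeout, the GPU price, and the overnight scenario are all invented numbers.

```python
def should_terminate(last_activity_s: float, now_s: float,
                     idle_timeout_s: float) -> bool:
    """True once the cluster has been idle for at least the timeout."""
    return (now_s - last_activity_s) >= idle_timeout_s


def idle_cost(idle_hours: float, num_gpus: int, price_per_gpu_hour: float) -> float:
    """Money spent on GPUs that were running but doing nothing."""
    return idle_hours * num_gpus * price_per_gpu_hour


if __name__ == "__main__":
    # Hypothetical: a 32-GPU cluster left on for 8 hours at $3.00 per GPU hour.
    print(f"overnight idle cost: ${idle_cost(8, 32, 3.00):,.2f}")
    # Hypothetical: 30-minute idle timeout, last activity 45 minutes ago.
    print("terminate now?", should_terminate(0, 45 * 60, 30 * 60))
```

Even with modest placeholder prices, one forgotten night on a large cluster adds up quickly, which is why automatic shutdown on idle matters more as clusters grow.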
>> Another thing I want to add there about cost efficiency is that we make it really easy to use, if you're running on Anyscale, to use spot instances and these preemptable instances that can just be significantly cheaper than the on-demand instances. And so when we see our customers go from what they're doing before to using Anyscale and they go from not using these spot instances 'cause they don't have the infrastructure around it, the fault tolerance to handle the preemption and things like that, to being able to just check a box and use spot instances and save a bunch of money. >> You know, this was my whole, my feature article at re:Invent last year when I met with Adam Selipsky, this next gen Cloud is here. I mean, it's not auto scale, it's infrastructure scale. It's agility. It's flexibility. I think this is where the world needs to go. Almost what DevOps did for Cloud and what you were showing me that demo had this whole SRE vibe. And remember Google had site reliability engineers to manage all those servers. This is kind of like an SRE vibe for data at scale. I mean, a similar kind of order of magnitude. I mean, I might be a little bit off base there, but how would you explain it? >> It's a nice analogy. I mean, what we are trying to do here is get to the point where developers don't think about infrastructure. Where developers only think about their application logic. And where businesses can do AI, can succeed with AI, and build these scalable applications, but they don't have to build, you know, an infrastructure team. They don't have to develop that expertise. They don't have to invest years in building their internal machine learning infrastructure. They can just focus on the Python code, on their application logic, and run the stuff out of the box. >> Awesome. Well, I appreciate the time. Before we wrap up here, give a plug for the company. I know you got a couple websites. Again, go, Ray's got its own website. You got Anyscale. 
You got an event coming up. Give a plug for the company looking to hire. Put a plug in for the company. >> Yeah, absolutely. Thank you. So first of all, you know, we think AI is really going to transform every industry and the opportunity is there, right. We can be the infrastructure that enables all of that to happen, that makes it easy for companies to succeed with AI, and get value out of AI. Now we have, if you're interested in learning more about Ray, Ray has been emerging as the standard way to build scalable applications. Our adoption has been exploding. I mentioned companies like OpenAI using Ray to train their models. But really across the board companies like Netflix and Cruise and Instacart and Lyft and Uber, you know, just among tech companies. It's across every industry. You know, gaming companies, agriculture, you know, farming, robotics, drug discovery, you know, FinTech, we see it across the board. And all of these companies can get value out of AI, can really use AI to improve their businesses. So if you're interested in learning more about Ray and Anyscale, we have our Ray Summit coming up in September. This is going to highlight a lot of the most impressive use cases and stories across the industry. And if your business, if you want to use LLMs, you want to train these LLMs, these large language models, you want to fine tune them with your data, you want to deploy them, serve them, and build applications and products around them, give us a call, talk to us. You know, we can really take the infrastructure piece, you know, off the critical path and make that easy for you. So that's what I would say. And, you know, like you mentioned, we're hiring across the board, you know, engineering, product, go-to-market, and it's an exciting time. >> Robert Nishihara, co-founder and CEO of Anyscale, congratulations on a great company you've built and continuing to iterate on and you got growth ahead of you, you got a tailwind. I mean, the AI wave is here. 
I think OpenAI and ChatGPT, a customer of yours, have really opened up the mainstream visibility into this new generation of applications, user interface, role of data, large scale, how to make that programmable, so we're going to need that infrastructure. So thanks for coming on this season three, episode one of the ongoing series of the hot startups. In this case, this episode is the top startups building foundational model infrastructure for AI and ML. I'm John Furrier, your host. Thanks for watching. (upbeat music)

Published Date : Mar 9 2023

ENTITIES

Entity	Category	Confidence
Robert Nishihara	PERSON	0.99+
John	PERSON	0.99+
Robert	PERSON	0.99+
John Furrier	PERSON	0.99+
Netflix	ORGANIZATION	0.99+
35 times	QUANTITY	0.99+
Amazon	ORGANIZATION	0.99+
$100 million	QUANTITY	0.99+
Uber	ORGANIZATION	0.99+
AWS	ORGANIZATION	0.99+
100%	QUANTITY	0.99+
Google	ORGANIZATION	0.99+
Ant Group	ORGANIZATION	0.99+
first	QUANTITY	0.99+
Python	TITLE	0.99+
20%	QUANTITY	0.99+
32 GPUs	QUANTITY	0.99+
Lyft	ORGANIZATION	0.99+
hundreds	QUANTITY	0.99+
tomorrow	DATE	0.99+
Anyscale	ORGANIZATION	0.99+
three	QUANTITY	0.99+
128	QUANTITY	0.99+
September	DATE	0.99+
today	DATE	0.99+
Moore's Law	TITLE	0.99+
Adam Selipsky	PERSON	0.99+
PyTorch	TITLE	0.99+
Ray	ORGANIZATION	0.99+
second reason	QUANTITY	0.99+
64	QUANTITY	0.99+
each worker	QUANTITY	0.99+
each worker	QUANTITY	0.99+
Photoshop	TITLE	0.99+
UC Berkeley	ORGANIZATION	0.99+
Java	TITLE	0.99+
Shopify	ORGANIZATION	0.99+
OpenAI	ORGANIZATION	0.99+
Anyscale	PERSON	0.99+
third	QUANTITY	0.99+
two things	QUANTITY	0.99+
ByteDance	ORGANIZATION	0.99+
Spotify	ORGANIZATION	0.99+
One	QUANTITY	0.99+
95	QUANTITY	0.99+
Azure	ORGANIZATION	0.98+
one line	QUANTITY	0.98+
one GPU	QUANTITY	0.98+
ChatGPT	TITLE	0.98+
TensorFlow	TITLE	0.98+
last year	DATE	0.98+
first bucket	QUANTITY	0.98+
both	QUANTITY	0.98+
two layers	QUANTITY	0.98+
Cohere	ORGANIZATION	0.98+
Alipay	ORGANIZATION	0.98+
Ray	PERSON	0.97+
one	QUANTITY	0.97+
Instacart	ORGANIZATION	0.97+

Breaking Analysis: MWC 2023 goes beyond consumer & deep into enterprise tech

>> From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR, this is Breaking Analysis with Dave Vellante. >> While never really meant to be a consumer tech event, the rapid ascendancy of smartphones sucked much of the air out of Mobile World Congress over the years, now MWC. And while the device manufacturers continue to have a major presence at the show, the maturity of intelligent devices, longer life cycles, and the disaggregation of the network stack have put enterprise technologies front and center in the telco business. Semiconductor manufacturers, network equipment players, infrastructure companies, cloud vendors, software providers, and a spate of startups are eyeing the trillion dollar plus communications industry as one of the next big things to watch this decade. Hello, and welcome to this week's Wikibon CUBE Insights, powered by ETR. In this Breaking Analysis, we bring you part two of our ongoing coverage of MWC '23, with some new data on enterprise players specifically in large telco environments, a brief glimpse at some of the pre-announcement news and corresponding themes ahead of MWC, and some of the key announcement areas we'll be watching at the show on theCUBE. Now, last week we shared some ETR data that showed how traditional enterprise tech players were performing, specifically within the telecoms vertical. Here's a new look at that data from ETR, which isolates the same companies, but cuts the data for what ETR calls large telco. The N in this cut is 196, down from 288 last week when we included all company sizes in the dataset. Now remember the two dimensions here, on the y-axis is net score, or spending momentum, and on the x-axis is pervasiveness in the data set. The table insert in the upper left informs how the dots and companies are plotted, and that red dotted line, the horizontal line at 40%, that indicates a highly elevated net score. 
Now while the data are not dramatically different in terms of relative positioning, there are a couple of changes at the margin. So just going down the list and focusing on net score. Azure is comparable, but slightly lower in large telco than it was overall. Google Cloud comes in at number two, and basically swapped places with AWS, which drops slightly in the large telco relative to overall telco. Snowflake is also slightly down by one percentage point, but maintains its position. Remember Snowflake, overall, its net score is much, much higher when measuring across all verticals. Snowflake comes down in telco, and relative to overall, a little bit down in large telco, but it's making some moves to attack this market that we'll talk about in a moment. Next are Red Hat OpenStack and Databricks. About the same in large telco as they were in overall telco. Next there's Dell, which has a big presence at MWC and is getting serious about driving 16G adoption, and new servers, and edge servers, and other partnerships. Cisco and Red Hat OpenShift basically swapped spots when moving from all telco to large telco, as Cisco drops and Red Hat bumps up a bit. And VMware dropped about four percentage points in large telco. Accenture moved up dramatically, about nine percentage points in large telco relative to all telco. HPE dropped a couple of percentage points. Oracle stayed about the same. And IBM surprisingly dropped by about five points. So look, I understand not a ton of change in terms of spending momentum in the large sector versus telco overall, but some deltas. The bottom line for enterprise players is one, they're just getting started in this new disruption journey that they're on as the stack disaggregates. 
Two, all these players have experience in delivering horizontal solutions, but are now working with partners and identifying big problems to be solved, and three, many of these companies are generally not the fastest moving firms relative to smaller disruptors. Now, cloud has been an exception in fairness. But the good news for the legacy infrastructure and IT companies is that the telco transformation and the 5G buildout is going to take years. So it's moving at a pace that is very favorable to many of these companies. Okay, so looking at just some of the pre-announcement highlights that have hit the wire this week, I want to give you a glimpse of the diversity of innovation that is occurring in the telecommunication space. You got semiconductor manufacturers, device makers, network equipment players, carriers, cloud vendors, enterprise tech companies, software companies, startups. Now you'll see in this list we've included OpenRAN, that logo, because there's so much buzz around the topic and we're going to come back to that. But suffice it to say, there's no way we can cover all the announcements from the 2,000-plus exhibitors at the show. So we're going to cherry pick here and make a few call outs. Hewlett Packard Enterprise announced an acquisition of an Italian private cellular network company called AthoNet. Zeus Kerravala wrote about it on SiliconANGLE if you want more details. Now interestingly, HPE has a partnership with Celona, which also does private 5G. But according to Zeus, Celona is more of an out-of-the-box solution, whereas AthoNet is designed for the core and requires more integration. And as you'll see in a moment, there's going to be a lot of talk at the show about private networks. There's going to be a lot of news there from other competitors, and we're going to be watching that closely. And while many are concerned about the P5G, private 5G, encroaching on wifi, Kerravala doesn't see it that way. 
Rather, he feels that these private networks are really designed for more industrial and, you know, mission-critical environments, like factories and warehouses that are run by robots, et cetera. 'Cause these can justify the increased expense of private networks. Whereas wifi remains a very low cost and flexible option for, you know, offices and homes. Now, over to Dell. Dell announced its intent to go hard after opening up the telco network with the announcement that in the second half of this year it's going to begin shipping its infrastructure blocks for Red Hat. Remember it's kind of like the converged infrastructure for telco with a more open ecosystem and sort of more flexible, you know, more mature engineered system. Dell has also announced a range of PowerEdge servers for a variety of use cases. A big wide line bringing forth its 16G portfolio and aiming squarely at the telco space. Dell also announced, here we go, a private wireless offering with Airspan and Expeto, and a solution with AthoNet, the company HPE announced it was purchasing. So I guess Dell and HPE are now partnering up in the private wireless space, and yes, hell is freezing over folks. We'll see where that relationship goes in the mid- to long-term. Dell also announced new lab and certification capabilities, which we said last week was going to be critical for the further adoption of open ecosystem technology. So props to Dell for, you know, putting real emphasis and investment in that. AWS also made a number of announcements in this space including private wireless solutions and associated managed services. AWS named Deutsche Telekom, Orange, T-Mobile, Telefonica, and some others as partners. And AWS announced a stepped-up partnership, specifically with T-Mobile, to bring AWS services to T-Mobile's network portfolio. Snowflake, back to Snowflake, announced its telecom data cloud. 
Remember we showed the data earlier, Snowflake's not as strong in the telco sector, but they're continuing to move toward this go-to-market alignment within key industries, realigning their go-to-market by vertical. It also announced that AT&T, and a number of other partners, are collaborating to break down data silos specifically in telco. Look, essentially, this is Snowflake taking its core value prop to the telco vertical and forming key partnerships that resonate in the space. So think simplification, breaking down silos, data sharing, eventually data monetization. Samsung previewed its future capability to allow smartphones to access satellite services, something Apple has previously done. AMD, Intel, Marvell, Qualcomm, are all in the act, all the semiconductor players. Qualcomm, for example, announced along with Telefonica and Ericsson a 5G millimeter wave network that will be showcased in Spain at the event this coming week using the Qualcomm Snapdragon chipset platform, based on none other than Arm technology. Of course, Arm we said is going to dominate the edge, and is clearly doing so. It's got the volume advantage over, you know, traditional Intel, you know, x86 architectures. And it's no surprise that Microsoft is touting its OpenAI relationship. You're going to hear a lot of AI talk at this conference, as AI is, you know, the now topic. All right, we could go on and on and on. There's just so much going on at Mobile World Congress or MWC, that we just wanted to give you a glimpse of some of the highlights that we've been watching. Which brings us to the key topics and issues that we'll be exploring at MWC next week. We touched on some of this last week. A big topic of conversation will of course be, you know, 5G. Is it ever going to become real? Is anybody ever going to make money at 5G? There's so much excitement and anticipation around 5G. 
It has not lived up to the hype, but that's because the rollout, as we've previously reported, is going to take years. And part of that rollout is going to rely on the disaggregation of the hardened telco stack, as we reported last week and in previous Breaking Analysis episodes. OpenRAN is a big component of that evolution. You know, as are RAN intelligent controllers, RICs, which are essentially the brain of OpenRAN, if you will. Now as we build out 5G networks at massive scale and accommodate unprecedented volumes of data and apply compute-hungry AI to all this data, the issue of energy efficiency is going to be front and center. It has to be. Not only is it a, you know, hot political issue, the reality is that improving power efficiency is compulsory or the whole vision of telco's future is going to come crashing down. So chip manufacturers, equipment makers, cloud providers, everybody is going to be doubling down and clicking on this topic. Let's talk about AI. AI, as we said, is the hot topic right now, but it is happening not only in consumer, with things like ChatGPT. And think about the theme of this Breaking Analysis: in the enterprise, AI cannot be ChatGPT. It cannot be error prone the way ChatGPT is. It has to be clean, reliable, governed, accurate. It's got to be ethical. It's got to be trusted. Okay, we're going to have Zeus Kerravala on the show next week and definitely want to get his take on private networks and how they're going to impact wifi. You know, will private networks cannibalize wifi? If not, why not? He wrote about this again on SiliconANGLE if you want more details, and we're going to unpack that on theCUBE this week. And finally, as always we'll be following the data flows to understand where and how telcos, cloud players, startups, software companies, disruptors, legacy companies, end customers, how are they going to make money from new data opportunities? 'Cause we often say in theCUBE, don't ever bet against data. 
All right, that's a wrap for today. Remember theCUBE is going to be on location at MWC 2023 next week. We got a great set. We're in the walkway in between halls four and five, right in Congress Square, stand CS-60. Look for us, we got a full schedule. If you got a great story or you have news, stop by. We're going to try to get you on the program. I'll be there with Lisa Martin, co-hosting, David Nicholson as well, and the entire CUBE crew, so don't forget to come by and see us. I want to thank Alex Myerson, who's on production and manages the podcast, and Ken Schiffman, as well, in our Boston studio. Kristen Martin and Cheryl Knight help get the word out on social media and in our newsletters. And Rob Hof is our editor-in-chief over at SiliconANGLE.com. He does some great editing. Thank you. All right, remember all these episodes they are available as podcasts wherever you listen. All you got to do is search Breaking Analysis podcasts. I publish each week on Wikibon.com and SiliconANGLE.com. All the video content is available on demand at theCUBE.net, or you can email me directly if you want to get in touch David.Vellante@SiliconANGLE.com or DM me @DVellante, or comment on our LinkedIn posts. And please do check out ETR.ai for the best survey data in the enterprise tech business. This is Dave Vellante for theCUBE Insights, powered by ETR. Thanks for watching. We'll see you next week at Mobile World Congress '23, MWC '23, or next time on Breaking Analysis. (bright music)

Published Date : Feb 25 2023

ENTITIES

Entity	Category	Confidence
David Nicholson	PERSON	0.99+
Lisa Martin	PERSON	0.99+
Alex Myerson	PERSON	0.99+
Orange	ORGANIZATION	0.99+
Qualcomm	ORGANIZATION	0.99+
HPE	ORGANIZATION	0.99+
Telefonica	ORGANIZATION	0.99+
Kristen Martin	PERSON	0.99+
AWS	ORGANIZATION	0.99+
Dave Vellante	PERSON	0.99+
AMD	ORGANIZATION	0.99+
Spain	LOCATION	0.99+
T-Mobile	ORGANIZATION	0.99+
Ken Schiffman	PERSON	0.99+
Deutsche Telekom	ORGANIZATION	0.99+
Hewlett Packard Enterprise	ORGANIZATION	0.99+
IBM	ORGANIZATION	0.99+
Cisco	ORGANIZATION	0.99+
Cheryl Knight	PERSON	0.99+
Marvell	ORGANIZATION	0.99+
Microsoft	ORGANIZATION	0.99+
Samsung	ORGANIZATION	0.99+
Apple	ORGANIZATION	0.99+
AT&T	ORGANIZATION	0.99+
Dell	ORGANIZATION	0.99+
Intel	ORGANIZATION	0.99+
Rob Hof	PERSON	0.99+
Palo Alto	LOCATION	0.99+
Oracle	ORGANIZATION	0.99+
40%	QUANTITY	0.99+
last week	DATE	0.99+
AthoNet	ORGANIZATION	0.99+
Erickson	ORGANIZATION	0.99+
Congress Square	LOCATION	0.99+
Accenture	ORGANIZATION	0.99+
next week	DATE	0.99+
Mobile World Congress	EVENT	0.99+
Solana	ORGANIZATION	0.99+
Boston	LOCATION	0.99+
two dimensions	QUANTITY	0.99+
ETR	ORGANIZATION	0.99+
MWC '23	EVENT	0.99+
MWC	EVENT	0.99+
288	QUANTITY	0.98+
today	DATE	0.98+
this week	DATE	0.98+
Solona	ORGANIZATION	0.98+
David.Vellante@SiliconANGLE.com	OTHER	0.98+
telco	ORGANIZATION	0.98+
Two	QUANTITY	0.98+
each week	QUANTITY	0.97+
Zeus Kerravala	PERSON	0.97+
MWC 2023	EVENT	0.97+
about five points	QUANTITY	0.97+
theCUBE.net	OTHER	0.97+
Red Hat	ORGANIZATION	0.97+
Snowflake	TITLE	0.96+
one	QUANTITY	0.96+
Databricks	ORGANIZATION	0.96+
three	QUANTITY	0.96+
theCUBE Studios	ORGANIZATION	0.96+

Robert Nishihara, Anyscale | CUBE Conversation

(upbeat instrumental) >> Hello and welcome to this CUBE conversation. I'm John Furrier, host of theCUBE, here in Palo Alto, California. Got a great conversation with Robert Nishihara who's the co-founder and CEO of Anyscale. Robert, great to have you on this CUBE conversation. It's great to see you. We did your first Ray Summit a couple years ago and congratulations on your venture. Great to have you on. >> Thank you. Thanks for inviting me. >> So you're a first-time CEO out of Berkeley in data. You've got Databricks coming out of there. You got a bunch of activity coming from Berkeley. It really is kind of where a lot of innovation's going on in data. Anyscale has been one of those startups that has risen out of that scene. Right? You look at the success of what the data lakes are now. Now you've got the generative AI. This has been a really interesting innovation market. This new wave is coming. Tell us what's going on with Anyscale right now, as you guys are gearing up and getting some growth. What's happening with the company? >> Yeah, well one of the most exciting things that's been happening in computing recently is the rise of AI and the excitement about AI, and the potential for AI to really transform every industry. Now of course, one of the biggest challenges to actually making that happen is that AI is incredibly computationally intensive, right? To actually succeed with AI, to actually get value out of AI, you're typically not just running it on your laptop, you're often running it and scaling it across thousands of machines, or hundreds of machines or GPUs. And so organizations and companies and businesses that do AI often end up building a large infrastructure team to manage the distributed systems, the computing to actually scale these applications. And that's a huge software engineering lift, right? And so, one of the goals for Anyscale is really to make that easy. 
To get to the point where developers and teams and companies can succeed with AI. Can build these scalable AI applications without a huge investment in infrastructure, without a lot of expertise in infrastructure, where really all they need to know is how to program on their laptop, how to program in Python. And if you have that, then that's really all you need to succeed with AI. So that's what we've been focused on. We're building Ray, which is an open source project that's been starting to get adopted by tons of companies, to actually train these models, to deploy these models, to do inference with these models, you know, to ingest and pre-process their data. And our goals, you know, here with the company are really to make Ray successful. To grow the Ray community, and then to build a great product around it and simplify the development and deployment, and productionization of machine learning for all these businesses. >> It's a great trend. Everyone wants developer productivity, we're seeing that clearly right now. And plus, developers are voting literally on what standards become. As you look at how the market is open source driven, I love the model, love the Ray project, love the Anyscale value proposition. How big are you guys now, and how is that value proposition of Ray and Anyscale and foundational models coming together? Because it seems like you guys are in a perfect storm situation where you guys could get a real tailwind and draft off the megatrend that everyone's getting excited about. The new toy is ChatGPT. So you got to look at that and say, hey, I mean, come on, you guys did all the heavy lifting. >> Absolutely. >> You know, how many people are you, and what's the proposition for you guys these days? >> You know, our company's about a hundred people, a bit larger than that. Ray's been growing really quickly. 
It's been, you know, companies like OpenAI use Ray to train their models, like ChatGPT. Companies like Uber run all their deep learning, you know, and classical machine learning on top of Ray. Companies like Shopify, Spotify, Netflix, Cruise, Lyft, Instacart, you know, ByteDance. A lot of these companies are investing heavily in Ray for their machine learning infrastructure. And I think it's gotten to the point where, if you're one of these, you know, type of businesses, and you're looking to revamp your machine learning infrastructure, if you're looking to enable new capabilities, you know, make your teams more productive, speed up the experimentation cycle, you know, make it more performant, run applications that are more scalable, run them faster, run them in a more cost efficient way, all of these types of companies are at least evaluating Ray, and Ray is an increasingly common choice there. Many of these companies that end up not using Ray often end up building their own infrastructure. So the growth there has been incredibly exciting. You know, we had our first in-person Ray Summit just back in August, and we're planning the next one for this coming September. And so when you asked about the value proposition, I think there's really two main things when people choose to go with Ray and Anyscale. One reason is about moving faster, right? It's about developer productivity, it's about speeding up the experimentation cycle, easily getting their models in production. You know, we hear many companies say that once they prototype a model, once they develop a model, it's another eight weeks, or 12 weeks to actually get that model in production. And that's a reason they talk to us. 
We hear companies say that, you know they've been training their models and, and doing inference on a single machine, and they've been sort of scaling vertically, like using bigger and bigger machines. But they, you know, you can only do that for so long, and at some point you need to go beyond a single machine and that's when they start talking to us. Right? So one of the main value propositions is around moving faster. I think probably the phrase I hear the most is, companies saying that they don't want their machine learning people to have to spend all their time configuring infrastructure. All this is about productivity. >> Yeah. >> The other. >> It's the big brains in the company. That are being used to do remedial tasks that should be automated right? I mean that's. >> Yeah, and I mean, it's hard stuff, right? It's also not these people's area of expertise, and or where they're adding the most value. So all of this is around developer productivity, moving faster, getting to market faster. The other big value prop and the reason people choose Ray and choose Anyscale, is around just providing superior infrastructure. This is really, can we scale more? You know, can we run it faster, right? Can we run it in a more cost effective way? We hear people saying that they're not getting good GPU utilization with the existing tools they're using, or they can't scale beyond a certain point, or you know they don't have a way to efficiently use spot instances to save costs, right? Or their clusters, you know can't auto scale up and down fast enough, right? These are all the kinds of things that Ray and Anyscale, where Ray and Anyscale add value and solve these kinds of problems. >> You know, you bring up great points. Auto scaling concept, early days, it was easy getting more compute. Now it's complicated. They're built into more integrated apps in the cloud. And you mentioned those companies that you're working with, that's impressive. 
Those are like the big hardcore, I call them hardcore. They have good technical teams. And as the wave starts to move from these companies that were hyper scaling up all the time, the mainstream are just developers, right? So you need an interface in, so I see the dots connecting with you guys and I want to get your reaction. Is that how you see it? That you got the alphas out there kind of kicking butt, building their own stuff, alpha developers and infrastructure. But mainstream just wants programmability. They want that heavy lifting taken care of for them. Is that kind of how you guys see it? I mean, take us through that. Because to get crossover to be democratized, the automation's got to be there. And for developer productivity to be in, it's got to be coding and programmability. >> That's right. Ultimately for AI to really be successful, and really, you know, transform every industry in the way we think it has the potential to, it has to be easier to use, right? And being easier to use, there's many dimensions to that. But an important one is that as a developer, to do AI, you shouldn't have to be an expert in distributed systems. You shouldn't have to be an expert in infrastructure. If you do have to be, that's going to really limit the number of people who can do this, right? And all of the companies we talk to, they don't want to be in the business of building and managing infrastructure. It's not that they can't do it. But it's going to slow them down, right? They want to allocate their time and their energy toward building their product, right? To building a better product, getting their product to market faster. And if we can take the infrastructure work off of the critical path for them, that's going to speed them up, it's going to simplify their lives. And I think that is critical for really enabling all of these companies to succeed with AI. 
>> Talk about the customers you guys are talking to right now, and how that translates over. Because I think you hit a good thread there. Data infrastructure is critical. Managed services are coming online, open source is continuing to grow. You have these people building their own, and then if they abandon it or don't scale it properly, there's kind of consequences. 'Cause it's a system, you mentioned, it's a distributed system architecture. It's not as easy as standing up a monolithic app these days. So when you guys go to the marketplace and talk to customers, put the customers in buckets. So you got the ones that are kind of leaning in, that are pretty peaked, probably working with you now, open source. And then what's the customer profile look like as you go mainstream? Are they looking for a managed service, looking for more of a system architecture approach? What's the Anyscale progression? How do you engage with your customers? What are they telling you? >> Yeah, so many of these companies, yes, they're looking for managed infrastructure 'cause they want to move faster, right? Now the profiles of these different customers, there are three main workloads that companies run on Anyscale, run with Ray. It's training related workloads, and it is serving and deployment related workloads, like actually deploying your models, and it's batch processing, batch inference related workloads. Like imagine you want to do computer vision on tons and tons of images or videos, or you want to do natural language processing on millions of documents or audio, or speech or things like that, right? So I would say there's a pretty large variety of use cases, but the most common, you know, we see tons of people working with computer vision data, you know, computer vision problems, natural language processing problems. And it's across many different industries. 
We work with companies doing drug discovery, companies doing, you know, gaming or e-commerce, right? Companies doing robotics or agriculture. So there's a huge variety of the types of industries that can benefit from AI, and can really get a lot of value out of AI. But the problems are the same problems that they all want to solve. It's like how do you make your team move faster, you know, succeed with AI, be more productive, speed up the experimentation, and also how do you do this in a more performant way, in a faster, cheaper, more cost efficient, more scalable way. >> It's almost like the cloud game is coming back to AI and these foundational models, because I was just on a podcast, we recorded our weekly podcast, and I was just riffing with Dave Vellante, my co-host on this, we're like, hey, in the early days of Amazon, if you wanted to build an app, you had to build a data center, and now you go to the cloud, cloud's easier, pay a little money, pennies on the dollar, you get your app up and running. Cloud computing is born. With foundation models in generative AI, the old model was hard, heavy lifting, expensive, build out, before you get to do anything, as you mentioned, time. So I got to think that you're pretty much in a good position with this foundational model trend in generative AI because I just looked at the foundation models map of the ecosystem. You're starting to see layers of, you got the tooling, you got platform, you got cloud. It's filling out really quickly. So why is Anyscale important to this new trend? How do you talk to people when they ask you, you know, what does ChatGPT mean for Anyscale? And how does the foundational model growth fit into your plan? >> Well, foundational models are hugely important for the industry broadly. Because you're going to have these really powerful models that, you know, have been trained on tremendous amounts of data, tremendous amounts of compute, and that are useful out of the box, right? 
That people can start to use, and query, and get value out of, without necessarily training these huge models themselves. Now Ray and Anyscale fit in, in a number of places. First of all, they're useful for creating these foundation models. Companies like OpenAI, you know, use Ray for this purpose. Companies like Cohere use Ray for these purposes. You know, IBM. There's of course also open source versions like GPT-J, you know, created using Ray. So a lot of these large language models, large foundation models benefit from training on top of Ray. But of course for every company training and creating these huge foundation models, you're going to have many more that are fine tuning these models with their own data. That are deploying and serving these models for their own applications, that are building other application and business logic around these models. And that's where Ray also really shines, because Ray, you know, can provide common infrastructure for all of these workloads. The training, the fine tuning, the serving, the data ingest and pre-processing, right? The hyperparameter tuning, and so on. And so the reason Ray and Anyscale are important here is that, again, foundation models are large, foundation models are compute intensive, and, you know, both creating and using these foundation models requires tremendous amounts of compute. And there's a big infrastructure lift to make that happen. So either you are using Ray and Anyscale to do this, or you are building the infrastructure and managing the infrastructure yourself. Which you can do, but it's hard. >> Good luck with that. I always say good luck with that. I mean, I think if you really need to build that hardened foundation, you got to go all the way. And I think this idea of composability is interesting. How is Ray working with OpenAI for instance? 
Take us through that. Because I think you're going to see a lot of people talking about, okay, I got trained models, but I'm going to have not one, I'm going to have many. There's big debate that OpenAI is going to be the mother of all LLMs, but really people are also saying there are going to be many more, either purpose-built or specific. The fusion and these things come together, there's like a blending of data, and that seems to be a value proposition. How does Ray help these guys get their models up? Can you take us through what Ray's doing for, say, OpenAI and others, and how do you see the models interacting with each other? >> Yeah, great question. So where OpenAI uses Ray right now is for the training workloads. Training both to create ChatGPT and models like that. There's both a supervised learning component, where you're doing supervised pre-training with example data. There's also a reinforcement learning component, where you are fine-tuning the model and continuing to train the model, but based on human feedback, based on input from humans saying that, you know, this response to this question is better than this other response to this question, right? And so Ray provides the infrastructure for scaling the training across many, many GPUs, many, many machines, and really running that in an efficient, you know, performant, fault-tolerant way, right? And so, you know, this is not the first version of OpenAI's infrastructure, right? They've gone through iterations where they did start with building the infrastructure themselves. They were using tools like MPI. But at some point, you know, given the complexity, given the scale of what they're trying to do, you hit a wall with MPI and that's going to happen with a lot of other companies in this space. And at that point you don't have many other options other than to use Ray or to build your own infrastructure. >> That's awesome.
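[Editor's sketch] The human-feedback step described above starts from comparisons of the form "this response is better than that one." One common way to turn such a comparison into a training signal is a pairwise logistic (Bradley-Terry style) loss on the scores a reward model assigns to the two responses. The interview does not say which objective OpenAI uses, so treat this as an assumption for illustration; the function name and arguments are hypothetical.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred response wins.

    Assumes P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected).
    The loss is small when the model scores the preferred response higher,
    and large when it scores the rejected response higher.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Equal scores: the model is indifferent, so the loss is log(2).
    print(preference_loss(0.0, 0.0))
    # Preferred response scored higher: lower loss.
    print(preference_loss(2.0, 0.0))
```

Summing this loss over many labeled pairs is the kind of workload that then has to be spread across many GPUs and machines, which is the infrastructure problem the speaker is describing.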
And then your vision on this data interaction, because the old days monolithic models were very rigid. You couldn't really interface with them. But we're kind of seeing this future of data fusion, data interaction, data blending at large scale. What's your vision? How do you, what's your vision of where this goes? Because if this goes the way people think. You can have this data chemistry kind of thing going on where people are integrating all kinds of data with each other at large scale. So you need infrastructure, intelligence, reasoning, a lot of code. Is this something that you see? What's your vision in all this? Take us through. >> AI is going to be used everywhere right? It's, we see this as a technology that's going to be ubiquitous, and is going to transform every business. I mean, imagine you make a product, maybe you were making a tool like Photoshop or, or whatever the, you know, tool is. The way that people are going to use your tool, is not by investing, you know, hundreds of hours into learning all of the different, you know specific buttons they need to press and workflows they need to go through it. They're going to talk to it, right? They're going to say, ask it to do the thing they want it to do right? And it's going to do it. And if it, if it doesn't know what it's want, what it's, what's being asked of it. It's going to ask clarifying questions, right? And then you're going to clarify, and you're going to have a conversation. And this is going to make many many many kinds of tools and technology and products easier to use, and lower the barrier to entry. And so, and this, you know, many companies fit into this category of trying to build products that, and trying to make them easier to use, this is just one kind of way it can, one kind of way that AI will will be used. But I think it's, it's something that's pretty ubiquitous. >> Yeah. 
It'll be efficient, it'll be efficiency up and down the stack, and will change the productivity equation completely. You just highlighted one, I don't want to fill out forms, just stand up my environment for me. And then start coding away. Okay, well, this is great stuff. Final word for the folks out there watching, obviously new kind of skill set for hiring. You guys got engineers, give a plug for the company, for Anyscale. What are you looking for? What are you guys working on? Take the last minute to put a plug in for the company. >> Yeah, well, if you're interested in AI and if you think AI is really going to be transformative, and really be useful for all these different industries, we are trying to provide the infrastructure to enable that to happen, right? So I think there's the potential here to really solve an important problem, to get to the point where developers don't need to think about infrastructure, don't need to think about distributed systems. All they think about is their application logic, and what they want their application to do. And I think if we can achieve that, you know, we can be the foundation or the platform that enables all of these other companies to succeed with AI. So that's where we're going. I think something like this has to happen if AI is going to achieve its potential. We're hiring across the board, you know, great engineers, on the go-to-market side, product managers, you know, people who want to really, you know, make this happen. >> Awesome, well, congratulations. I know you got some good funding behind you. You're in a good spot. I think this is happening. I think generative AI and foundation models is going to be the next big inflection point, as big as the PC, internetworking, internet, and smartphones. This is a whole nother application framework, a whole nother set of things. So this is the ground floor. Robert, you and your team are right there. Well done. >> Thank you so much.
>> All right. Thanks for coming on this CUBE conversation. I'm John Furrier with theCUBE. Breaking down a conversation around AI and scaling up in this new next major inflection point. This next wave is foundational models, generative AI. And thanks to ChatGPT, the whole world's now knowing about it. So it really is changing the game and Anyscale is right there, one of the hot startups, that is in good position to ride this next wave. Thanks for watching. (upbeat instrumental)

Published Date : Feb 24 2023

ENTITIES

Entity	Category	Confidence
Dave Vellante	PERSON	0.99+
IBM	ORGANIZATION	0.99+
Robert Nishihara	PERSON	0.99+
John Furrier	PERSON	0.99+
12 weeks	QUANTITY	0.99+
Robert	PERSON	0.99+
Uber	ORGANIZATION	0.99+
Lyft	ORGANIZATION	0.99+
Shopify	ORGANIZATION	0.99+
eight weeks	QUANTITY	0.99+
Spotify	ORGANIZATION	0.99+
Netflix	ORGANIZATION	0.99+
August	DATE	0.99+
September	DATE	0.99+
Palo Alto, California	LOCATION	0.99+
Cruise	ORGANIZATION	0.99+
Amazon	ORGANIZATION	0.99+
Instacart	ORGANIZATION	0.99+
Anyscale	ORGANIZATION	0.99+
first	QUANTITY	0.99+
Photoshop	TITLE	0.99+
One reason	QUANTITY	0.99+
Bike Dance	ORGANIZATION	0.99+
Ray	ORGANIZATION	0.99+
Python	TITLE	0.99+
thousands of machines	QUANTITY	0.99+
Berkeley	LOCATION	0.99+
two main things	QUANTITY	0.98+
single machine	QUANTITY	0.98+
Cohere	ORGANIZATION	0.98+
Ray and Anyscale	ORGANIZATION	0.98+
millions of documents	QUANTITY	0.98+
both	QUANTITY	0.98+
one kind	QUANTITY	0.96+
first version	QUANTITY	0.95+
CUBE	ORGANIZATION	0.95+
about a hundred people	QUANTITY	0.95+
hundreds of machines	QUANTITY	0.95+
one	QUANTITY	0.95+
OpenAI	ORGANIZATION	0.94+
First	QUANTITY	0.94+
hundreds of hours	QUANTITY	0.93+
first time	QUANTITY	0.93+
Databricks	ORGANIZATION	0.91+
Ray and Anyscale	ORGANIZATION	0.9+
tons	QUANTITY	0.89+
couple years ago	DATE	0.88+
Ray and	ORGANIZATION	0.86+
ChatGPT	TITLE	0.81+
tons of people	QUANTITY	0.8+

Ed Walsh & Thomas Hazel | A New Database Architecture for Supercloud

(bright music) >> Hi, everybody, this is Dave Vellante, welcome back to Supercloud 2. Last August, at the first Supercloud event, we invited the broader community to help further define Supercloud, we assessed its viability, and identified the critical elements and deployment models of the concept. The objectives here at Supercloud too are, first of all, to continue to tighten and test the concept, the second is, we want to get real world input from practitioners on the problems that they're facing and the viability of Supercloud in terms of applying it to their business. So on the program, we got companies like Walmart, Sachs, Western Union, Ionis Pharmaceuticals, NASDAQ, and others. And the third thing that we want to do is we want to drill into the intersection of cloud and data to project what the future looks like in the context of Supercloud. So in this segment, we want to explore the concept of data architectures and what's going to be required for Supercloud. And I'm pleased to welcome one of our Supercloud sponsors, ChaosSearch, Ed Walsh is the CEO of the company, with Thomas Hazel, who's the Founder, CTO, and Chief Scientist. Guys, good to see you again, thanks for coming into our Marlborough studio. >> Always great. >> Great to be here. >> Okay, so there's a little debate, I'm going to put you right in the spot. (Ed chuckling) A little debate going on in the community started by Bob Muglia, a former CEO of Snowflake, and he was at Microsoft for a long time, and he looked at the Supercloud definition, said, "I think you need to tighten it up a little bit." So, here's what he came up with. He said, "A Supercloud is a platform that provides a programmatically consistent set of services hosted on heterogeneous cloud providers." So he's calling it a platform, not an architecture, which was kind of interesting. And so presumably the platform owner is going to be responsible for the architecture, but Dr. 
Nelu Mihai, who's a computer scientist behind the Cloud of Clouds Project, he chimed in and responded with the following. He said, "Cloud is a programming paradigm supporting the entire lifecycle of applications with data and logic natively distributed. Supercloud is an open architecture that integrates heterogeneous clouds in an agnostic manner." So, Ed, words matter. Is this an architecture or is it a platform? >> Put us on the spot. So, I'm sure you have concepts, I would say it's an architectural or design principle. Listen, I look at Supercloud as a mega trend, just like cloud, just like data analytics. And some companies are using the principle, design principles, to literally get dramatically ahead of everyone else. I mean, things you couldn't possibly do if you didn't use cloud principles, right? So I think it's a Supercloud effect, you're able to do things you're not able to. So I think it's more a design principle, but if you do it right, you get dramatic effect as far as customer value. >> So the conversation that we were having with Muglia, and Tristan Handy of dbt Labs, was, I'll set it up as the following, and, Thomas, would love to get your thoughts, if you have a CRM, think about applications today, it's all about forms and codifying business processes, you type a bunch of stuff into Salesforce, and all the salespeople do it, and this machine generates a forecast. What if you have this new type of data app that pulls data from the transaction system, the e-commerce, the supply chain, the partner ecosystem, et cetera, and then, without humans, actually comes up with a plan. That's their vision. And Muglia was saying, in order to do that, you need to rethink data architectures and database architectures specifically, you need to get down to the level of how the data is stored on the disc. What are your thoughts on that? >> Well, first of all, I'm going to cop out, I think it's actually both.
I do think it's a design principle, I think it's not open technology, but open APIs, open access, and you can build a platform on that design principle architecture. Now, I'm a database person, I love solving the database problems. >> I waited for you to launch into this. >> Yeah, so I mean, you know, Snowflake is a database, right? It's a distributed database. And we wanted to crack those codes, because, multi-region, multi-cloud, customers wanted access to their data, and their data is in a variety of forms, all these services that you've talked about. And so what I saw as a core principle was cloud object storage, everyone streams their data to cloud object storage. From there we said, well, how about we rethink database architecture, rethink file format, so that we can take each one of these services and bring them together, whether distributively or centrally, such that customers can access and get answers, whether it's operational data, whether it's business data, AKA search, or SQL, complex distributed joins. But we had to rethink the architecture. I like to say we're not a first generation, or a second, we're a third generation distributed database on pure, pure cloud storage, no caching, no SSDs. Why? Because all that availability, the cost of time, is a struggle, and cloud object storage, we think, is the answer. >> So when you're saying no caching, so when I think about how companies are solving some, you know, pretty hairy problems, take MySQL HeatWave, everybody thought Oracle was going to just forget about MySQL, well, they come out with HeatWave. And the way they solve problems, and you see their benchmarks against Amazon, "Oh, we crush everybody," is they put it all in memory. So you said no caching? You're not getting performance through caching? How is that true, and how are you getting performance? >> Well, so five, six years ago, right?
When you realize that cloud object storage is going to be everywhere, and it's going to be a core foundational, if you will, fabric, what would you do? Well, a lot of times the second generation say, "We'll take it out of cloud storage, put in SSDs or something, and put into cache." And that adds a lot of time, adds a lot of costs. But I said, what if, what if we could actually make the first read hot, the first read distributed joins and searching? And so what we set out to do was say, we can't cache, because that adds time, that adds cost. We have to make cloud object storage high performance, like it feels like a caching SSD. That's where our patents are, that's where our technology is, and we've spent many years working towards this. So, to me, if you can crack that code, a lot of these issues we're talking about, multi-region, multicloud, different services, everybody wants to send their data to the data lake, but then they move it out, we said, "Keep it right there." >> You nailed it, the data gravity. So, Bob's right, the data's coming in, and you need to get the data from everywhere, but you need an environment that you can deal with all that different schema, all the different type of technology, but also at scale. Bob's right, you cannot use memory or SSDs to cache that, that doesn't scale, it doesn't scale cost effectively. But if you could, and what you did, is you made object storage, S3 first, but object storage, the only persistence by doing that. And then we get performance, we should talk about it, it's literally, you know, hundreds of terabytes of queries, and it's done in seconds, it's done without memory caching. We have concepts of caching, but the only caching, the only persistence, is actually when we're doing caching, we're just keeping another side-eye track of things on the S3 itself.
So we're using, actually, the object storage to be a database, which is kind of where Bob was saying, we agree, but that's what you started at, people thought you were crazy. >> And maybe make it live. Don't think of it as archival or temporary space, make it live, real time streaming, operational data. What we do is make it smart, we see the data coming in, we uniquely index it such that you can get your use cases, that are search, observability, security, or backend operational. But we don't have to have this, I dunno, static, fixed, siloed type of architecture technologies that were traditionally built prior to Supercloud thinking. >> And you don't have to move everything, essentially, you can do it wherever the data lands, whatever cloud across the globe, you're able to bring it together, you get the cost effectiveness, because the only persistence is the cheapest storage persistent layer you can buy. But the key thing is you cracked the code. >> We had to crack the code, right? That was the key thing. >> That's where the patents are. >> And then once you do that, then everything else gets easier to scale, your architecture, across regions, across cloud. >> Now, it's a general purpose database, as Bob was saying, but we use that database to solve a particular issue, which is around operational data, right? So, we agree with Bob. >> Interesting. So this brings me to this concept of data mesh, Zhamak Dehghani is one of our speakers, you know, we talk about data fabric, which is, originally, a NetApp concept, Gartner's kind of co-opted it. But so, the basic concept is, data lives everywhere, whether it's an S3 bucket, or a SQL database, or a data lake, it's just a node on the data mesh. So in your view, how does this fit in with Supercloud? Ed, you've said that you've built, essentially, an enabler for that, for the data mesh, I think you're an enabler for the Supercloud-like principles. This is a big, chewy opportunity, and it requires, you know, a team approach.
There's got to be an ecosystem, there's not going to be one Supercloud to rule them all, so where does the ecosystem fit into the discussion, and where do you fit into the ecosystem? >> Right, so we agree completely, there's not one Supercloud in effect, but we use Supercloud principles to build our platform, and then, you know, the ecosystem's going to be built on leveraging what everyone else's secret powers are, right? So our power, our superpower, based upon what we built is, we deal with, if you're having any scale, or cost effective scale issues, with data, machine generated data, like business observability or security data, we are your force multiplier, we will take that in singularly, just let it, simply put it in your object storage wherever it sits, and we give you uniform access to that using open API access, SQL, or you know, Elasticsearch API. So, that's what we do, that's our superpower. So I'll play it into data mesh, that's a perfect, we are a node on a data mesh, but I'll play it in the soup about how, the ecosystem, we see it kind of playing, and we talked about it in just in the last couple days, how we see this kind of possibly. Short term, our superpowers, we deal with this data that's coming at these environments, people, customers, building out observability or security environments, or vendors that are selling their own Supercloud, I do observability, the Datadogs of the world, dot dot dot, the Splunks of the world, dot dot dot, and security. So what we do is we fit in naturally. What we do is a cost effective scale, just land it anywhere in the world, we deal with ingest, and it's a cost effective, an order of magnitude, or two or three orders of magnitude more cost effective. Allows them, their customers are asking them to do the impossible, "Give me fast monitoring alerting. I want it snappy, but I want it to keep two years of data, (laughs) and I want it cost effective." It doesn't work.
They're good at the fast monitoring alerting, we're good at the long-term retention. And yet there's some gray area between those two, but one to one is actually cheaper, so we would partner. So the first ecosystem plays, who wants to have the ability to, really, all the data's in those same environments, the security observability players, they can literally, just through API, drag our data into their point to grab. We can make it seamless for customers. Right now, we make it helpful to customers. Your Datadog, we make a button, easy go from Datadog to us for logs, save you money. Same thing with Grafana. But you can also look at ecosystem, those same vendors, it used to be a year ago it was, you know, it's all about how can you grow, like it's growth at all costs, now it's about COGS. So literally we can go an environment, you supply what your customer wants, but we can help with COGS. And one-on-one in a partnership is better than you trying to build on your own. >> Thomas, you were saying you make the first read fast, so you think about Snowflake. Everybody wants to talk about Snowflake and Databricks. So, Snowflake, great, but you got to get the data in there. All right, so that's, can you help with that problem? >> I mean we want simple in, right? And if you have to have structure in, you're not simple. So the idea that you have a simple in, data lake, schema-on-read type philosophy, but schema-on-write type performance. And so what I wanted to do, what we have done, is have that simple lake, and stream that data real time, and those access points of Search or SQL, to go after whatever business case you need, security observability, warehouse integration. But the key thing is, how do I make that click, click, click answer, and do it quickly? And so what we want to do is, that first read has to be fast. Why? 'Cause then you're going to do all this siloing, layers, complexity. If your first read's not fast, you're at a disadvantage, particularly in cost.
And nobody says I want less data, but everyone has to, whether they say we're going to shorten the window, we're going to use AI to choose, but in a security moment, when you don't have that answer, you're in trouble. And that's why we are this service, this Supercloud service, if you will, providing access, well-known search, well-known SQL type access, that if you just have one access point, you're at a disadvantage. >> We actually talked about Snowflake and BigQuery, and a different platform, Databricks. That's kind of where we see the phase two of ecosystem. One is easy, the low-hanging fruit is observability and security firms. But the next one is, what we do, our superpower is dealing with this messy data that schema is changing like night and day. Pipelines are tough, and it's changing all the time, but you want these things fast, and it's big data around the world. That's the next point, just use us alongside, or inside, one of their platforms, and now we get the best of both worlds. Our superpower is keeping this messy data as a streaming, okay, not a batch thing, allow you to do that. So, that's the second one. And then to be honest, the third one, which plays to Supercloud, it also plays perfectly in the data mesh, is if you really go to the ultimate thing, what we have done is made object storage, S3, GCS, and blob storage, we made it a database. Put, get, complex query with big joins. You know, so back to your original thing, and Muglia teed it up perfectly, we've done that. Now imagine if that's an ecosystem, who would want that? If it's, again, it's uniform available across all the regions, across all the clouds, and it's right next to where you are building a service, or a client's trying, that's where the ecosystem, I think people are going to use Superclouds for their superpowers. We're really good at this, allows that short term. I think the Snowflakes and the Databricks are the medium term, you know?
And then I think eventually gets to, hey, listen if you can make object storage fast, you can just go after it with simple SQL queries, or elastic. Who would want that? I think that's where people are going to leverage it. It's not going to be one Supercloud, and we leverage the super clouds. >> Our viewpoint is smart object storage can be programmable, and so we agree with Bob, but we're not saying do it here, do it here. This core, fundamental layer across regions, across clouds, that everyone has? Simple in. Right now, it's hard to get data in for access for analysis. So we said, simply, we'll automate the entire process, give you API access across regions, across clouds. And again, how do you do a distributed join that's fast? How do you do a distributed join that doesn't cost you an arm or a leg? And how do you do it at scale? And that's where we've been focused. >> So prior, the cloud object store was a niche. >> Yeah. >> S3 obviously changed that. How standard is, essentially, object store across the different cloud platforms? Is that a problem for you? Is that an easy thing to solve? >> Well, let's talk about it. I mean we've fundamentally, yeah we've extracted it, but fundamentally, cloud object storage, put, get, and list. That's why it's so scalable, 'cause it doesn't have all these other components. That complexity is where we have moved up, and provide direct analytical API access. So because of its simplicity, and costs, and security, and reliability, it can scale naturally. I mean, really, distributed object storage is easy, it's put-get anywhere, now what we've done is we put a layer of intelligence, you know, call it smart object storage, where access is simple. So whether it's multi-region, do a query across, or multicloud, do a query across, or hunting, searching. >> We've had clients doing Amazon and Google, we have some Azure, but we see Amazon and Google more, and it's a consistent service across all of them. 
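[Editor's sketch] The speaker reduces cloud object storage to three calls (put, get, list) and describes layering "intelligence" on top so data can be searched where it lands. The toy below illustrates that idea only: an in-memory stand-in for a bucket plus a simple word index built over its objects. It is not ChaosSearch's technology and not any cloud provider's API; every class and function name here is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ObjectStore:
    """Toy in-memory stand-in for a bucket that exposes only put, get, and list."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> List[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def build_index(store: ObjectStore, prefix: str = "") -> Dict[str, Set[str]]:
    """Map each lowercase word to the keys of the objects that contain it.

    The objects themselves are never copied or moved; only the index is new.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for key in store.list_keys(prefix):
        for word in store.get(key).decode("utf-8").lower().split():
            index[word].add(key)
    return index


def search(index: Dict[str, Set[str]], word: str) -> List[str]:
    """Return the sorted keys of objects containing the word (case-insensitive)."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    store = ObjectStore()
    store.put("logs/a.txt", b"error disk full")
    store.put("metrics/c.txt", b"disk usage")
    print(search(build_index(store), "disk"))
```

In this sketch the index lives in process memory for brevity; the conversation describes keeping everything, including such side structures, in object storage itself.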
Just literally put your data in the bucket of choice, or folder of choice, click a couple buttons, literally click that to say "that's hot," and after that, it's hot, you can see it. But we're not moving data, the data gravity issue, that's the other. That it's already natively flowing to these pools of object storage across different regions and clouds. We don't move it, we index it right there, we're spinning up stateless compute, back to the Supercloud concept. But now that allows us to do all these other things, right? >> And it's no longer just cheap and deep object storage. Right? >> Yeah, we make it the same, like you have an analytic platform regardless of where you're at, you don't have to worry about that. Yeah, we deal with that, we deal with a stateless compute coming up -- >> And make it programmable. Be able to say, "I want this bucket to provide these answers." Right, that's really the hope, the vision. And the complexity to build the entire stack, and then connect them together, we said, the fabric is cloud storage, we just provide the intelligence on top. >> Let's bring it back to the customers, and one of the things we're exploring in Supercloud 2 is, you know, is Supercloud a solution looking for a problem? Is multicloud really a problem? I mean, you hear, you know, a lot of the vendor marketing says, "Oh, it's a disaster, because it's all different across the clouds." And I talked to a lot of customers even as part of Supercloud 2, they're like, "Well, I solved that problem by just going mono cloud." Well, but then you're not able to take advantage of a lot of the capabilities and the primitives that, you know, you like Google's data, or you like Microsoft's simplicity, their RPA, whatever it is. So what are customers telling you, what are their near term problems that they're trying to solve today, and how are they thinking about the future? >> Listen, it's a real problem. I think it started, I think this is a mega trend, just like cloud.
Just, cloud data, and I always add, analytics, are the mega trends. If you're looking at those, if you're not considering using the Supercloud principles, in other words, leveraging what I have, abstracting it out, and getting the most out of that, and then build value on top, I think you're not going to be able to keep up, in fact, no way you're going to keep up with this data volume. It's a geometric challenge, and you're trying to do linear things. So clients aren't necessarily asking, hey, for Supercloud, but they're really saying, I need to have a better mechanism to simplify this and get value across it, and how do you abstract that out to do that? And that's where they're obviously, our conversations are more amazed what we're able to do, and what they're able to do with our platform, because if you think of what we've done, the S3, or GCS, or object storage, is they can't imagine the ingest, they can't imagine how easy, time to glass, one minute, no matter where it lands in the world, querying this in seconds for hundreds of terabytes squared. People are amazed, but that's kind of, so they're not asking for that, but they are amazed. And then when you start talking on it, if you're an enterprise person, you're building a big cloud data platform, or doing data or analytics, if you're not trying to leverage the public clouds, and somehow leverage all of them, and then build on top, then I think you're missing it. So they might not be asking for it, but they're doing it. >> And they're looking for a lens, you mentioned all these different services, how do I bring those together quickly? You know, our viewpoint, our service, is I have all these streams of data, create a lens where they want to go after it via search, go after via SQL, bring them together instantly, no ETL-ing out, no define this table, put into this database. We said, let's have a service that creates a lens across all these streams, and then make those connections.
I want to take my CRM with my Google AdWords, and maybe my Salesforce, how do I do analysis? Maybe I want to hunt first, maybe I want to join, maybe I want to add another stream to it. And so our viewpoint is, it's so natural to get into these lake platforms and then provide lenses to get that access. >> And they don't want it separate, they don't want something different here, and different there. They want it basically -- >> So this is our industry, right? If something new comes out, remember virtualization came out, "Oh my God, this is so great, it's going to solve all these problems." And all of a sudden it just got to be this big, more complex thing. Same thing with cloud, you know? It started out with S3, and then EC2, and now hundreds and hundreds of different services. So, it's a complex matter for a lot of people, and this creates problems for customers, especially when you got divisions that are using different clouds, and you're saying that the solution, or a solution for the part of the problem, is to really allow the data to stay in place on S3, use that standard, super simple, but then give it what, Ed, you've called superpower a couple of times, to make it fast, make it inexpensive, and allow you to do that across clouds. >> Yeah, yeah. >> I'll give you guys the last word on that. >> No, listen, I think, we think Supercloud allows you to do a lot more. And for us, data, everyone says more data, more problems, more budget issue, everyone knows more data is better, and we show you how to do it cost effectively at scale. And we couldn't have done it without the design principles of we're leveraging the Supercloud to get capabilities, and because we use super, just the object storage, we're able to get these capabilities of ingest, scale, cost effectiveness, and then we built on top of this. 
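[Editor's sketch] The "lens" described above amounts to asking one SQL question across streams that arrived separately, for example CRM records and ad-spend records. The snippet below shows only the query shape, using Python's built-in sqlite3 module with an in-memory database as a stand-in; the service discussed in the interview is described as querying data in place on object storage, which this sketch does not attempt. The table names, columns, and function are hypothetical.

```python
import sqlite3
from typing import Iterable, List, Tuple


def spend_vs_revenue(
    crm_rows: Iterable[Tuple[str, float]],
    ad_rows: Iterable[Tuple[str, float]],
) -> List[Tuple[str, float, float]]:
    """Join ad spend with CRM revenue per campaign.

    Returns (campaign, total_spend, total_revenue) rows sorted by campaign,
    keeping only campaigns that appear in both streams.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE crm (campaign TEXT, revenue REAL)")
    conn.execute("CREATE TABLE ads (campaign TEXT, spend REAL)")
    conn.executemany("INSERT INTO crm VALUES (?, ?)", crm_rows)
    conn.executemany("INSERT INTO ads VALUES (?, ?)", ad_rows)
    rows = conn.execute(
        """
        SELECT a.campaign, a.spend, SUM(c.revenue) AS revenue
        FROM (SELECT campaign, SUM(spend) AS spend
              FROM ads GROUP BY campaign) AS a
        JOIN crm AS c ON c.campaign = a.campaign
        GROUP BY a.campaign, a.spend
        ORDER BY a.campaign
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    crm = [("spring", 100.0), ("spring", 50.0), ("fall", 20.0)]
    ads = [("spring", 30.0), ("fall", 10.0), ("winter", 5.0)]
    for row in spend_vs_revenue(crm, ads):
        print(row)
```

Spend is aggregated per campaign before the join so that several CRM rows for one campaign do not multiply the spend figure.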
In the end, a database is a data platform that allows you to go after everything distributed, and to get one platform for analytics, no matter where it lands, that's where we think the Supercloud concepts are perfect, that's where our clients are seeing it, and we're kind of excited about it. >> Yeah a third generation database, Supercloud database, however we want to phrase it, and make it simple, but provide the value, and make it instant. >> Guys, thanks so much for coming into the studio today, I really thank you for your support of theCUBE, and theCUBE community, it allows us to provide events like this and free content. I really appreciate it. >> Oh, thank you. >> Thank you. >> All right, this is Dave Vellante for John Furrier in theCUBE community, thanks for being with us today. You're watching Supercloud 2, keep it right there for more thought provoking discussions around the future of cloud and data. (bright music)

Published Date : Feb 17 2023

ENTITIES

Entity	Category	Confidence
Walmart	ORGANIZATION	0.99+
Dave Vellante	PERSON	0.99+
NASDAQ	ORGANIZATION	0.99+
Bob Muglia	PERSON	0.99+
Thomas	PERSON	0.99+
Thomas Hazel	PERSON	0.99+
Ionis Pharmaceuticals	ORGANIZATION	0.99+
Western Union	ORGANIZATION	0.99+
Ed Walsh	PERSON	0.99+
Bob	PERSON	0.99+
Microsoft	ORGANIZATION	0.99+
Nelu Mihai	PERSON	0.99+
Sachs	ORGANIZATION	0.99+
Tristan Handy	PERSON	0.99+
two	QUANTITY	0.99+
Amazon	ORGANIZATION	0.99+
Google	ORGANIZATION	0.99+
two years	QUANTITY	0.99+
Supercloud 2	TITLE	0.99+
first	QUANTITY	0.99+
Last August	DATE	0.99+
three	QUANTITY	0.99+
Oracle	ORGANIZATION	0.99+
Snowflake	ORGANIZATION	0.99+
both	QUANTITY	0.99+
dbt Labs	ORGANIZATION	0.99+
John Furrier	PERSON	0.99+
Ed	PERSON	0.99+
Gartner	ORGANIZATION	0.99+
Jimata Gan	PERSON	0.99+
third one	QUANTITY	0.99+
one minute	QUANTITY	0.99+
second	QUANTITY	0.99+
first generation	QUANTITY	0.99+
third generation	QUANTITY	0.99+
Grafana	ORGANIZATION	0.99+
second generation	QUANTITY	0.99+
second one	QUANTITY	0.99+
hundreds of terabytes	QUANTITY	0.98+
SQL	TITLE	0.98+
five	DATE	0.98+
one	QUANTITY	0.98+
Databricks	ORGANIZATION	0.98+
a year ago	DATE	0.98+
ChaosSearch	ORGANIZATION	0.98+
Muglia	PERSON	0.98+
MySQL	TITLE	0.98+
both worlds	QUANTITY	0.98+
third thing	QUANTITY	0.97+
Marlborough	LOCATION	0.97+
theCUBE	ORGANIZATION	0.97+
today	DATE	0.97+
Supercloud	ORGANIZATION	0.97+
Elasticsearch	TITLE	0.96+
NetApp	TITLE	0.96+
Datadog	ORGANIZATION	0.96+
One	QUANTITY	0.96+
EC2	TITLE	0.96+
each one	QUANTITY	0.96+
S3	TITLE	0.96+
one platform	QUANTITY	0.95+
Supercloud 2	EVENT	0.95+
first read	QUANTITY	0.95+
six years ago	DATE	0.95+

Daren Brabham & Erik Bradley | What the Spending Data Tells us About Supercloud

(gentle synth music) (music ends) >> Welcome back to Supercloud 2, an open industry collaboration between technologists, consultants, analysts, and of course practitioners to help shape the future of cloud. At this event, one of the key areas we're exploring is the intersection of cloud and data, and how building value on top of hyperscale clouds and across clouds is evolving, a concept of course we call "Supercloud". And we're pleased to welcome our friends from Enterprise Technology Research, Erik Bradley and Daren Brabham. Guys, thanks for joining us, great to see you. We love to bring the data into these conversations. >> Thank you for having us, Dave, I appreciate it. >> Yeah, thanks. >> You bet. And so, let me do the setup on what is Supercloud. It's a concept that we've floated, before re:Invent 2021, based on the idea that cloud infrastructure is becoming ubiquitous, incredibly powerful, but there's a lack of standards across the big three clouds. That creates friction. So we defined over the period of time, you know, better part of a year, a set of essential elements, deployment models for so-called supercloud, which create this common experience for specific cloud services that, of course, again, span multiple clouds and even on-premise data. So Erik, with that as background, I wonder if you could add your general thoughts on the term supercloud, maybe play proxy for the CIO community, 'cause you do these round tables, you talk to these guys all the time, you gather a lot of amazing information from senior IT DMs that complement your survey. So what are your thoughts on the term and the concept? >> Yeah, sure. I'll even go back to last year when you and I did our predictions panel, right? And we threw it out there. And to your point, you know, there's some haters. Anytime you throw out a new term, "Is it marketing buzz? Is it worth it? Why are you even doing it?" 
But you know, from my own perspective, and then also speaking to the IT DMs that we interview on a regular basis, this is just a natural evolution. It's something that's inevitable in enterprise tech, right? The internet was not built for what it has become. It was never intended to be the underlying infrastructure of our daily lives and work. The cloud also was not built to be what it's become. But where we're at now is, we have to figure out what the cloud is and what it needs to be to be scalable, resilient, secure, and have the governance wrapped around it. And to me that's what supercloud is. It's a way to define operantly, what the next generation, the continued iteration and evolution of the cloud is and what it needs to be. And that's what the supercloud means to me. And whether you want to call it metacloud, supercloud, it doesn't matter. The point is that we're trying to define the next layer, the next future of work, which is inevitable in enterprise tech. Now, from the IT DM perspective, I have two interesting call outs. One is from basically a senior developer IT architecture and DevSecOps who says he uses the term all the time. And the reason he uses the term is because multi-cloud has a stigma attached to it, when he is talking to his business executives. (David chuckles) The stigma is because it's complex and it's expensive. So he switched to supercloud to better explain to his business executives and his CFO and his CIO what he's trying to do. And we can get into more later about what it means to him. But the inverse of that, of course, is a good CSO friend of mine for a very large enterprise says the concern with Supercloud is the reduction of complexity. And I'll explain, he believes anything that takes the requirement of specific expertise out of the equation, even a little bit, as a CSO worries him. 
So as you said, David, always two sides to the coin, but I do believe supercloud is a relevant term, and it is necessary because the cloud is continuing to be defined. >> You know, that's really interesting too, 'cause you know, Daren, we use Snowflake a lot as an example, sort of early supercloud, and you think from a security standpoint, we've always pushed Amazon and, "Are you ever going to kind of abstract the complexity away from all these primitives?" and their position has always been, "Look, if we produce these primitives, and offer these primitives, we can move as the market moves. When you abstract, then it becomes harder to peel the layers." But Daren, from a data standpoint, like I say, we use Snowflake a lot. I think of like Tim Berners-Lee when Web 2.0 came out, he said, "Well this is what the internet was always supposed to be." So in a way, you know, supercloud is maybe what multi-cloud was supposed to be. But I mean, you think about data sharing, Daren, across clouds, it's always been a challenge. Snowflake always, you know, obviously trying to solve that problem, as are others. But what are your thoughts on the concept? >> Yeah, I think the concept fits, right? It is reflective of, it's a paradigm shift, right? Things, as a pendulum have swung back and forth between needing to piece together a bunch of different tools that have specific unique use cases and they're best in breed in what they do. And then focusing on the duct tape that holds 'em all together and all the engineering complexity and skill, it shifted from that end of the pendulum all the way back to, "Let's streamline this, let's simplify it. Maybe we have budget crunches and we need to consolidate tools or eliminate tools." And so then you kind of see this back and forth over time. And with data and analytics for instance, a lot of organizations were trying to bring the data closer to the business. That's where we saw self-service analytics coming in. 
And tools like Snowflake, what they did was they helped point to different databases, they helped unify data, and organize it in a single place that was, you know, in a sense neutral, away from a single cloud vendor or a single database, and allowed the business to kind of be more flexible in how it brought stuff together and provided it out to the business units. So Snowflake was an example of one of those times where we pulled back from the granular, multiple points of the spear, back to a simple way to do things. And I think Snowflake has continued to kind of keep that mantle to a degree, and we see other tools trying to do that, but that's all it is. It's a paradigm shift back to this kind of meta abstraction layer that kind of simplifies what is the reality, that you need a complex multi-use case, multi-region way of doing business. And it sort of reflects the reality of that. >> And you know, to me it's a spectrum. As part of Supercloud 2, we're talking to a number of practitioners, Ionis Pharmaceuticals, US West, we got Walmart. And it's a spectrum, right? In some cases the practitioner's saying, "You know, the way I solve multi-cloud complexity is mono-cloud, I just do one cloud." (laughs) Others like Walmart are saying, "Hey, you know, we actually are building an abstraction layer ourselves, take advantage of it." So my general question to both of you is, is this a concept, is the lack of standards across clouds, you know, really a problem, you know, or is supercloud a solution looking for a problem? Or do you hear from practitioners that "No, this is really an issue, we have to bring together a set of standards to sort of unify our cloud estates." >> Allow me to answer that at a higher level, and then we're going to hand it over to Dr. Brabham because he is a little bit more detailed on the real-time streaming analytics use cases, which I think is where we're going to get to. 
But to answer that question, it really depends on the size and the complexity of your business. At the very large enterprise, Dave, yes, a hundred percent. This needs to happen. There is complexity, there is not only complexity in the compute and actually deploying the applications, but the governance and the security around them. But for lower end or, you know, business use cases, and for smaller businesses, it's a little less necessary. You certainly don't need to have all of these. Some of the things that come into mind from the interviews that Daren and I have done are, you know, financial services, if you're doing real-time trading, anything that has real-time data metrics involved in your transactions, is going to be necessary. And another use case that we hear about is in online travel agencies. So I think it is very relevant, the complexity does need to be solved, and I'll allow Daren to explain a little bit more about how that's used from an analytics perspective. >> Yeah, go for it. >> Yeah, exactly. I mean, I think any modern, you know, multinational company that's going to have a footprint in the US and Europe, in China, or works in different areas like manufacturing, where you're probably going to have on-prem instances that will stay on-prem forever, for various performance reasons. You have these complicated governance and security and regulatory issues. So inherently, I think, large multinational companies and/or companies that are in certain areas like finance or in, you know, online e-commerce, or things that need real-time data, they inherently are going to have a very complex environment that's going to need to be managed in some kind of cleaner way. You know, they're looking for one door to open, one pane of glass to look at, one thing to do to manage these multi points. And, streaming's a good example of that. 
I mean, not every organization has a real-time streaming use case, and may not ever, but a lot of organizations do, a lot of industries do. And so there's this need to use, you know, they want to use open-source tools, they want to use Apache Kafka for instance. They want to use different megacloud vendors' offerings, like Google Pub/Sub or you know, Amazon Kinesis Firehose. They have all these different pieces they want to use for different use cases at different stages of maturity or proof of concept, you name it. They're going to have to have this complexity. And I think that's why we're seeing this need, to have sort of this supercloud concept, to juggle all this, to wrangle all of it. 'Cause the reality is, it's complex and you have to simplify it somehow. >> Great, thank you guys. All right, let's bring up the graphic, and take a look. Anybody who follows Breaking Analysis, which is co-branded with ETR, theCUBE Insights powered by ETR, knows we like to bring data to the table. ETR does amazing survey work every quarter, 1,200-plus, 1,500 practitioners that answer a number of questions. The vertical axis here is net score, which is ETR's proprietary methodology, which is a measure of spending momentum, spending velocity. And the horizontal axis here is overlap, but it's the presence, the pervasiveness in the dataset, the Ns. That table insert on the bottom right shows you how the dots are plotted, the net score and then the Ns in the survey. And what we've done is we've plotted a bunch of the so-called supercloud suspects, let's start in the upper right, the cloud platforms. Without these hyperscale clouds, you can't have a supercloud. And as always, Azure and AWS, up and to the right, it's amazing we're talking about, you know, 80 plus billion dollar company in AWS. Azure's business is, if you just look at the IaaS, in the 50 billion range, I mean it's just amazing to me the net scores here. Anything above 40% we consider highly elevated. 
And you got Azure and you got Snowflake, Databricks, HashiCorp, we'll get to them. And you got AWS, you know, right up there at that size, it's quite amazing. With really big Ns as well, you know, 700 plus Ns in the survey. So, you know, kind of half the survey actually has these platforms. So my question to you guys is, what are you seeing in terms of cloud adoption within the big three cloud players? I wonder if you could comment, maybe Erik, you could start. >> Yeah, sure. Now we're talking data, now I'm happy. So yeah, we'll get into some of it. Right now, the January, 2023 TSIS is approaching 1500 survey respondents. One caveat, it's not closed yet, it will close on Friday, but with an N that big we are over statistically significant. We also recently did a cloud survey, and there's a couple of key points on that I want to get into before we get into individual vendors. What we're seeing here, is that annual spend on cloud infrastructure is expected to grow at almost a 70% CAGR over the next three years. The percentage of those workloads for cloud infrastructure are expected to grow over 70% over three years as well. And as you mentioned, Azure and AWS are still dominant. However, we're seeing some share shift spreading around a little bit. Now to get into the individual vendors you mentioned about, yes, Azure is still number one, AWS is number two. What we're seeing, which is incredibly interesting, Cloudflare is number three. It's actually beating GCP. That's the first time we've seen it. What I do want to state, is this is on net score only, which is our measure of spending intentions. When you talk about actual pervasion in the enterprise, it's not even close. But from a spending velocity intention point of view, Cloudflare is now number three above GCP, and even Salesforce is creeping up to be at GCP's level. 
So what we're seeing here, is a continued domination by Azure and AWS, but some of these other players that maybe might fit into your moniker. And I definitely want to talk about Cloudflare more in a bit, but I'm going to stop there. But what we're seeing is some of these other players that fit into your Supercloud moniker, are starting to creep up, Dave. >> Yeah, I just want to clarify. So as you also know, we track IaaS and PaaS revenue and we try to extract, so AWS reports in its quarterly earnings, you know, they're just IaaS and PaaS, they don't have a SaaS play, a little bit maybe, whereas Microsoft and Google include their applications and so we extract those out and if you do that, AWS is bigger, but in the surveys, you know, customers, they see cloud, SaaS to them as cloud. So that's one of the reasons why you see, you know, Microsoft as larger in pervasion. If you bring up that survey again, Alex, the survey results, you see them further to the right and they have higher spending momentum, which is consistent with what you see in the earnings calls. Now, interesting about Cloudflare because the CEO of Cloudflare actually, and Cloudflare itself uses the term supercloud basically saying, "Hey, we're building a new type of internet." So what are your thoughts? Do you have additional information on Cloudflare, Erik, that you want to share? I mean, you've seen them pop up. I mean this is a really interesting company that is pretty forward thinking and vocal about how it's disrupting the industry. >> Sure, we've been tracking 'em for a long time, and even from the disruption of just a traditional CDN where they took down Akamai and what they're doing. But for me, the definition of a true supercloud provider can't just be one instance. You have to have multiple. So it's not just the cloud, it's networking aspect on top of it, it's also security. And to me, Cloudflare is the only one that has all of it. 
That they actually have the ability to offer all of those things. Whereas you look at some of the other names, they're still piggybacking on the infrastructure or platform as a service of the hyperscalers. Cloudflare does not need to, they actually have the cloud, the networking, and the security all themselves. So to me that lends credibility to their own internal usage of that moniker Supercloud. And also, again, just what we're seeing right here that their net score is now creeping above GCP really does state it. And then just one real last thing, one of the other things we do in our surveys is we track adoption and replacement reasoning. And when you look at Cloudflare's adoption rate, which is extremely high, it's based on technical capabilities, the breadth of their feature set, it's also based on what we call the ability to avoid stack alignment. So those are again, really supporting reasons that makes Cloudflare a top candidate for your moniker of supercloud. >> And they've also announced an object store (chuckles) and a database. So, you know, that's going to be, it takes a while as you well know, to get database adoption going, but you know, they're ambitious and going for it. All right, let's bring the chart back up, and I want to focus Daren in on the ecosystem now, and really, we've identified Snowflake and Databricks, it's always fun to talk about those guys, and there are a number of other, you know, data platforms out there, but we use those two as really proxies for leaders. We got a bunch of the backup guys, the data protection folks, Rubrik, Cohesity, and Veeam. They're sort of in a cluster, although Rubrik, you know, ahead of those guys in terms of spending momentum. And then VMware, Tanzu and Red Hat as sort of the cross cloud platform. But I want to focus, Daren, on the data piece of it. We're seeing a lot of activity around data sharing, governed data sharing. 
Databricks is using Delta Sharing as their sort of play, Snowflake is sort of this walled garden like the app store. What are your thoughts on, you know, in the context of Supercloud, cross cloud capabilities for the data platforms? >> Yeah, good question. You know, I think Databricks is an interesting player because they sort of have made some interesting moves, with their Data Lakehouse technology. So they're trying to kind of complicate, or not complicate, they're trying to take away the complications of, you know, the downsides of data warehousing and data lakes, and trying to find that middle ground, where you have the benefits of a managed, governed, you know, data warehouse environment, but you have sort of the lower cost, you know, capability of a data lake. And so, you know, Databricks has become really attractive, especially to data scientists, right? We've been tracking them in the AI machine learning sector for quite some time here at ETR, attractive for a data scientist because it looks and acts like a lake, but can have some managed capabilities like a warehouse. So it's kind of the best of both worlds. So in some ways I think you've seen sort of a data science driver for the adoption of Databricks that has now become a little bit more mainstream across the business. Snowflake, maybe the other direction, you know, it's a cloud data warehouse that you know, is starting to expand its capabilities and add on new things, like Streamlit is a good example in the analytics space, with apps. So you see these tools starting to branch and creep out a bit, but they offer that sort of neutrality, right? We heard one IT decision maker we recently interviewed that referred to Snowflake and Databricks as the quote unquote Switzerland of what they do. And so there's this desirability from an organization to find these tools that can solve the complex multi-headed use-case of data and analytics, which every business unit needs in different ways. 
And figure out a way to do that, an elegant way that's governed and centrally managed, that federated kind of best of both worlds that you get by bringing the data close to the business while having a central governed instance. So these tools are incredibly powerful and I think there's only going to be room for growth, for those two especially. I think they're going to expand and do different things and maybe, you know, join forces with others and a lot of the power of what they do well is trying to define these connections and find these partnerships with other vendors, and try to be seen as the nice add-on to your existing environment that plays nicely with everyone. So I think that's where those two tools are going, but they certainly fit this sort of label of, you know, trying to be that supercloud neutral, you know, layer that unites everything. >> Yeah, and if you bring the graphic back up, please, there's obviously big data plays in each of the cloud platforms, you know, Microsoft, big database player, AWS has, you know, 11, 12, 15 data stores. And of course, you know, BigQuery and other, you know, data platforms within Google. But you know, I'm not sure the big cloud guys are going to go hard after so-called supercloud, cross-cloud services. Although, we see Oracle getting in bed with Microsoft and Azure, with a database service that is cross-cloud, certainly Google with Anthos and you know, you never say never with AWS. I guess what I would say guys, and I'll leave you with this is that, you know, just like all players today are cloud players, I feel like anybody in the business or most companies are going to be so-called supercloud players. In other words, they're going to have a cross-cloud strategy, they're going to try to build connections if they're coming from on-prem like a Dell or an HPE, you know, or Pure or you know, many of these other companies, Cohesity is another one. 
They're going to try to connect to their on-premises estates, of course, and create a consistent experience. It's natural that they're going to have sort of some consistency across clouds. You know, the big question is, what's that spectrum look like? I think on the one hand you're going to have some, you know, maybe some rudimentary, you know, instances of supercloud or maybe they just run on the individual clouds versus where Snowflake and others and even beyond that are trying to go with a single global instance, basically building out what I would think of as their own cloud, and importantly their own ecosystem. I'll give you guys the last thought. Maybe you could each give us, you know, closing thoughts. Maybe Daren, you could start and Erik, you could bring us home on just this entire topic, the future of cloud and data. >> Yeah, I mean I think, you know, two points to make on that is, this question of these, I guess what we'll call legacy on-prem players. These mega vendors that have been around a long time, have big on-prem footprints and a lot of people have them for that reason. I think it's foolish to assume that a company, especially a large, mature, multinational company that's been around a long time, it's foolish to think that they can just uproot and leave on-premises entirely full scale. There will almost always be an on-prem footprint from any company that was not, you know, natively born in the cloud after 2010, right? I just don't think that's reasonable anytime soon. I think there's some industries that need on-prem, things like, you know, industrial manufacturing and so on. So I don't think on-prem is going away, and I think vendors that are going to, you know, go very cloud forward, very big on the cloud, if they neglect having at least decent connectors to on-prem legacy vendors, they're going to miss out. 
So I think that's something that these players need to keep in mind is that they continue to reach back to some of these players that have big footprints on-prem, and make sure that those integrations are seamless and work well, or else their customers will always have a multi-cloud or hybrid experience. And then I think a second point here about the future is, you know, we talk about the three big, you know, cloud providers, the Google, Microsoft, AWS as sort of the opposite of, or different from this new supercloud paradigm that's emerging. But I want to kind of point out that, they will always try to make a play to become that and I think, you know, we'll certainly see someone like Microsoft trying to expand their licensing and expand how they play in order to become that supercloud provider for folks. So also don't want to downplay them. I think you're going to see those three big players continue to move, and take over what players like Cloudflare are doing and try to, you know, cut them off before they get too big. So, keep an eye on them as well. >> Great points, I mean, I think you're right, the first point, if you're Dell, HPE, Cisco, IBM, your strategy should be to make your on-premises estate as cloud-like as possible and you know, make those differences as minimal as possible. And you know, if you're a customer, then the business case is going to be low for you to move off of that. And I think you're right. I think the cloud guys, if this is a real problem, the cloud guys are going to play in there, and they're going to make some money at it. Erik, bring us home please. >> Yeah, I'm going to revert back to our data and this on the macro side. So to kind of support this concept of a supercloud right now, you know Dave, you and I know, we check overall spending and what we're seeing right now is total year spend is expected to only be 4.6%. We ended 2022 at 5% even though it began at almost eight and a half. 
So this is clearly declining and in that environment, we're seeing the top two strategies to reduce spend are actually vendor consolidation with 36% of our respondents saying they're actively seeking a way to reduce their number of vendors, and consolidate into one. That's obviously supporting a supercloud type of play. Number two is reducing excess cloud resources. So when I look at both of those combined, with a drop in the overall spending reduction, I think you're on the right thread here, Dave. You know, the overall macro view that we're seeing in the data supports this happening. And if I can real quick, couple of names we did not touch on that I do think deserve to be in this conversation, one is HashiCorp. HashiCorp is the number one player in our infrastructure sector, with a 56% net score. It does multiple things within infrastructure and it is completely agnostic to your environment. And if we're also speaking about something that's just a singular feature, we would look at Rubrik for data, backup, storage, recovery. They're not going to offer you your full cloud or your networking of course, but if you are looking for your backup, recovery, and storage, Rubrik is also number one in that sector with a 53% net score. Two other names that deserve to be in this conversation as we watch it move and evolve. >> Great, thank you for bringing that up. Yeah, we had both of those guys in the chart and I failed to focus in on HashiCorp. And clearly a Supercloud enabler. All right guys, we got to go. Thank you so much for joining us, appreciate it. Let's keep this conversation going. >> Always enjoy talking to you Dave, thanks. >> Yeah, thanks for having us. >> All right, keep it right there for more content from Supercloud 2. This is Dave Vellante for John Furrier and the entire CUBE team. We'll be right back. (gentle synth music) (music fades)

Published Date : Feb 17 2023

ENTITIES

Entity	Category	Confidence
IBM	ORGANIZATION	0.99+
Cisco	ORGANIZATION	0.99+
Erik	PERSON	0.99+
Dell	ORGANIZATION	0.99+
Microsoft	ORGANIZATION	0.99+
Google	ORGANIZATION	0.99+
John Ferg	PERSON	0.99+
Dave	PERSON	0.99+
Walmart	ORGANIZATION	0.99+
Erik Bradley	PERSON	0.99+
David	PERSON	0.99+
AWS	ORGANIZATION	0.99+
Dave Valente	PERSON	0.99+
January, 2023	DATE	0.99+
China	LOCATION	0.99+
US	LOCATION	0.99+
HPE	ORGANIZATION	0.99+
50 billion	QUANTITY	0.99+
Ionis Pharmaceuticals	ORGANIZATION	0.99+
Darren Brabham	PERSON	0.99+
56%	QUANTITY	0.99+
4.6%	QUANTITY	0.99+
Europe	LOCATION	0.99+
Oracle	ORGANIZATION	0.99+
53%	QUANTITY	0.99+
36%	QUANTITY	0.99+
Tanzu	ORGANIZATION	0.99+
Darren	PERSON	0.99+
1200	QUANTITY	0.99+
Red Hat	ORGANIZATION	0.99+
VMware	ORGANIZATION	0.99+
Amazon	ORGANIZATION	0.99+
Friday	DATE	0.99+
Rubric	ORGANIZATION	0.99+
last year	DATE	0.99+
two sides	QUANTITY	0.99+
Databricks	ORGANIZATION	0.99+
5%	QUANTITY	0.99+
Cohesity	ORGANIZATION	0.99+
two tools	QUANTITY	0.99+
Veeam	ORGANIZATION	0.99+
CloudFlare	TITLE	0.99+
two	QUANTITY	0.99+
both	QUANTITY	0.99+
2022	DATE	0.99+
One	QUANTITY	0.99+
Daren Brabham	PERSON	0.99+
three years	QUANTITY	0.99+
TSIS	ORGANIZATION	0.99+
Brabham	PERSON	0.99+
CloudFlare	ORGANIZATION	0.99+
1500 survey respondents	QUANTITY	0.99+
second point	QUANTITY	0.99+
first point	QUANTITY	0.98+
Snowflake	TITLE	0.98+
one	QUANTITY	0.98+
Supercloud	ORGANIZATION	0.98+
ETR	ORGANIZATION	0.98+
Snowflake	ORGANIZATION	0.98+
Akamai	ORGANIZATION	0.98+

Is Data Mesh the Killer App for Supercloud | Supercloud2

(gentle bright music) >> Okay, welcome back to our "Supercloud 2" event live coverage here at stage performance in Palo Alto syndicating around the world. I'm John Furrier with Dave Vellante. We've got exclusive news and a scoop here for SiliconANGLE and theCUBE. Zhamak Dehghani, creator of data mesh, has formed a new company called NextData, NextData.com. She's a CUBE alumni and contributor to our Supercloud initiative, as well as our coverage and Breaking Analysis with Dave Vellante on data, the killer app for Supercloud. Zhamak, great to see you. Thank you for coming into the studio and congratulations on your newly formed venture and continued success on the data mesh. >> Thank you so much. It's great to be here. Great to see you in person. >> Dave: Yeah, finally. >> John: Wonderful. Your contributions to the data conversation have been well-documented certainly by us and others in the industry. Data mesh taking the world by storm. Some people are debating it, throwing, you know, cold water on it. Some think it's the next big thing. Tell us about the data mesh super data apps that are emerging out of cloud. >> I mean, data mesh, as you said, it's, you know, the pain points that it surfaced were universal. Everybody said, "Oh, why didn't I think of that?" You know, it was just an obvious next step and people are approaching it, implementing it. I guess the last few years, I've been involved in many of those implementations, and I guess Supercloud is somewhat a prerequisite for it because data mesh and building applications using data mesh is about sharing data responsibly across boundaries. And those boundaries include organizational boundaries, cloud technology boundaries, and trust boundaries. >> I want to bring that up because your venture, NextData, which is new, just formed. Tell us about that. What wave is that riding? What specifically are you targeting? What's the pain point? >> Zhamak: Absolutely, yes. 
So NextData is the result of, I suppose, the pains that I suffered from implementing data mesh for many of the organizations. Basically, a lot of organizations that I've worked with, they want decentralized data. So they really embrace this idea of decentralized ownership of the data, but yet they want interconnectivity through standard APIs, yet they want discoverability and governance. So they want to have policies implemented, they want to govern that data, they want to be able to discover that data and yet they want to decentralize it. And we do that with a developer experience that is easy and native to a generalist developer. So we try to find, I guess, the common denominator that solves those problems and enables that developer experience for data sharing. >> John: Since you just announced the news, what's been the reaction? >> Zhamak: I just announced the news right now, so what's the reaction? >> John: But people in the industry that know you, you did a lot of work in the area. What has been some of the feedback on the new venture in terms of the approach, the customers, the problem? >> Yeah, so we've been in stealth mode, so we haven't publicly talked about it, but folks that have been close to us in fact have reached out. We already have implementations of our pilot platform with early customers, which is super exciting. And we're going to have multiple of those. Of course, we're a tiny, tiny company. We can't have many of those, but we are going to have multiple pilot implementations of our platform in the real world, with real, global, large-scale organizations that have real-world problems. So we're not going to build our platform in a vacuum. And that's what's happening right now. >> When I think about your role at ThoughtWorks, you had a very wide observation space with a number of clients, helping them implement data mesh and other things as well prior to your data mesh initiative. 
But when I look at data mesh, at least the ones that I've seen, they're very narrow. I think of JPMC, I think of HelloFresh. Generally, and not surprisingly, they don't include the big vision of inclusivity across clouds and across different data stores. But it seems like people are having to go through some gymnastics to get to, you know, the organizational reality of decentralizing data, and at least pushing data ownership to the line of business. How are you approaching, or are you approaching, solving that problem? Are you taking a narrow slice? What can you tell us about NextData? >> Zhamak: Sure, yeah, absolutely. Gymnastics is a cute word to describe what the organizations have to go through. And one of those problems is that, you know, the data, as you know, resides on different platforms. It's owned by different people, it's processed by pipelines that, who knows who owns them. So there's this very disparate and disconnected set of technologies that were very useful for when we thought about data and processing as a centralized problem. But when you think about data as a decentralized problem, the cost of integrating these technologies into a cohesive developer experience is high, and that experience is what's missing. And we want to focus on that cohesive end-to-end developer experience to share data responsibly in these autonomous units; we call them data products, I guess, in data mesh, right? That includes the computation, the policies that govern that data, and discoverability. So I guess, I heard this expression in the last talks, that you can have your cake and eat it too. So we want people to have their cake, which is, you know, data in different places, decentralization, and eat it too, which is interconnected access to it. So we start with standardizing and codifying this idea of a data product container that encapsulates data, computation, and APIs to get to it in a technology-agnostic way, in an open way. 
And then, sit on top and use existing tech, you know, Snowflake, Databricks, whatever exists, you know, the millions of dollars of investments that companies have made, sit on top of those but create this cohesive, integrated experience where the data product is a first-class primitive. And that's really key here, that the language and the modeling that we use is really native to data mesh: I will make a data product, I'm sharing a data product, and that encapsulates providing metadata about this. I'm providing computation that's constantly changing the data. I'm providing the API for that. So we're trying to kind of codify and create a new developer experience based on that. And developers, both on the provider side and the user side, are connected through peer-to-peer data sharing with the data product as a primitive, first-class concept. >> Okay, so the idea would be developers would build applications leveraging those data products, which are discoverable and governed. Now, today you see some companies, you know, take Snowflake for example. >> Zhamak: Yeah. >> Attempting to do that within their own little walled garden. They even, at one point, used the term, "Mesh." I don't know if they pulled back on that. And then they sort of became aware of some of your work. But a lot of the things that they're doing within their little insulated environment, you know, support that, you know, governance; they're building out an ecosystem. What's different in your vision? >> Exactly. So we realize that, you know, and this is a reality, like you go to organizations, they have Snowflake and half of the organization happily operates on Snowflake. And the other half says, oh, we are on, you know, bare infrastructure on AWS, or we are on Databricks. This is the reality, you know, this Supercloud that's written up here. It's about working across boundaries of technology. So we try to embrace that. 
And even for our own technology, with the way we're building it, we say, "Okay, not everybody's going to use the NextData mesh operating system. People will have different platforms." So you have to build with openness in mind, and in the case of Snowflake, I think, you know, they have, I'm sure, very happy customers as long as customers can be on Snowflake. But once you cross that boundary of platforms, then that becomes a problem. And we try to keep that in mind in our solution. >> So, it's worth reviewing that basically, the concept of data mesh is that, whether you're a data lake or a data warehouse, an S3 bucket, an Oracle database as well, they should all be included inside of the data mesh. >> We did a session with AWS on the startup showcase, data as code. And remember, I wrote a blog post in 2007 called, "Data's the new developer kit." Back then, they used to call 'em developer kits, if you remember. And we said at that time, whoever can code data >> Zhamak: Yes. >> Will have a competitive advantage. >> Aren't the machines going to be doing that? Didn't we just hear that? >> Well we have, and you know, Hey Siri, hey Cube. Find me the best video for data mesh. There it is. I mean, this is the point, like what's happening is that, now, data has to be addressable >> Zhamak: Yes. >> For machines and for coding. >> Zhamak: Yes. >> Because you need to call the data. So the question is, how do you manage the complexity of making things as promiscuous as possible, making it available, as well as then governing it, because it's a trade-off. The more you make open >> Zhamak: Definitely. >> The better the machine learning. >> Zhamak: Yes. >> But yet, the governance issue, so this is the, you need an OS to handle this maybe. >> Yes, well, our mental model for our platform is an OS, an operating system. 
Operating systems, you know, have shown us how you can kind of abstract what's complex and take care of, you know, a lot of complexities, but yet provide an open and, you know, dynamic enough interface. So we think about it that way. We try to solve the problem so that policies live with the data. And enforcement of the policies happens at the most granular level, which is, in this concept, the data product. And that would happen whether you read, write, or access a data product. But we can never imagine what these policies could be. So our thinking is, okay, we should have an open policy framework that can allow organizations to write their own policy drivers and policy definitions, and encode and encapsulate them in this data product container. But I'm not going to fool myself to say that, you know, that's going to solve the problem that you just described. I think we are in this, I don't know, if I look into my crystal ball, what I think might happen is that right now, the primitives that we work with to train machine-learning models are still bits and bytes of data. They're fields, rows, columns, right? And that creates quite a large surface area, an attack area for, you know, the privacy of the data. So perhaps, one of the trends that we might see is this evolution of data APIs to become more and more computationally aware, to bring the compute to the data, to reduce that surface area so you can really leave the control of the data to the sovereign owners of that data, right? So that data product. So I think the evolution of our data APIs perhaps will become more and more computational. So you describe what you want, and the data owner decides, you know, how to manage the- >> John: That's interesting, Dave, 'cause it's almost like we just talked about ChatGPT in the last segment with a guest who's in machine learning and has really been around the industry. It's almost as if you're starting to see reason come into the data, reasoning. 
It's like you're starting to see not just metadata, but using the data to reason so that you don't have to expose the raw data. It's almost like a, I won't say curation layer, but an intelligence layer. >> Zhamak: Exactly. >> Can you share your vision on that, 'cause that seems to be where the dots are connecting. >> Zhamak: Yes, this is perhaps further into the future because just from where we stand, we still have to create that bridge of familiarity between that future and the present. So we are still in that bridge-making mode. However, just by the basic notion of saying, "I'm going to put an API in front of my data," that API today might be as primitive as a level of indirection, as in: you tell me what you want, tell me who you are, let me go process that, apply all the policies and lineage, and insert all of this intelligence that needs to happen. And then, today, I will still give you a file. But by just defining that API and standardizing it, now we have this amazing extension point where we can say, "Well, in the next revision of this API, you don't just tell me who you are, but you actually tell me what intelligence you're after. What's the logic that I need to go and now compute on your API?" And you can kind of evolve that, right? Now you have a point of evolution to this very futuristic, I guess, future where you just describe the question that you're asking from the chat. >> Well, this is the Supercloud, Dave. >> I have a question from a fan, I got to get it in. It's George Gilbert. And so, his question is, you're blowing away the way we synchronize data from operational systems to the data stack to applications. So the concern that he has, and he wants your feedback on this, is: do the data product app devs get exposed to more complexity with respect to moving data between data products, or maybe its attributes between data products? How do you respond to that? How do you see it? Is that a problem, or is that something that is overstated, or do you have an answer for that? 
>> Zhamak: Absolutely. So I think there's a sweet spot in getting data developers, data product developers, closer to the app, but yet not burdening them with the complexity of the application and application logic, and yet reducing their cognitive load by localizing what they need to know about, which is that domain where they're operating within. Because what's happening right now is that data engineers, a ton of empathy for them for the high threshold of pain that they can, you know, deal with, they have been centralized, they've been put into the data team, and they have been given this unbelievable task of: make meaning out of data, put semantics over it, curate it, clean it, and so on. So what we are saying is, get those folks embedded into the domain, closer to the application developers. These are still separately moving units. Your app and your data products are independent but yet tightly coupled with each other based on the context of the domain. So reduce cognitive load by localizing what they need to know about to the domain, get them closer to the application, but yet have them separate from the app, because the app provides a very different service: transactional data for my e-commerce transaction. The data product provides a very different service: longitudinal data for the, you know, variety of intelligent analysis that I can do on the data. But yet, it's all within the domain of e-commerce or sales or whatnot. >> So a lot of decoupling and coupling create that cohesiveness. >> Zhamak: Absolutely. >> Architecture. So I have to ask you, this is an interesting question 'cause it came up on theCUBE all last year. Back in the old server, data center days and cloud, SRE: Google coined the term "Site Reliability Engineer" for someone to look over the hundreds of thousands of servers. We asked a question to the data engineering community, who have been suffering, by the way, I agree. 
Is there an SRE-like role for data? Because in a way, data engineering, that platform engineer, they are like the SRE for data. In other words, managing the large scale to enable automation and self-service. What are your thoughts and reaction to that? >> Zhamak: Yes, exactly. So, maybe we go through that history of how SRE came to be. So we had the first DevOps movement which was, remove the wall between dev and ops and bring them together. So you have one cross-functional unit of the organization that's responsible for, you build it, you run it, right? So then there is no, I'm going to just shoot my application over the wall for somebody else to manage it. So we did that, and then we said, "Okay, as we decentralized and had this many microservices running around, we had to create a layer that abstracted a lot of the complexity around monitoring, observing, and running a lot of services while giving autonomy to these cross-functional teams." And that's where SRE, a new generation of engineers, came to exist. So I think if I just look- >> Hence Borg, hence Kubernetes. >> Hence, hence, exactly. Hence chaos engineering, hence embracing the complexity and messiness, right? And putting in engineering discipline to embrace that and yet give a cohesive and high-integrity experience of those systems. So I think, if we look at that evolution, perhaps something like that is happening by bringing data and apps closer and making them these domain-oriented data product teams, or domain-oriented cross-functional teams, full stop, and still having a very advanced, maybe at the platform infrastructure level, kind of operational team, so that they're not busy doing two jobs, which is taking care of domains and the infrastructure, but they're building infrastructure that is embracing that complexity, the interconnectivity of these data products. >> John: So you see similarities. >> Absolutely, but I feel like we're probably in the earlier days of that movement. 
>> So it's a data DevOps kind of thing happening where scale's happening. Good things are happening, yet it's, eh, a little bit fast and loose, with some complexities to clean up. >> Yes, yes. This is a different restructure. As you said, you know, the job of this industry as a whole, of architects, is decompose, recompose, decompose, recompose in a new way, and now we're like decomposing centralized teams, recomposing them as domains and- >> John: So is data mesh the killer app for Supercloud? >> You had to do this to me. >> Dave: Sorry, I couldn't- (John and Dave laughing) >> Zhamak: What do you want me to say, Dave? >> John: Yes. >> Zhamak: Yes of course. >> I mean Supercloud, I think it's, really the terminology's Supercloud, Opencloud. But I think, in the spirit of it, this embracing of diversity and giving autonomy for people to make decisions for what's right for them and not lock them in, I think just embracing that is baked into how data mesh assumes the world would work. >> John: Well thank you so much for coming on Supercloud 2, really appreciate it. Data has driven this conversation. Your success with data mesh has really opened up the conversation and exposed the slow-moving data industry. >> Dave: Been a great catalyst. (John laughs) >> John: That's now going well. We can move faster, so thanks for coming on. >> Thank you for hosting me. It was wonderful. >> Okay, Supercloud 2 live here in Palo Alto. Our stage performance, I'm John Furrier with Dave Vellante. We're back with more after this short break. Stay with us all day for Supercloud 2. (gentle bright music)
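
The data product container discussed in this segment (data, metadata, and policies packaged together, with enforcement on every access, and an API that can later accept a computation instead of handing back raw rows) can be roughly sketched as follows. This is only an illustration under stated assumptions: the names `DataProduct`, `Request`, `Policy`, and `analytics_only` are hypothetical and are not taken from NextData's platform or any real library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names in this sketch are hypothetical; none come from a real product.


@dataclass(frozen=True)
class Request:
    caller: str   # who is asking
    purpose: str  # what they intend to do with the data


# A policy is any function that approves or rejects a request.
Policy = Callable[[Request], bool]


@dataclass
class DataProduct:
    name: str
    metadata: Dict[str, str]
    rows: List[dict]
    policies: List[Policy] = field(default_factory=list)

    def _check(self, request: Request) -> None:
        # Enforcement happens at the data product, on every access.
        for policy in self.policies:
            if not policy(request):
                raise PermissionError(f"{request.caller} denied on {self.name}")

    def read(self, request: Request) -> List[dict]:
        """Today's API: tell me who you are, and I hand back the data."""
        self._check(request)
        # Return copies so callers cannot mutate the product's own rows.
        return [dict(row) for row in self.rows]

    def compute(self, request: Request,
                question: Callable[[List[dict]], float]) -> float:
        """A possible later revision: bring the compute to the data and
        return only the answer, not the raw rows."""
        self._check(request)
        return question([dict(row) for row in self.rows])


def analytics_only(request: Request) -> bool:
    # Example policy written by the owning domain (an assumption).
    return request.purpose == "analytics"


orders = DataProduct(
    name="ecommerce.orders",
    metadata={"owner": "sales-domain"},
    rows=[{"order_id": 1, "total": 42.0}, {"order_id": 2, "total": 8.0}],
    policies=[analytics_only],
)
```

A caller with an allowed purpose could then ask `orders.compute(Request("bi-team", "analytics"), lambda rows: sum(r["total"] for r in rows))` and receive only the aggregate, while a request with any other purpose raises `PermissionError`.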

Published Date : Feb 17 2023

Discussion about Walmart's Approach | Supercloud2

(upbeat electronic music) >> Okay, welcome back to Supercloud 2, live here in Palo Alto. I'm John Furrier, with Dave Vellante. Again, all day wall-to-wall coverage, just had a great interview with Walmart, we've got a Next interview coming up, you're going to hear from Bob Muglia and Tristan Handy, two experts, both experienced entrepreneurs, executives in technology. We're here to break down what just happened with Walmart, and what's coming up, with George Gilbert, former colleague, Wikibon analyst, Gartner analyst, and now independent investor and expert. George, great to see you, I know you're following this space. Like you read about it, remember the first days when Databricks came out, we were talking about them coming out of Berkeley? >> Dave: Snowflake. >> John: Snowflake. >> Dave: Snowflake in the early days. >> We, collectively, have been chronicling the data movement since 2010, you were part of our team, now you've got your nose to the grindstone, you're seeing the next wave. What's this all about? Walmart building their own super cloud, we got Bob Muglia talking about how this next wave of apps is coming. What are the super apps? What's the super cloud to you? >> Well, this keys off Dave's really interesting question to Walmart, which was like, how are they building their supercloud? 'Cause it makes a concrete example. But what was most interesting about his description of the Walmart WCNP, I forgot what it stood for. >> Dave: Walmart Cloud Native Platform. >> Walmart, okay. He was describing where the logic could run in these stateless containers, and maybe eventually serverless functions. But that's just it, and that's the paradigm of microservices, where the logic is in this stateless thing, where you can shoot it, or it fails, and you can spin up another one, and you've lost nothing. >> That was their triplet model. >> Yeah, in fact, and that was what they were trying to move to, where these things move fluidly between data centers. 
>> But there's a but, right? Which is they're all stateless apps in the cloud. >> George: Yeah. >> And all their stateful apps are on-prem and VMs. >> Or the stateful part of the apps are in VMs. >> Okay. >> And so if they really want to lift their super cloud layer off of these different providers' infrastructure, they're going to need a much more advanced software platform that manages data. And that goes to the -- >> Muglia and Handy, that you and I did, that's coming up next. So the big takeaway there, George, was, I'll set it up and you can chime in, a new breed of data apps is emerging, and this highly decentralized infrastructure. And Tristan Handy of DBT Labs has sort of a solution to begin the journey today, Muglia is working on something that's way out there, describe what you learned from it. >> Okay. So to talk about what the new data apps are, and then the platform to run them, I go back to using what will probably be seen as one of the first data app examples, which was Uber, where you're describing entities in the real world, riders, drivers, routes, city, like a city plan, these are all defined by data. And the data is described in a structure called a knowledge graph, for lack of a, no one's come up with a better term. But that means the stuff that Jack built, which was all stateless and sits above cloud vendors' infrastructure, it needs an entirely different type of software that's much, much harder to build. And the way Bob described it is, you're going to need an entirely new data management infrastructure to handle this. 
But where, you know, we had this really colorful interview where it was like Rock 'Em Sock 'Em, but they weren't really that much in opposition to each other, because Tristan is going to define this layer, starting with like business intelligence metrics, where you're defining things like bookings, billings, and revenue, in business terms, not in SQL terms -- >> Well, business terms, if I can interrupt, he said the one thing we haven't figured out how to APIify is KPIs that sit inside of a data warehouse, and that's essentially what he's doing. >> George: That's what he's doing, yes. >> Right. And so then you can now expose those KPIs, which sit inside of a data warehouse, or a data lake, a data store, whatever, through APIs. >> George: And the difference -- >> So what does that do for you? >> Okay, so all of a sudden, instead of working in technical data terms, where you're dealing with tables and columns and rows, you're dealing instead with business entities, using the Uber example of drivers, riders, routes, you know, ETAs, prices. But you can define, DBT will be able to define those progressively in richer terms, today they're just doing things like bookings, billings, and revenue. But Bob's point was, today, the data warehouse that actually runs that stuff, whereas DBT defines it, the data warehouse that runs it, you can't do it with relational technology >> Dave: Relational technology, caching architecture. >> SQL, you can't -- >> SQL caching architectures in memory, you can't do it, you've got to rethink down to the way the data lake is laid out on the disk or cache. Which by the way, Thomas Hazel, who's speaking later, he's the chief scientist and founder at Chaos Search, he says, "I've actually done this," basically leave it in an S3 bucket, and I'm going to query it, you know, with no caching. 
>> All right, so what I hear you saying then, tell me if I got this right, there are some things that are inadequate in today's world, that are not compatible with the Supercloud wave. >> Yeah. >> Specifically how you're using storage, and data, and stateful. >> Yes. >> And then the software that makes it run, is that what you're saying? >> George: Yeah. >> There's one other thing you mentioned to me, it's like, when you're using a CRM system, a human is inputting data. >> George: Nothing happens till the human does something. >> Right, nothing happens until that data entry occurs. What you're talking about is a world that self-forms, pulling data from the transaction system, or the ERP system, and then builds a plan without human intervention. >> Yeah. Something in the real world happens, where the user says, "I want a ride." And then the software goes out and says, "Okay, we got to match a driver to the rider, we got to calculate how long it takes to get there, how long to deliver 'em." That's not driven by a form, other than the first person hitting a button and saying, "I want a ride." All the other stuff happens autonomously, driven by data and analytics. >> But my question was different, Dave, so I want to get specific, because this is where the startups are going to come in, this is the disruption. Snowflake is a data warehouse that's in the cloud, they call it a data cloud, they refactored it, they did it differently, the success, we all know what it looks like. These areas where it's inadequate for the future are areas that'll probably be either disrupted, or refactored. What is that? >> That's what Muglia's contention is, that DBT can start adding that layer where you define these business entities, they're like mini digital twins, you can define them, but the data warehouse isn't strong enough to actually manage and run them. 
And Muglia is behind a company that is rethinking the database, really in a fundamental way that hasn't been done in 40 or 50 years. It's the first, in his contention, the first real rethink of database technology in a fundamental way since the rise of the relational database 50 years ago. >> And I think you admit it's a real Hail Mary, I mean it's quite a long shot, right? >> George: Yes. >> Huge potential. >> But they're pretty far along. >> Well, we've been talking on theCUBE for 12 years, and what, 10 years going to AWS re:Invent, Dave, that no one database will rule the world, Amazon kind of showed that with them. What's different? Is it that databases are changing, or that you can have multiple databases, or? >> It's a good question. We've had multiple different types of databases, each one specialized for a different type of workload, but actually what Muglia is behind is a new engine that would essentially, you'll never get rid of the data warehouse, or the equivalent engine in like a Databricks data lakehouse, but it's a new engine that manages the thing that describes all the data and holds it together, and that's the new application platform. >> George, we have one minute left, I want to get a real quick thought, you're an investor, and we know your history, and for the folks watching, George's got a deep pedigree in investment data, and we can testify to that. If you're going to invest in a company right now, if you're a customer, I got to make a bet, what does success look like for me, what do I want walking through my door, and what do I want to send out? What companies do I want to look at? What kind of vendor do I want to evaluate? Which ones do I want to send home? 
>> Well, the first thing a customer really has to do when they're thinking about next gen applications, all the people have told you guys, "we got to get our data in order," getting that data in order means building an integrated view of all your data landscape, which is data coming out of all your applications. It starts with the data model, so, today, you basically extract data from all your operational systems, put it in this one giant, central place, like a warehouse or lake house, but eventually you want this, whether you call it a fabric or a mesh, it's all the data that describes how everything hangs together, as in one big knowledge graph. There are different ways to implement that. And that's the most critical thing, 'cause that describes your Uber landscape, your Uber platform. >> That's going to power the digital transformation, which will power the business transformation, which powers the business model, which allows the builders to build -- >> Yes. >> Coders to code. That's the Supercloud application. >> Yeah. >> George, great stuff. The next interview you're going to see right here is Bob Muglia and Tristan Handy, they're going to unpack this new wave. Great segment, really worth unpacking and reading between the lines with George, and Dave Vellante, and those two great guests. And then we'll come back here to the studio for more of the live coverage of Supercloud 2. Thanks for watching. (upbeat electronic music)
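
The idea raised in this segment, defining KPIs such as bookings and billings once in business terms and exposing them through an API instead of having every consumer query tables directly, might be sketched like this. It is a minimal illustration only: the registry, the `metric` decorator, `get_metric`, and the sample rows are all assumptions made for the example and do not represent dbt's actual interface.

```python
from typing import Callable, Dict, List

# Hypothetical rows, standing in for a table in a warehouse or data lake.
ORDERS: List[dict] = [
    {"status": "booked", "amount": 100.0},
    {"status": "billed", "amount": 60.0},
    {"status": "booked", "amount": 40.0},
]

# A metric is a named function from rows to a number.
Metric = Callable[[List[dict]], float]
METRICS: Dict[str, Metric] = {}


def metric(name: str) -> Callable[[Metric], Metric]:
    """Register a KPI under a business-facing name."""
    def register(fn: Metric) -> Metric:
        METRICS[name] = fn
        return fn
    return register


@metric("bookings")
def bookings(rows: List[dict]) -> float:
    return sum(r["amount"] for r in rows if r["status"] == "booked")


@metric("billings")
def billings(rows: List[dict]) -> float:
    return sum(r["amount"] for r in rows if r["status"] == "billed")


def get_metric(name: str, rows: List[dict] = ORDERS) -> float:
    """The 'API': callers ask for a KPI by name, not for tables and columns."""
    if name not in METRICS:
        raise KeyError(f"unknown metric: {name}")
    return METRICS[name](rows)
```

A consumer would call `get_metric("bookings")` without knowing which table or status column sits underneath; changing the definition in one place changes it for every caller.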

Published Date : Feb 17 2023

AWS Startup Showcase S3E1

(upbeat electronic music) >> Hello everyone, welcome to this CUBE conversation here from the studios in the CUBE in Palo Alto, California. I'm John Furrier, your host. We're featuring a startup, Astronomer. Astronomer.io is the URL, check it out. And we're going to have a great conversation around one of the most important topics hitting the industry, and that is the future of machine learning and AI, and the data that powers it underneath it. There's a lot of things that need to get done, and we're excited to have some of the co-founders of Astronomer here. Viraj Parekh, who is co-founder of Astronomer, and Paola Peraza Calderon, another co-founder, both with Astronomer. Thanks for coming on. First of all, how many co-founders do you guys have? >> You know, I think the answer's around six or seven. I forget the exact, but there's really been a lot of people around the table who've worked very hard to get this company to the point that it's at. We have long ways to go, right? But there's been a lot of people involved that have been absolutely necessary for the path we've been on so far. >> Thanks for that, Viraj, appreciate that. The first question I want to get out on the table, and then we'll get into some of the details, is take a minute to explain what you guys are doing. How did you guys get here? Obviously, multiple co-founders, sounds like a great project. The timing couldn't have been better. ChatGPT has essentially done so much public relations for the AI industry to kind of highlight this shift that's happening. It's real, we've been chronicalizing, take a minute to explain what you guys do. >> Yeah, sure, we can get started. So, yeah, when Viraj and I joined Astronomer in 2017, we really wanted to build a business around data, and we were using an open source project called Apache Airflow that we were just using sort of as customers ourselves. 
And over time, we realized that there was actually a market for companies who use Apache Airflow, which is a data pipeline management tool, which we'll get into, and that running Airflow is actually quite challenging, and that there's a big opportunity for us to create a set of commercial products and an opportunity to grow that open source community and actually build a company around that. So the crux of what we do is help companies run data pipelines with Apache Airflow. And certainly we've grown in our ambitions beyond that, but that's sort of the crux of what we do for folks. >> You know, data orchestration, data management has always been a big item in the old classic data infrastructure. But with AI, you're seeing a lot more emphasis on scale, tuning, training. Data orchestration is the center of the value proposition, when you're looking at coordinating resources, it's one of the most important things. Can you guys explain what data orchestration entails? What does it mean? Take us through the definition of what data orchestration entails. >> Yeah, for sure. I can take this one, and Viraj, feel free to jump in. So if you google data orchestration, here's what you're going to get. You're going to get something that says, "Data orchestration is the automated process for organizing siloed data from numerous data storage points, standardizing it, and making it accessible and prepared for data analysis." And you say, "Okay, but what does that actually mean?" Right, and so let's give sort of an example. So let's say you're a business and you have sort of the following basic asks of your data team, right? Okay, give me a dashboard in Sigma, for example, for the number of customers or monthly active users, and then make sure that that gets updated on an hourly basis. And then number two, a consistent list of active customers that I have in HubSpot so that I can send them a monthly product newsletter, right?
Two very basic asks for all sorts of companies and organizations. And when that data team, which has data engineers, data scientists, ML engineers, data analysts get that request, they're looking at an ecosystem of data sources that can help them get there, right? And that includes application databases, for example, that actually have in product user behavior and third party APIs from tools that the company uses that also has different attributes and qualities of those customers or users. And that data team needs to use tools like Fivetran to ingest data, a data warehouse, like Snowflake or Databricks to actually store that data and do analysis on top of it, a tool like DBT to do transformations and make sure that data is standardized in the way that it needs to be, a tool like Hightouch for reverse ETL. I mean, we could go on and on. There's so many partners of ours in this industry that are doing really, really exciting and critical things for those data movements. And the whole point here is that data teams have this plethora of tooling that they use to both ingest the right data and come up with the right interfaces to transform and interact with that data. And data orchestration, in our view, is really the heartbeat of all of those processes, right? And tangibly the unit of data orchestration is a data pipeline, a set of tasks or jobs that each do something with data over time and eventually run that on a schedule to make sure that those things are happening continuously as time moves on and the company advances. And so, for us, we're building a business around Apache Airflow, which is a workflow management tool that allows you to author, run, and monitor data pipelines. And so when we talk about data orchestration, we talk about sort of two things. One is that crux of data pipelines that, like I said, connect that large ecosystem of data tooling in your company. But number two, it's not just that data pipeline that needs to run every day, right? 
And Viraj will probably touch on this as we talk more about Astronomer and our value prop on top of Airflow. But then it's all the things that you need to actually run data in production and make sure that it's trustworthy, right? So it's actually not just that you're running things on a schedule, but it's also things like CI/CD tooling, secure secrets management, user permissions, monitoring, data lineage, documentation, things that enable other personas in your data team to actually use those tools. So long-winded way of saying that it's the heartbeat, we think, of the data ecosystem, and certainly goes beyond scheduling, but again, data pipelines are really at the center of it. >> One of the things that jumped out, Viraj, if you can get into this, I'd like to hear more about how you guys look at all those little tools that are out. You mentioned a variety of things. You look at the data infrastructure, it's not just one stack. You've got an analytic stack, you've got a realtime stack, you've got a data lake stack, you got an AI stack potentially. I mean you have these stacks now emerging in the data world that are fundamental, that were once served by either a full package, old school software, and then a bunch of point solutions. You mentioned Fivetran there, I would say in the analytics stack. Then you got S3, they're on the data lake stack. So all these things are kind of munged together. >> Yeah. >> How do you guys fit into that world? You make it easier, or like, what's the deal? >> Great question, right? And you know, I think that one of the biggest things we've found in working with customers over the last however many years is that if a data team is using a bunch of tools to get what they need done, and the number of tools they're using is growing exponentially and they're kind of roping things together here and there, that's actually a sign of a productive team, not a bad thing, right? It's because that team is moving fast.
They have needs that are very specific to them, and they're trying to make something that's exactly tailored to their business. So a lot of times what we find is that customers have some sort of base layer, right? That's kind of like, it might be they're running most of the things in AWS, right? And then on top of that, they'll be using some of the things AWS offers, things like SageMaker, Redshift, whatever, but they also might need things that their cloud can't provide. Something like Fivetran, or Hightouch, those are other tools. And where data orchestration really shines, and something that we've had the pleasure of helping our customers build, is how do you take all those requirements, all those different tools and whip them together into something that fulfills a business need? So that somebody can read a dashboard and trust the number that it says, or somebody can make sure that the right emails go out to their customers. And Airflow serves as this amazing kind of glue between that data stack, right? It's to make it so that for any use case, be it ELT pipelines, or machine learning, or whatever, you need different things to do them, and Airflow helps tie them together in a way that's really specific for an individual business's needs. >> Take a step back and share the journey of what you guys went through as a company startup. So you mentioned Apache, open source. I was just having an interview with a VC, we were talking about foundational models. You got a lot of proprietary and open source development going on. It's almost the iPhone/Android moment in this whole generative space and foundational side. This is kind of important, the open source piece of it. Can you share how you guys started? And I can imagine your customers probably have their hair on fire and are probably building stuff on their own. Are you guys helping them? Take us through, 'cause you guys are on the front end of a big, big wave, and that is to make sense of the chaos, rein it in.
Take us through your journey and why this is important. >> Yeah, Paola, I can take a crack at this, then I'll kind of hand it over to you to fill in whatever I miss in details. But you know, like Paola is saying, the heart of our company is open source, because we started using Airflow as an end user and started to say like, "Hey, wait a second, more and more people need this." Airflow, for background, started at Airbnb, and they were actually using that as a foundation for their whole data stack. Kind of how they made it so that they could give you recommendations, and predictions, and all of the processes that needed to be orchestrated. Airbnb created Airflow, gave it away to the public, and then fast forward a couple years and we're building a company around it, and we're really excited about that. >> That's a beautiful thing. That's exactly why open source is so great. >> Yeah, yeah. And for us, it's really been about watching the community and our customers take these problems, find a solution to those problems, standardize those solutions, and then building on top of that, right? So we're reaching a point where a lot of our earlier customers who started just using Airflow to get the base of their BI stack down and their reporting in their ELT infrastructure, they've solved that problem and now they're moving on to things like doing machine learning with their data, because now that they've built that foundation, all the connective tissue for their data arriving on time and being orchestrated correctly is happening, they can build a layer on top of that. And it's just been really, really exciting kind of watching what customers do once they're empowered to pick all the tools that they need, tie them together in the way they need to, and really deliver real value to their business. >> Can you share some of the use cases of these customers? Because I think that's where you're starting to see the innovation.
What are some of the companies that you're working with, what are they doing? >> Viraj, I'll let you take that one too. (group laughs) >> So you know, a lot of it is... It goes across the gamut, right? Because it doesn't matter what you are, what you're doing with data, it needs to be orchestrated. So there's a lot of customers using us for their ETL and ELT reporting, right? Just getting data from other disparate sources into one place and then building on top of that. Be it building dashboards, answering questions for the business, building other data products and so on and so forth. From there, these use cases evolve a lot. You do see folks doing things like fraud detection, because Airflow's orchestrating how transactions go, transactions get analyzed. They do things like analyzing marketing spend to see where your highest ROI is. And then you kind of can't not talk about all of the machine learning that goes on, right? Where customers are taking data about their own customers, kind of analyzing and aggregating that at scale, and trying to automate decision making processes. So it goes from your most basic, what we call data plumbing, right? Just to make sure data's moving as needed, all the way to your more exciting, expansive use cases around automated decision making and machine learning. >> And I'd say, I mean, I'd say that's one of the things that I think gets me most excited about our future, is how critical Airflow is to all of those processes, and I think when you know a tool is valuable is when something goes wrong and one of those critical processes doesn't work. And we know that our system is so mission critical to answering basic questions about your business and the growth of your company for so many organizations that we work with.
So it's, I think, one of the things that gets Viraj and I and the rest of our company up every single morning is knowing how important the work that we do is for all of those use cases across industries, across company sizes, and it's really quite energizing. >> It was such a big focus this year at AWS re:Invent, the role of data. And I think one of the things that's exciting about the open AI and all the movement towards large language models is that you can integrate data into these models from outside. So you're starting to see the integration easier to deal with. Still a lot of plumbing issues. So a lot of things happening. So I have to ask you guys, what is the state of the data orchestration area? Is it ready for disruption? Has it already been disrupted? Would you categorize it as a new first inning kind of opportunity, or what's the state of the data orchestration area right now? Both technically and from a business model standpoint. How would you guys describe that state of the market? >> Yeah, I mean, I think in a lot of ways, in some ways I think we're category creating. Schedulers have been around for a long time. I released a data presentation sort of on the evolution of going from something like cron, which I think was built in like the 1970s out of Carnegie Mellon. And that's a long time ago, that's 50 years ago. So sort of like the basic need to schedule and do something with your data on a schedule is not a new concept. But to our point earlier, I think everything that you need around your ecosystem, first of all, the number of data tools and developer tooling that has come out in the industry has 5X'd over the last 10 years. And so obviously as that ecosystem grows, and grows, and grows, and grows, the need for orchestration only increases. And I think, as Astronomer, I think we... And we work with so many different types of companies, companies that have been around for 50 years, and companies that got started not even 12 months ago.
And so I think for us it's trying to, in a way, category create and adjust sort of what we sell and the value that we can provide for companies all across that journey. There are folks who are just getting started with orchestration, and then there's folks who have such advanced use cases, 'cause they're hitting sort of a ceiling and only want to go up from there. And so I think we, as a company, care about both ends of that spectrum, and certainly want to build and continue building products for companies of all sorts, regardless of where they are on the maturity curve of data orchestration. >> That's a really good point, Paola. And I think the other thing to really take into account is it's the companies themselves, but also individuals who have to do their jobs. If you rewind the clock like 5 or 10 years ago, data engineers would be the ones responsible for orchestrating data through their org. But when we look at our customers today, it's not just data engineers anymore. There's data analysts who sit a lot closer to the business, and the data scientists who want to automate things around their models. So this idea that orchestration is this new category is right on the money. And what we're finding is the need for it is spreading to all parts of the data team, naturally where Airflow's emerged as an open source standard and we're hoping to take things to the next level. >> That's awesome. We've been saying that the data market's kind of like the SRE with servers, right? You're going to need one person to deal with a lot of data, and that's data engineering, and then you've got to have the practitioners, the democratization. Clearly that's coming in what you're seeing. So I have to ask, how do you guys fit in from a value proposition standpoint? What's the pitch that you have to customers, or is it more inbound coming into you guys? Are you guys doing a lot of outreach, customer engagements? I'm sure you're getting a lot of great requirements from customers.
What's the current value proposition? How do you guys engage? >> Yeah, I mean, there's so many... Sorry, Viraj, you can jump in. So there's so many companies using Airflow, right? So the baseline is that the open source project that is Airflow that came out of Airbnb, over five years ago at this point, has grown exponentially in users and continues to grow. And so the folks that we sell to primarily are folks who are already committed to using Apache Airflow, need data orchestration in their organization, and just want to do it better, want to do it more efficiently, want to do it without managing that infrastructure. And so our baseline proposition is for those organizations. Now to Viraj's point, obviously I think our ambitions go beyond that, both in terms of the personas that we address and going beyond that data engineer, but really it's to start at the baseline, as we continue to grow our company, it's really making sure that we're adding value to folks using Airflow and help them do so in a better way, in a larger way, in a more efficient way, and that's really the crux of who we sell to. And so to answer your question on, we get a lot of inbound because they're... >> You have a built in audience. (laughs) >> The world that use it. Those are the folks who we talk to and come to our website and chat with us and get value from our content. I mean, the power of the open source community is really just so, so big, and I think that's also one of the things that makes this job fun. >> And you guys are in a great position. Viraj, you can comment a little, get your reaction. There's been a big successful business model to starting a company around these big projects for a lot of reasons. One is open source is continuing to be great, but there's also supply chain challenges in there. There's also we want to continue more innovation and more code and keeping it free and flowing. And then there's the commercialization of productizing it, operationalizing it.
This is a huge new dynamic, I mean, in the past 5 or so years, 10 years, it's been happening all on CNCF from other areas like Apache, Linux Foundation, they're all implementing this. This is a huge opportunity for entrepreneurs to do this. >> Yeah, yeah. Open source is always going to be core to what we do, because we wouldn't exist without the open source community around us. They are huge in numbers. Oftentimes they're nameless people who are working on making something better in a way that everybody benefits from it. But open source is really hard, especially if you're a company whose core competency is running a business, right? Maybe you're running an e-commerce business, or maybe you're running, I don't know, some sort of like, any sort of business, especially if you're a company running a business, you don't really want to spend your time figuring out how to run open source software. You just want to use it, you want to use the best of it, you want to use the community around it, you want to be able to google something and get answers for it, you want the benefits of open source. You don't have the time or the resources to invest in becoming an expert in open source, right? And I think that dynamic is really what's given companies like us an ability to kind of form businesses around that in the sense that we'll make it so people get the best of both worlds. You'll get this vast open ecosystem that you can build on top of, that you can benefit from, that you can learn from. But you won't have to spend your time doing undifferentiated heavy lifting. You can do things that are just specific to your business. >> It's always been great to see that business model evolve. We used to debate 10 years ago, can there be another Red Hat? And we said, not really the same, but there'll be a lot of little ones that'll grow up to be big soon. Great stuff. Final question, can you guys share the history of the company?
The milestones of Astronomer's journey in data orchestration? >> Yeah, we could. So yeah, I mean, I think, so Viraj and I have obviously been at Astronomer along with our other founding team and leadership folks for over five years now. And it's been such an incredible journey of learning, of hiring really amazing people, solving, again, mission critical problems for so many types of organizations. We've had some funding that has allowed us to invest in the team that we have and in the software that we have, and that's been really phenomenal. And so that investment, I think, keeps us confident, even despite these sort of macroeconomic conditions that we're finding ourselves in. And so honestly, the milestones for us are focusing on our product, focusing on our customers over the next year, focusing on that market for us that we know can get value out of what we do, and making developers' lives better, and growing the open source community and making sure that everything that we're doing makes it easier for folks to get started, to contribute to the project and to feel a part of the community that we're cultivating here. >> You guys raised a little bit of money. How much have you guys raised? >> Don't know what the total is, but it's in the ballpark of over $200 million. It feels good to... >> A little bit of capital. Got a little bit of cap to work with there. Great success. I know as a Series C Financing, you guys have been down. So you're up and running, what's next? What are you guys looking to do? What's the big horizon look like for you from a vision standpoint, more hiring, more product, what are some of the key things you're looking at doing? >> Yeah, it's really a little of all of the above, right? Kind of one of the best and worst things about working at earlier stage startups is there's always so much to do and you often have to just kind of figure out a way to get everything done.
But really investing in our product over the next, at least over the course of our company lifetime. And there's a lot of ways we want to make it more accessible to users, easier to get started with, easier to use, kind of on all areas there. And really, we really want to do more for the community, right, like I was saying, we wouldn't be anything without the large open source community around us. And we want to figure out ways to give back more in more creative ways, in more code driven ways, in more kind of events and everything else that we can keep those folks galvanized and just keep them happy using Airflow. >> Paola, any final words as we close out? >> No, I mean, I'm super excited. I think we'll keep growing the team this year. We've got a couple of offices in the US, which we're excited about, and a fully global team that will only continue to grow. So Viraj and I are both here in New York, and we're excited to be engaging with our coworkers in person finally, after years of not doing so. We've got a bustling office in San Francisco as well. So growing those teams and continuing to hire all over the world, and really focusing on our product and the open source community is where our heads are at this year. So, excited. >> Congratulations. 200 million in funding, plus. Good runway, put that money in the bank, squirrel it away. It's a good time to kind of get some good interest on it, but still grow. Congratulations on all the work you guys do. We appreciate you and the open source community does, and good luck with the venture, continue to be successful, and we'll see you at the Startup Showcase. >> Thank you. >> Yeah, thanks so much, John. Appreciate it. >> Okay, that's the CUBE Conversation featuring astronomer.io, that's the website. Astronomer is doing well. Multiple rounds of funding, over 200 million in funding. Open source continues to lead the way in innovation.
Great business model, good solution for the next gen cloud scale data operations, data stacks that are emerging. I'm John Furrier, your host, thanks for watching. (soft upbeat music)
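
Editor's sketch: the pipeline Paola describes in the interview (ingest from application databases and third-party APIs, load a warehouse, transform, then refresh a dashboard and sync a customer list) can be pictured as a small dependency graph. The Python below is only an illustration built on the standard library's graphlib module. It is not Airflow's API and not Astronomer's product, and every task name is a hypothetical stand-in for the tools mentioned above.

```python
from graphlib import TopologicalSorter

# Hypothetical task names, loosely modeled on the interview's example.
# Each key maps to the set of tasks that must finish before it can start.
PIPELINE = {
    "ingest_app_db": set(),
    "ingest_third_party_api": set(),
    "load_warehouse": {"ingest_app_db", "ingest_third_party_api"},
    "transform_active_users": {"load_warehouse"},
    "refresh_dashboard": {"transform_active_users"},
    "sync_customer_list": {"transform_active_users"},
}


def run_pipeline(pipeline, run_task):
    """Run every task once, in an order that respects the dependencies."""
    completed = []
    for task in TopologicalSorter(pipeline).static_order():
        run_task(task)
        completed.append(task)
    return completed


def log_task(task):
    """Stand-in for real work: just report which task would run."""
    print(f"running {task}")


if __name__ == "__main__":
    run_pipeline(PIPELINE, log_task)
```

A real orchestrator would also handle the parts the speakers stress beyond ordering: running on a schedule (the hourly dashboard refresh, for example), retries, secrets, permissions, and monitoring. This sketch covers only the ordering of tasks.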

Published Date : Feb 14 2023


ENTITIES

Entity	Category	Confidence
Viraj Parekh	PERSON	0.99+
Paola	PERSON	0.99+
Viraj	PERSON	0.99+
John	PERSON	0.99+
John Furrier	PERSON	0.99+
Airbnb	ORGANIZATION	0.99+
2017	DATE	0.99+
San Francisco	LOCATION	0.99+
New York	LOCATION	0.99+
Apache	ORGANIZATION	0.99+
US	LOCATION	0.99+
Two	QUANTITY	0.99+
AWS	ORGANIZATION	0.99+
Paola Peraza Calderon	PERSON	0.99+
1970s	DATE	0.99+
first question	QUANTITY	0.99+
Palo Alto, California	LOCATION	0.99+
iPhone	COMMERCIAL_ITEM	0.99+
Airflow	TITLE	0.99+
both	QUANTITY	0.99+
Linux Foundation	ORGANIZATION	0.99+
200 million	QUANTITY	0.99+
Astronomer	ORGANIZATION	0.99+
One	QUANTITY	0.99+
over 200 million	QUANTITY	0.99+
over $200 million	QUANTITY	0.99+
this year	DATE	0.99+
10 years ago	DATE	0.99+
HubSpot	ORGANIZATION	0.98+
Fivetran	ORGANIZATION	0.98+
50 years ago	DATE	0.98+
over five years	QUANTITY	0.98+
one stack	QUANTITY	0.98+
12 months ago	DATE	0.98+
10 years	QUANTITY	0.97+
Both	QUANTITY	0.97+
Apache Airflow	TITLE	0.97+
both worlds	QUANTITY	0.97+
CNCF	ORGANIZATION	0.97+
one	QUANTITY	0.97+
ChatGPT	ORGANIZATION	0.97+
5	DATE	0.97+
next year	DATE	0.96+
Astromer	ORGANIZATION	0.96+
today	DATE	0.95+
5X	QUANTITY	0.95+
over five years ago	DATE	0.95+
CUBE	ORGANIZATION	0.94+
two things	QUANTITY	0.94+
each	QUANTITY	0.93+
one person	QUANTITY	0.93+
First	QUANTITY	0.92+
S3	TITLE	0.91+
Carnegie Mellon	ORGANIZATION	0.91+
Startup Showcase	EVENT	0.91+

AWS Startup Showcase S3E1

(soft music) >> Hello everyone, welcome to this Cube conversation here from the studios of theCube in Palo Alto, California. John Furrier, your host. We're featuring a startup, Astronomer, astronomer.io is the url. Check it out. And we're going to have a great conversation around one of the most important topics hitting the industry, and that is the future of machine learning and AI and the data that powers it underneath it. There's a lot of things that need to get done, and we're excited to have some of the co-founders of Astronomer here. Viraj Parekh, who is co-founder and Paola Peraza Calderon, another co-founder, both with Astronomer. Thanks for coming on. First of all, how many co-founders do you guys have? >> You know, I think the answer's around six or seven. I forget the exact, but there's really been a lot of people around the table, who've worked very hard to get this company to the point that it's at. And we have long ways to go, right? But there's been a lot of people involved that are, have been absolutely necessary for the path we've been on so far. >> Thanks for that, Viraj, appreciate that. The first question I want to get out on the table, and then we'll get into some of the details, is take a minute to explain what you guys are doing. How did you guys get here? Obviously, multiple co-founders sounds like a great project. The timing couldn't have been better. ChatGPT has essentially done so much public relations for the AI industry. Kind of highlight this shift that's happening. It's real. We've been chronologicalizing, take a minute to explain what you guys do. >> Yeah, sure. We can get started. So yeah, when Astronomer, when Viraj and I joined Astronomer in 2017, we really wanted to build a business around data and we were using an open source project called Apache Airflow, that we were just using sort of as customers ourselves. 
And over time, we realized that there was actually a market for companies who use Apache Airflow, which is a data pipeline management tool, which we'll get into. And that running Airflow is actually quite challenging and that there's a lot of, a big opportunity for us to create a set of commercial products and opportunity to grow that open source community and actually build a company around that. So the crux of what we do is help companies run data pipelines with Apache Airflow. And certainly we've grown in our ambitions beyond that, but that's sort of the crux of what we do for folks. >> You know, data orchestration, data management has always been a big item, you know, in the old classic data infrastructure. But with AI you're seeing a lot more emphasis on scale, tuning, training. You know, data orchestration is the center of the value proposition when you're looking at coordinating resources, it's one of the most important things. Could you guys explain what data orchestration entails? What does it mean? Take us through the definition of what data orchestration entails. >> Yeah, for sure. I can take this one and Viraj feel free to jump in. So if you google data orchestration, you know, here's what you're going to get. You're going to get something that says, data orchestration is the automated process for organizing silo data from numerous data storage points to organizing it and making it accessible and prepared for data analysis. And you say, okay, but what does that actually mean, right? And so let's give sort of an example. So let's say you're a business and you have sort of the following basic asks of your data team, right? Hey, give me a dashboard in Sigma, for example, for the number of customers or monthly active users and then make sure that that gets updated on an hourly basis. And then number two, a consistent list of active customers that I have in HubSpot so that I can send them a monthly product newsletter, right? 
Two very basic asks for all sorts of companies and organizations. And when that data team, which has data engineers, data scientists, ML engineers, data analysts get that request, they're looking at an ecosystem of data sources that can help them get there, right? And that includes application databases, for example, that actually have end product user behavior and third party APIs from tools that the company uses that also have different attributes and qualities of those customers or users. And that data team needs to use tools like Fivetran, to ingest data, a data warehouse like Snowflake or Databricks to actually store that data and do analysis on top of it, a tool like dbt to do transformations and make sure that that data is standardized in the way that it needs to be, a tool like Hightouch for reverse ETL. I mean, we could go on and on. There's so many partners of ours in this industry that are doing really, really exciting and critical things for those data movements. And the whole point here is that, you know, data teams have this plethora of tooling that they use to both ingest the right data and come up with the right interfaces to transform and interact with that data. And data orchestration in our view is really the heartbeat of all of those processes, right? And tangibly the unit of data orchestration, you know, is a data pipeline, a set of tasks or jobs that each do something with data over time and eventually run that on a schedule to make sure that those things are happening continuously as time moves on and, you know, the company advances. And so, you know, for us, we're building a business around Apache Airflow, which is a workflow management tool that allows you to author, run and monitor data pipelines. And so when we talk about data orchestration, we talk about sort of two things. One is that crux of data pipelines that, like I said, connect that large ecosystem of data tooling in your company. 
But number two, it's not just that data pipeline that needs to run every day, right? And Viraj will probably touch on this as we talk more about Astronomer and our value prop on top of Airflow. But then it's all the things that you need to actually run data in production and make sure that it's trustworthy, right? So it's actually not just that you're running things on a schedule, but it's also things like CI/CD tooling, right? Secure secrets management, user permissions, monitoring, data lineage, documentation, things that enable other personas in your data team to actually use those tools. So long-winded way of saying that, it's the heartbeat that we think of the data ecosystem and certainly goes beyond scheduling, but again, data pipelines are really at the center of it. >> You know, one of the things that jumped out Viraj, if you can get into this, I'd like to hear more about how you guys look at all those little tools that are out there. You mentioned a variety of things. You know, if you look at the data infrastructure, it's not just one stack. You've got an analytic stack, you've got a realtime stack, you've got a data lake stack, you got an AI stack potentially. I mean you have these stacks now emerging in the data world that are >> Yeah. >> fundamental, but were once served by either a full package, old school software, and then a bunch of point solutions. You mentioned Fivetran there, I would say in the analytics stack. Then you got, you know, S3, that's on the data lake stack. So all these things are kind of munged together. >> Yeah. >> How do you guys fit into that world? You make it easier or like, what's the deal? >> Great question, right? 
And you know, I think that one of the biggest things we've found in working with customers over, you know, the last however many years, is that like if a data team is using a bunch of tools to get what they need done and the number of tools they're using is growing exponentially and they're kind of roping things together here and there, that's actually a sign of a productive team, not a bad thing, right? It's because that team is moving fast. They have needs that are very specific to them and they're trying to make something that's exactly tailored to their business. So a lot of times what we find is that customers have like some sort of base layer, right? That's kind of like, you know, it might be they're running most of the things in AWS, right? And then on top of that, they'll be using some of the things AWS offers, you know, things like SageMaker, Redshift, whatever. But they also might need things that their Cloud can't provide, you know, something like Fivetran or Hightouch or any of those other tools. And where data orchestration really shines, right, and something that we've had the pleasure of helping our customers build, is how do you take all those requirements, all those different tools and whip them together into something that fulfills a business need, right? Something that makes it so that somebody can read a dashboard and trust the number that it says or somebody can make sure that the right emails go out to their customers. And Airflow serves as this amazing kind of glue between that data stack, right? It's to make it so that for any use case, be it ELT pipelines or machine learning or whatever, you need different things to do them and Airflow helps tie them together in a way that's really specific for an individual business's needs. >> Take a step back and share the journey of what you guys went through as a company startup. 
So you mentioned Apache open source, you know, we were just, I was just having an interview with the VC, we were talking about foundational models. You got a lot of proprietary and open source development going on. It's almost the iPhone, Android moment in this whole generative space and foundational side. This is kind of important, the open source piece of it. Can you share how you guys started? And I can imagine your customers probably have their hair on fire and are probably building stuff on their own. How do you guys, are you guys helping them? Take us through, 'cuz you guys are on the front end of a big, big wave and that is to make sense of the chaos, reining it in. Take us through your journey and why this is important. >> Yeah Paola, I can take a crack at this and then I'll kind of hand it over to you to fill in whatever I miss in details. But you know, like Paola is saying, the heart of our company is open source because we started using Airflow as an end user and started to say like, "Hey, wait a second." Like, more and more people need this. Airflow, for background, started at Airbnb and they were actually using that as the foundation for their whole data stack. Kind of how they made it so that they could give you recommendations and predictions and all of the processes that need to be or needed to be orchestrated. Airbnb created Airflow, gave it away to the public and then, you know, fast forward a couple years and you know, we're building a company around it and we're really excited about that. >> That's a beautiful thing. That's exactly why open source is so great. >> Yeah, yeah. And for us it's really been about like watching the community and our customers take these problems, find solutions to those problems, build standardized solutions, and then building on top of that, right? 
So we're reaching a point where a lot of our earlier customers who started just using Airflow to get the base of their BI stack down and their reporting and their ELT infrastructure, you know, they've solved that problem and now they're moving onto things like doing machine learning with their data, right? Because now that they've built that foundation, all the connective tissue for their data arriving on time and being orchestrated correctly is happening, they can build the layer on top of that. And it's just been really, really exciting kind of watching what customers do once they're empowered to pick all the tools that they need, tie them together in the way they need to, and really deliver real value to their business. >> Can you share some of the use cases of these customers? Because I think that's where you're starting to see the innovation. What are some of the companies that you're working with, what are they doing? >> Raj, I'll let you take that one too. (all laughing) >> Yeah. (all laughing) So you know, a lot of it is, it goes across the gamut, right? Because it doesn't matter what you are, what you're doing with data, it needs to be orchestrated. So there's a lot of customers using us for their ETL and ELT reporting, right? Just getting data from all the disparate sources into one place and then building on top of that, be it building dashboards, answering questions for the business, building other data products and so on and so forth. From there, these use cases evolve a lot. You do see folks doing things like fraud detection because Airflow's orchestrating how transactions go. Transactions get analyzed, they do things like analyzing marketing spend to see where your highest ROI is. And then, you know, you kind of can't not talk about all of the machine learning that goes on, right? Where customers are taking data about their own customers, kind of analyzing and aggregating that at scale and trying to automate decision making processes. 
So it goes from your most basic, what we call like data plumbing, right? Just to make sure data's moving as needed. All the way to your more exciting and sexy use cases around like automated decision making and machine learning. >> And I'd say, I mean, I'd say that's one of the things that I think gets me most excited about our future is how critical Airflow is to all of those processes, you know? And I think when, you know, you know a tool is valuable is when something goes wrong and one of those critical processes doesn't work. And we know that our system is so mission critical to answering basic, you know, questions about your business and the growth of your company for so many organizations that we work with. So it's, I think one of the things that gets Viraj and I, and the rest of our company up every single morning, is knowing how important the work that we do is for all of those use cases across industries, across company sizes. And it's really quite energizing. >> It was such a big focus this year at AWS re:Invent, the role of data. And I think one of the things that's exciting about OpenAI and all the movement towards large language models, is that you can integrate data into these models, right? From outside, right? So you're starting to see the integration easier to deal with, still a lot of plumbing issues. So a lot of things happening. So I have to ask you guys, what is the state of the data orchestration area? Is it ready for disruption? Has it already been disrupted? Would you categorize it as a new first inning kind of opportunity or what's the state of the data orchestration area right now? Both, you know, technically and from a business model standpoint, how would you guys describe that state of the market? >> Yeah, I mean I think, I think in a lot of ways we're, in some ways I think we're category creating. You know, schedulers have been around for a long time. 
I recently did a presentation sort of on the evolution of going from, you know, something like cron, which I think was built in like the 1970s out of Carnegie Mellon. And you know, that's a long time ago. That's 50 years ago. So it's sort of like the basic need to schedule and do something with your data on a schedule is not a new concept. But to our point earlier, I think everything that you need around your ecosystem, first of all, the number of data tools and developer tooling that has come out of the industry has, you know, grown some 5X over the last 10 years. And so obviously as that ecosystem grows and grows and grows and grows, the need for orchestration only increases. And I think, you know, as Astronomer, I think we, and there's, we work with so many different types of companies, companies that have been around for 50 years and companies that got started, you know, not even 12 months ago. And so I think for us, it's trying to always category create and adjust sort of what we sell and the value that we can provide for companies all across that journey. There are folks who are just getting started with orchestration and then there's folks who have such advanced use cases 'cuz they're hitting sort of a ceiling and only want to go up from there. And so I think we as a company, care about both ends of that spectrum and certainly want to build and continue building products for companies of all sorts, regardless of where they are on the maturity curve of data orchestration. >> That's a really good point Paola. And I think the other thing to really take into account is it's the companies themselves, but also individuals who have to do their jobs. You know, if you rewind the clock like five or 10 years ago, data engineers would be the ones responsible for orchestrating data through their org. But when we look at our customers today, it's not just data engineers anymore. 
There's data analysts who sit a lot closer to the business and the data scientists who want to automate things around their models. So this idea that orchestration is this new category is spot on, is right on the money. And what we're finding is it's spreading, the need for it, is spreading to all parts of the data team naturally where Airflow has emerged as an open source standard and we're hoping to take things to the next level. >> That's awesome. You know, we've been saying that the data market's kind of like the SRE with servers, right? You're going to need one person to deal with a lot of data and that's data engineering and then you're going to have the practitioners, the democratization. Clearly that's coming in what you're seeing. So I got to ask, how do you guys fit in from a value proposition standpoint? What's the pitch that you have to customers or is it more inbound coming into you guys? Are you guys doing a lot of outreach, customer engagements? I'm sure you're getting a lot of great requirements from customers. What's the current value proposition? How do you guys engage? >> Yeah, I mean we've, there's so many, there's so many. Sorry Raj, you can jump in. >> It's okay. So there's so many companies using Airflow, right? So our, the baseline is that the open source project that is Airflow that was, that came out of Airbnb, you know, over five years ago at this point, has grown exponentially in users and continues to grow. And so the folks that we sell to primarily are folks who are already committed to using Apache Airflow, need data orchestration in the organization and just want to do it better, want to do it more efficiently, want to do it without managing that infrastructure. And so our baseline proposition is for those organizations. Now to Raj's point, obviously I think our ambitions go beyond that, both in terms of the personas that we address and going beyond that data engineer, but really it's for, to start at the baseline. 
You know, as we continue to grow our company, it's really making sure that we're adding value to folks using Airflow and help them do so in a better way, in a larger way and a more efficient way. And that's really the crux of who we sell to. And so to answer your question on, we actually, we get a lot of inbound because there are so many - >> A built-in audience. >> In the world that use it, that those are the folks who we talk to and come to our website and chat with us and get value from our content. I mean the power of the open source community is really just so, so big. And I think that's also one of the things that makes this job fun, so. >> And you guys are in a great position, Viraj, you can comment, to get your reaction. There's been a big successful business model to starting a company around these big projects for a lot of reasons. One is open source is continuing to be great, but there's also supply chain challenges in there. There's also, you know, we want to continue more innovation and more code and keeping it free and flowing. And then there's the commercialization of productizing it, operationalizing it. This is a huge new dynamic. I mean, in the past, you know, five or so years, 10 years, it's been happening all on CNCF from other areas like Apache, Linux Foundation, they're all implementing this. This is a huge opportunity for entrepreneurs to do this. >> Yeah, yeah. Open source is always going to be core to what we do because, you know, we wouldn't exist without the open source community around us. They are huge in numbers. Oftentimes they're nameless people who are working on making something better in a way that everybody benefits from it. But open source is really hard, especially if you're a company whose core competency is running a business, right? 
Maybe you're running an e-commerce business or maybe you're running, I don't know, some sort of like any sort of business, especially if you're a company running a business, you don't really want to spend your time figuring out how to run open source software. You just want to use it, you want to use the best of it, you want to use the community around it. You want to take, you want to be able to google something and get answers for it. You want the benefits of open source. You don't want to have, you don't have the time or the resources to invest in becoming an expert in open source, right? And I think that dynamic is really what's given companies like us an ability to kind of form businesses around that, in the sense that we'll make it so people get the best of both worlds. You'll get this vast open ecosystem that you can build on top of, you can benefit from, that you can learn from, but you won't have to spend your time doing undifferentiated heavy lifting. You can do things that are just specific to your business. >> It's always been great to see that business model evolve. We used to debate 10 years ago, can there be another Red Hat? And we said, not really the same, but there'll be a lot of little ones that'll grow up to be big soon. Great stuff. Final question, can you guys share the history of the company, the milestones of the Astronomer's journey in data orchestration? >> Yeah, we could. So yeah, I mean, I think, so Raj and I have obviously been at Astronomer along with our other founding team and leadership folks, for over five years now. And it's been such an incredible journey of learning, of hiring really amazing people. Solving again, mission critical problems for so many types of organizations. You know, we've had some funding that has allowed us to invest in the team that we have and in the software that we have. And that's been really phenomenal. 
And so that investment, I think, keeps us confident even despite these sort of macroeconomic conditions that we're finding ourselves in. And so honestly, the milestones for us are focusing on our product, focusing on our customers over the next year, focusing on that market for us, that we know can get value out of what we do. And making developers' lives better and growing the open source community, you know, and making sure that everything that we're doing makes it easier for folks to get started to contribute to the project and to feel a part of the community that we're cultivating here. >> You guys raised a little bit of money. How much have you guys raised? >> I forget what the total is, but it's in the ballpark of 200, over $200 million. So it feels good - >> A little bit of capital. Got a little bit of cash to work with there. Great success. I know it's a Series C financing, you guys been down, so you're up and running. What's next? What are you guys looking to do? What's the big horizon look like for you? And from a vision standpoint, more hiring, more product, what are some of the key things you're looking at doing? >> Yeah, it's really a little of all of the above, right? Like, kind of one of the best and worst things about working at earlier stage startups is there's always so much to do and you often have to just kind of figure out a way to get everything done, but really invest in our product over the next, at least the next, over the course of our company lifetime. And there's a lot of ways we want to just make it more accessible to users, easier to get started with, easier to use, all kind of on all areas there. And really, we really want to do more for the community, right? Like I was saying, we wouldn't be anything without the large open source community around us. 
And we want to figure out ways to give back more in more creative ways, in more code driven ways and more kind of events and everything else that we can do to keep those folks galvanized and just keeping them happy using Airflow. >> Paola, any final words as we close out? >> No, I mean, I'm super excited. You know, I think we'll keep growing the team this year. We've got a couple of offices in the US which we're excited about, and a fully global team that will only continue to grow. So Viraj and I are both here in New York and we're excited to be engaging with our coworkers in person. Finally, after years of not doing so, we've got a bustling office in San Francisco as well. So growing those teams and continuing to hire all over the world and really focusing on our product and the open source community is where our heads are at this year, so. >> Congratulations. >> Excited. 200 million in funding plus good runway. Put that money in the bank, squirrel it away. You know, it's good to kind of get some good interest on it, but still grow. Congratulations on all the work you guys do. We appreciate you and the open source community does and good luck with the venture. Continue to be successful and we'll see you at the Startup Showcase. >> Thank you. >> Yeah, thanks so much, John. Appreciate it. >> It's theCube conversation, featuring astronomer.io, that's the website. Astronomer is doing well. Multiple rounds of funding, over 200 million in funding. Open source continues to lead the way in innovation. Great business model. Good solution for the next gen, Cloud, scale, data operations, data stacks that are emerging. I'm John Furrier, your host. Thanks for watching. (soft music)

Published Date : Feb 8 2023

theCUBE's New Analyst Talks Cloud & DevOps

(light music) >> Hi everybody. Welcome to this Cube Conversation. I'm really pleased to announce a collaboration with Rob Strechay. He's a guest cube analyst, and we'll be working together to extract the signal from the noise. Rob is a long-time product pro, working at a number of firms including AWS, HP, HPE, NetApp, Snowplow. He did a stint as an analyst at Enterprise Strategy Group. Rob, good to see you. Thanks for coming into our Marlboro Studios. >> Well, thank you for having me. It's always great to be here. >> I'm really excited about working with you. We've known each other for a long time. You've been in the Cube a bunch. You know, you're in between gigs, and I think we can have a lot of fun together. Covering events, covering trends. So, let's get into it. What's happening out there? We've sort of exited the isolation economy. Things were booming. Now, everybody's tapping the brakes. From your standpoint, what are you seeing out there? >> Yeah. I'm seeing that people are really looking how to get more out of their data. How they're bringing things together, how they're looking at the costs of Cloud, and understanding how are they building out their SaaS applications. And understanding that when they go in and actually start to use Cloud, it's not only just using the base services anymore. They're looking at, how do I use these platforms as a service? Some are easier than others, and they're trying to understand, how do I get more value out of that relationship with the Cloud? They're also consolidating the number of Clouds that they have, I would say to try to better optimize their spend, and getting better pricing for that matter. >> Are you seeing people unhook Clouds, or just reduce maybe certain Cloud activities and going maybe instead of 60/40 going 90/10? >> Correct. It's more like the 90/10 type of rule where they're starting to say, Hey I'm not going to get rid of Azure or AWS or Google. 
I'm going to move a portion of this over that I was using on this one service. Maybe I got a great two-year contract to start with on this platform as a service or a database as a service. I'm going to unhook from that and maybe go with an independent. Maybe with something like a Snowflake or a Databricks on top of another Cloud, so that I can consolidate down. But it also gives them more flexibility as well. >> In our last Breaking Analysis, Rob, we identified six factors that were reducing Cloud consumption. There were factors and customer tactics. And I want to get your take on this. So, some of the factors really, you got fewer mortgage originations. FinTech, obviously big Cloud user. Crypto, not as much activity there. Lower ad spending means less Cloud. And then one of 'em, which you kind of disagreed with was less, less analytics, you know, fewer... Less frequency of calculations. I'll come back to that. But then optimizing compute using Graviton or AMD instances, moving to cheaper storage tiers. That of course makes sense. And then optimize pricing plans. Maybe going from On Demand, you know, to, you know, instead of pay by the drink, buy in volume. Okay. So, first of all, do those make sense to you with the exception? We'll come back and talk about the analytics piece. Is that what you're seeing from customers? >> Yeah, I think so. I think that was pretty much dead on with what I'm seeing from customers and the ones that I go out and talk to. A lot of times they're trying to really monetize their, you know, understand how their business utilizes these Clouds. And, where their spend is going in those Clouds. Can they use, you know, lower tiers of storage? Do they really need the best processors? Do they need to be using Intel or can they get away with AMD or Graviton 2 or 3? Or do they need to move in? And, I think when you look at all of these Clouds, they always have pricing curves that are arcs from the newest to the oldest stuff. 
And you can play games with that. And understanding how you can actually lower your costs by looking at maybe some of the older generation. Maybe your application was written 10 years ago. You don't necessarily have to be on the best, newest processor for that application per se. >> So last, I want to come back to this whole analytics piece. Last June, I think it was June, Dev Ittycheria, who's the-- I call him Dev. Spelled Dev, pronounced Dave. (chuckles softly) Same pronunciation, different spelling. Dev Ittycheria, CEO of Mongo, on the earnings call. He was getting, you know, hit. Things were starting to get a little less visible in terms of, you know, the outlook. And people were pushing him like... Because you're in the Cloud, is it easier to dial down? And he said, because we're the document database, we support transaction applications. We're less discretionary than say, analytics. Well on the Snowflake earnings call, that same month or the month after, they were all over Slootman and Scarpelli. Oh, the Mongo CEO said that they're less discretionary than analytics. And Snowflake was an interesting comment. They basically said, look, we're the Cloud. You can dial it up, you can dial it down, but the area under the curve over a period of time is going to be the same, because they get their customers to commit. What do you say? You disagreed with the notion that people are running their calculations less frequently. Is that because they're trying to do a better job of targeting customers in near real time? What are you seeing out there? >> Yeah, I think they're moving away from using people and more expensive marketing. Or, they're trying to figure out what's my Google ad spend, what's my Meta ad spend? And what they're trying to do is optimize that spend. So, what is the return on advertising, or the ROAS as they would say. 
And what they're looking to do is understand, okay, I have to collect these analytics that better understand where are these people coming from? How do they get to my site, to my store, to my whatever? And when they're using it, how do they better move through that? What you're also seeing is that analytics is not only just for kind of the retail or financial services or things like that, but then they're also, you know, using that to make offers in those categories. When you move back to more, you know, take other companies that are building products and SaaS delivered products. They may actually go and use this analytics for making the product better. And one of the big reasons for that is maybe they're dialing back how many product managers they have. And they're looking to be more data driven about how they actually go and build the product out or enhance the product. So maybe they're, you know, an online video service and they want to understand why people are either using or not using the whiteboard inside the product. And they're collecting a lot of that product analytics in a big way so that they can go through that. And they're doing it in a constant manner. This first party type tracking within applications is growing rapidly by customers. >> So, let's talk about who wins in that. So, obviously the Cloud guys, AWS, Google and Azure. I want to come back and unpack that a little bit. Databricks and Snowflake, we reported on our last Breaking Analysis, are kind of on a collision course. You know, a couple years ago we were thinking, okay, AWS, Snowflake and Databricks, like perfect sandwich. And then of course they started to become more competitive. My sense is they still, you know, complement each other in the field, right? But, you know, publicly, they've got bigger aspirations, they get big TAMs that they're going after. But it's interesting, the data shows that-- So, Snowflake was off the charts in terms of spending momentum and our ETR surveys. 
Our partner down in New York, they kind of came into line. They're both growing in terms of market presence. Databricks couldn't get to IPO. So, we don't have as much, you know, visibility on their financials. You know, Snowflake obviously highly transparent cause they're a public company. And then you got AWS, Google and Azure. And it seems like AWS appears to be more partner friendly. Microsoft, you know, depends on what market you're in. And Google wants to sell BigQuery. >> Yeah. >> So, what are you seeing in the public Cloud from a data platform perspective? >> Yeah. I think that was pretty astute in what you were talking about there, because I think of the three, Google is definitely I think a little bit behind in how they go to market with their partners. Azure's done a fantastic job of partnering with these companies to understand and even though they may have Synapse as their go-to and where they want people to go to do AI and ML. What they're looking at is, Hey, we're going to also be friendly with Snowflake. We're also going to be friendly with a Databricks. And I think that, Amazon has always been there because that's where the market has been for these developers. So, many, like Databricks' and the Snowflake's have gone there first because, you know, Databricks' case, they built out on top of S3 first. And going and using somebody's object layer other than AWS, was not as simple as you would think it would be. Moving between those. >> So, one of the financial meetups I said meetup, but the... It was either the CEO or the CFO. It was either Slootman or Scarpelli talking at, I don't know, Merrill Lynch or one of the other financial conferences said, I think it was probably their Q3 call. Snowflake said 80% of our business goes through Amazon. And he said to this audience, the next day we got a call from Microsoft. Hey, we got to do more. 
And, we know just from reading the financial statements that Snowflake is getting concessions from Amazon, they're buying in volume, they're renegotiating their contracts. Amazon gets it. You know, lower the price, people buy more. Long term, we're all going to make more money. Microsoft obviously wants to get into that game with Snowflake. They understand the momentum. They said Google, not so much. And I've had customers tell me that they wanted to use Google's AI with Snowflake, but they can't, they got to go to to BigQuery. So, honestly, I haven't like vetted that so. But, I think it's true. But nonetheless, it seems like Google's a little less friendly with the data platform providers. What do you think? >> Yeah, I would say so. I think this is a place that Google looks and wants to own. Is that now, are they doing the right things long term? I mean again, you know, you look at Google Analytics being you know, basically outlawed in five countries in the EU because of GDPR concerns, and compliance and governance of data. And I think people are looking at Google and BigQuery in general and saying, is it the best place for me to go? Is it going to be in the right places where I need it? Still, it's still one of the largest used databases out there just because it underpins a number of the Google services. So you almost get, like you were saying, forced into BigQuery sometimes, if you want to use the tech on top. >> You do strategy. >> Yeah. >> Right? You do strategy, you do messaging. Is it the right call by Google? I mean, it's not a-- I criticize Google sometimes. But, I'm not sure it's the wrong call to say, Hey, this is our ace in the hole. >> Yeah. >> We got to get people into BigQuery. Cause, first of all, BigQuery is a solid product. I mean it's Cloud native and it's, you know, by all, it gets high marks. So, why give the competition an advantage? Let's try to force people essentially into what is we think a great product and it is a great product. 
The flip side of that is, they're giving up some potential partner TAM and not treating the ecosystem as well as one of their major competitors. What do you do if you're in that position? >> Yeah, I think that that's a fantastic question. And the question I pose back to the companies I've worked with and worked for is, are you really looking to have vendor lock-in as your key differentiator to your service? And I think when you start to look at these companies that are moving away from BigQuery, moving to even, Databricks on top of GCS in Google, they're looking to say, okay, I can go there if I have to evacuate from GCP and go to another Cloud, I can stay on Databricks as a platform, for instance. So I think it's, people are looking at what platform as a service, database as a service they go and use. Because from a strategic perspective, they don't want that vendor lock-in. >> That's where Supercloud becomes interesting, right? Because, if I can run on Snowflake or Databricks, you know, across Clouds. Even Oracle, you know, they're getting into business with Microsoft. Let's talk about some of the Cloud players. So, the big three have reported. >> Right. >> We saw AWS's Cloud growth decelerate down to 20%, which is I think the lowest growth rate since they started to disclose public numbers. And they said they exited, sorry, they said in January they grew at 15%. >> Yeah. >> Year on year. Now, they had some pretty tough compares. But nonetheless, 15%, wow. Azure, kind of mid thirties, and then Google, we had kind of low thirties. But, well behind in terms of size. And Google's losing probably almost $3 billion annually. But, that's not necessarily a bad thing by advocating and investing. What's happening with the Cloud? Is AWS just running into the law of large numbers? Do you think we can actually see a re-acceleration like we have in the past with AWS Cloud? Azure, we predicted is going to be 75% of AWS IaaS revenues. You know, we try to estimate IaaS. >> Yeah. 
>> Even though they don't share that with us. That's a huge milestone. You'd think-- There's some people who have, I think, Bob Evans predicted a while ago that Microsoft would surpass AWS in terms of size. You know, what do you think? >> Yeah, I think that Azure's going to keep to-- Keep growing at a pretty good clip. I think that for Azure, they still have really great account control, even though people like to hate Microsoft. The Microsoft sellers that are out there making those companies successful day after day have really done a good job of being in those accounts and helping people. I was recently over in the UK. And the UK market between AWS and Azure is pretty amazing, how much Azure there is. And it's growing within Europe in general. In the states, it's, you know, I think it's growing well. I think it's still growing, probably not as fast as it is outside the U.S. But, you go down to someplace like Australia, it's also Azure. You hear about Azure all the time. >> Why? Is that just because of the Microsoft's software state? It's just so convenient. >> I think it has to do with, you know, and you can go with the reasoning they don't break out, you know, Office 365 and all of that out of their numbers is because they have-- They're in all of these accounts because the office suite is so pervasive in there. So, they always have reasons to go back in and, oh by the way, you're on these old SQL licenses. Let us move you up here and we'll be able to-- We'll support you on the old version, you know, with security and all of these things. And be able to move you forward. So, they have a lot of, I guess you could say, levers to stay in those accounts and be interesting. At least as part of the Cloud estate. I think Amazon, you know, is hitting, you know, the large number. Laws of large numbers. 
But I think that they're also going through, and I think this was seen in the layoffs that they were making, that they're looking to understand and have profitability in more of those services that they have. You know, over 350 odd services that they have. And you know, as somebody who went there and helped to start yet a new one, while I was there. And finally, it went to beta back in September, you start to look at the fact that, that number of services, people, their own sellers don't even know all of their services. It's impossible to comprehend and sell that many things. So, I think what they're going through is really looking to rationalize a lot of what they're doing from a services perspective going forward. They're looking to focus on more profitable services and bringing those in. Because right now it's built like a layer cake where you have, you know, S3 EBS and EC2 on the bottom of the layer cake. And then maybe you have, you're using IAM, the authorization and authentication in there and you have all these different services. And then they call it EMR on top. And so, EMR has to pay for that entire layer cake just to go and compete against somebody like Mongo or something like that. So, you start to unwind the costs of that. Whereas Azure, went and they build basically ground up services for the most part. And Google kind of falls somewhere in between in how they build their-- They're a sort of layer cake type effect, but not as many layers I guess you could say. >> I feel like, you know, Amazon's trying to be a platform for the ecosystem. Yes, they have their own products and they're going to sell. And that's going to drive their profitability cause they don't have to split the pie. But, they're taking a piece of-- They're spinning the meter, as Ziyas Caravalo likes to say on every time Snowflake or Databricks or Mongo or Atlas is, you know, running on their system. They take a piece of the action. Now, Microsoft does that as well. 
But, you look at Microsoft and security, head-to-head competitors, for example, with a CrowdStrike or an Okta in identity. Whereas, it seems like at least for now, AWS is a more friendly place for the ecosystem. At the same time, you do a lot of business in Microsoft. >> Yeah. And I think that a lot of companies have always feared that Amazon would just throw, you know, bodies at it. And I think that people have come to the realization that a two pizza team, as Amazon would call it, is eight people. I think that's, you know, two slices per person. I'm a little bit fat, so I don't know if that's enough. But, you start to look at it and go, okay, if they're going to start out with eight engineers, if I'm a startup and they're part of my ecosystem, do I really fear them or should I really embrace them and try to partner closer with them? And I think the smart people and the smart companies are partnering with them because they're realizing, Amazon, unless they can see it to, you know, a hundred million, $500 million market, they're not going to throw eight to 16 people at a problem. I think when, you know, you could say, you could look at the elastic with OpenSearch and what they did there. And the licensing terms and the battle they went through. But they knew that Elastic had a huge market. Also, you had a number of ecosystem companies building on top of now OpenSearch, that are now domain on top of Amazon as well. So, I think Amazon's being pretty strategic in how they're doing it. I think some of the-- It'll be interesting. I think this year is a payout year for the cuts that they're making to some of the services internally to kind of, you know, how do we take the fat off some of those services that-- You know, you look at Alexa. I don't know how much revenue Alexa really generates for them. But it's a means to an end for a number of different other services and partners. >> What do you make of this ChatGPT? I mean, Microsoft obviously is playing that card. 
You want to, you want ChatGPT in the Cloud, come to Azure. Seems like AWS has to respond. And we know Google is, you know, sharpening its knives to come up with its response. >> Yeah, I mean Google just went and talked about Bard for the first time this week and they're in private preview or I guess they call it beta, but. Right at the moment to select, select AI users, which I have no idea what that means. But that's a very interesting way that they're marketing it out there. But, I think that Amazon will have to respond. I think they'll be more measured than say, what Google's doing with Bard and just throwing it out there to, hey, we're going into beta now. I think they'll look at it and see where do we go and how do we actually integrate this in? Because they do have a lot of components of AI and ML underneath the hood that other services use. And I think that, you know, they've learned from that. And I think that they've already done a good job. Especially for media and entertainment when you start to look at some of the ways that they use it for helping do graphics and helping to do drones. I think part of their buy of iRobot was the fact that iRobot was a big user of RoboMaker, which is using different models to train those robots to go around objects and things like that, so. >> Quick touch on Kubernetes, the whole DevOps World we just covered. The Cloud Native Foundation Security, CNCF. The security conference up in Seattle last week. First time they spun that out kind of like re:Inforce, you know, AWS spins out re:Inforce from re:Invent. Amsterdam's coming up soon, the KubeCon. What should we expect? What's hot in Kube land? >> Yeah, I think, you know, Kubes, you're going to be looking at how OpenShift keeps growing and I think to that respect you get to see the momentum with people like Red Hat. You see others coming up and realizing how OpenShift has gone to market as being, like you were saying, partnering with those Clouds and really making it simple. 
I think the simplicity and the manageability of Kubernetes is going to be at the forefront. I think a lot of the investment is still going into, how do I bring observability and DevOps and AIOps and MLOps all together. And I think that's going to be a big place where people are going to be looking to see what comes out of KubeCon in Amsterdam. I think it's that manageability, ease of use. >> Well Rob, I look forward to working with you on behalf of the whole Cube team. We're going to do more of these and go out to some shows and extract the signal from the noise. Really appreciate you coming into our studio. >> Well, thank you for having me on. Really appreciate it. >> You're really welcome. All right, keep it right there, or thanks for watching. This is Dave Vellante for the Cube. And we'll see you next time. (light music)

Published Date : Feb 7 2023


Breaking Analysis: Cloud players sound a cautious tone for 2023

>> From the Cube Studios in Palo Alto and Boston, bringing you data-driven insights from the Cube and ETR. This is Breaking Analysis with Dave Vellante. >> The unraveling of market enthusiasm continued in Q4 of 2022 with the earnings reports from the US hyperscalers, the big three now all in. As we said earlier this year, even the cloud isn't immune from the macro headwinds, and the cracks in the armor that we saw from the data that we shared last summer, they're playing out into 2023. For the most part actuals are disappointing beyond expectations including our own. It turns out that our estimates for the big three hyperscalers' revenue missed by 1.2 billion or 2.7% lower than we had forecast from even our most recent November estimates. And we expect continued decelerating growth rates for the hyperscalers through the summer of 2023 and we don't think that's going to abate until comparisons get easier. Hello and welcome to this week's Wikibon Cube Insights powered by ETR. In this Breaking Analysis, we share our view of what's happening in cloud markets not just for the hyperscalers but other firms that have hitched a ride on the cloud. And we'll share new ETR data that shows why these trends are playing out, the tactics that customers are employing to deal with their cost challenges and how long the pain is likely to last. You know, riding the cloud wave, it's a two-edged sword. Let's look at the players that have gone all in on or are exposed to both the positive and negative trends of cloud. Look, the cloud has been a huge tailwind for so many companies like Snowflake and Databricks, Workday, Salesforce, Mongo's move with Atlas, Red Hat's Cloud strategy with OpenShift and so forth. And you know, the flip side is because cloud is elastic what comes up can also go down very easily. Here's an XY graphic from ETR that shows spending momentum or net score on the vertical axis and market presence in the dataset on the horizontal axis, pervasion or so-called overlap. 
This is data from the January 2023 survey and the red dotted lines show the positions of several companies that we've highlighted going back to January 2021. So let's unpack this for a bit starting with the big three hyperscalers. The first point is AWS and Azure continue to solidify their moat relative to Google Cloud platform. And we're going to get into this in a moment, but Azure and AWS revenues are five to six times that of GCP for IaaS. And at those deltas, Google should be gaining ground much faster than the big two. The second point on Google is notice the red line on GCP relative to its starting point. While it appears to be gaining ground on the horizontal axis, its net score is now below that of AWS and Azure in the survey. So despite its significantly smaller size it's just not keeping pace with the leaders in terms of market momentum. Now looking at AWS and Microsoft, what we see is basically AWS is holding serve. As we know both Google and Microsoft benefit from including SaaS in their cloud numbers. So the fact that AWS hasn't seen a huge downward momentum relative to its January 2021 position is one positive in the data. And both companies are well above that magic 40% line on the Y-axis, anything above 40% we consider to be highly elevated. But the fact remains that they're down as are most of the names on this chart. So let's take a closer look. I want to start with Snowflake and Databricks. Snowflake, as we reported from several quarters back came down to Earth, it was up in the 80% range in the Y-axis here. And it's still highly elevated in the 60% range and it continues to move to the right, which is positive but as we'll address in a moment its customers can dial down consumption just as in any cloud. Now, Databricks is really interesting. It's not a public company, it never made it to IPO during the sort of tech bubble. So we don't have the same level of transparency that we do with other companies that did make it through. 
But look at how much more prominent it is on the X-axis relative to January 2021. And its net score has basically held up over that period of time. So that's a real positive for Databricks. Next, look at Workday and Salesforce. They've held up relatively well, both inching to the right and generally holding their net scores. Same for Mongo, which is the brown dot above its name that says Elastic, it gets a little crowded, and Elastic's actually the blue dot above it. But generally, SaaS is harder to dial down, Workday, Salesforce, Oracle's SaaS and others. So it's harder to dial down because commitments have been made in advance, they're kind of locked in. Now, one of the discussions from last summer was, is Mongo less discretionary than analytics, i.e. Snowflake? And it's an interesting debate but maybe Snowflake customers, you know, they're also generally committed to a dollar amount. So over time the spending is going to be there. But in the short term, yeah maybe Snowflake customers can dial down. Now that highlighted dotted red line, that bolded one is Datadog and you can see it's made major strides on the X-axis but its net score has decelerated quite dramatically. OpenShift's momentum in the survey has dropped although IBM just announced that OpenShift has a billion dollar ARR and I suspect what's happening there is IBM consulting is bundling OpenShift into its modernization projects. It's got that sort of captive base if you will. And as such it's probably not as top of mind to the respondents but I'll bet you the developers are certainly aware of it. Now the other really notable call out here is Cloudflare. We've reported on them earlier. Cloudflare's net score has held up really well since January of 2021. It really hasn't seen the downdraft of some of these others, but it's making major moves to the right, gaining market presence. We really like how Cloudflare is performing. 
And the last comment is on Oracle which as you can see, despite its much, much lower net score continues to gain ground in the market and thrive from a profitability standpoint. But the data pretty clearly shows that there's a downdraft in the market. Okay, so what's happening here? Let's dig deeper into this data. Here's a graphic from the most recent ETR drill down asking customers that said they were going to cut spending what technique they're using to do so. Now, as we've previously reported, consolidating redundant vendors is by far the most cited approach but there's two key points we want to make here. One is reducing excess cloud resources. As you can see in the bars is the second most cited technique and it's up from the previous polling period. The second we're not showing, you know directly but we've got some red call outs there. Reducing cloud costs jumps to 29% and 28% respectively in financial services and tech telco. And it's much closer to second. It's basically neck and neck with consolidating redundant vendors in those two industries. So they're being really aggressive about optimizing cloud cost. Okay, so as we said, cloud is great 'cause you can dial it up but it's just as easy to dial down. We've identified six factors that customers tell us are affecting their cloud consumption and there are probably more, if you got more we'd love to hear them but these are the ones that are fairly prominent that have hit our radar. First, rising mortgage rates mean banks are processing fewer loans means less cloud. The crypto crash means less trading activity and that means less cloud resources. Third lower ad spend has led companies to reduce not only you know, their ad buying but also their frequency of running their analytics and their calculations. And they're also often using less data, maybe compressing the timeframe of the corpus down to a shorter time period. Also very prominent is down to the bottom left, using lower cost compute instances. 
For example, Graviton from AWS or AMD chips and tiering storage to cheaper S3 or deep archived tiers. And finally, optimizing based on better pricing plans. So customers are moving from, you know, smaller companies in particular moving maybe from on demand or other larger companies that are experimenting using on demand or they're moving to spot pricing or reserved instances or optimized savings plans. That all lowers cost and that means less cloud resource consumption and less cloud revenue. Now in the days when everything was on prem CFOs, what would they do? They would freeze CapEx and IT Pros would have to try to do more with less and often that meant a lot of manual tasks. With the cloud it's much easier to move things around. It still takes some thinking and some effort but it's dramatically simpler to do so. So you can get those savings a lot faster. Now of course the other huge factor is you can cut or you can freeze. And this graphic shows data from a recent ETR survey with 159 respondents and you can see the meaningful uptick in hiring freezes, freezing new IT deployments and layoffs. And as we've been reporting, this has been trending up since earlier last year. And note the call out, this is especially prominent in retail sectors, all three of these techniques jump up in retail and that's a bit of a concern because oftentimes consumer spending helps the economy make a softer landing out of a pullback. But this is a potential canary in the coal mine. If retail firms are pulling back it's because consumers aren't spending as much. And so we're keeping a close eye on that. So let's boil this down to the market data and what this all means. So in this graphic we show our estimates for Q4 IaaS revenues compared to the "actual" IaaS revenues. And we say quote because AWS is the only one that reports, you know clean revenue and IaaS, Azure and GCP don't report actuals. Why would they? Because it would make them look even, you know smaller relative to AWS. 
Rather, they bury the figures in overall cloud which includes their, you know G-Suite for Google and all the Microsoft SaaS. And then they give us little tidbits about in Microsoft's case, Azure, they give growth rates. Google gives kind of relative growth of GCP. So we use survey data and, you know, other data to try to really pinpoint and we've been covering this for, I don't know, five or six years ever since the cloud really became a thing. But looking at the data, we had AWS growing at 25% this quarter and it came in at 20%. So a significant decline relative to our expectations. AWS announced that it exited December, actually, sorry, its January data showed about a 15% mid-teens growth rate. So that's, you know, something we're watching. Azure was two points off our forecast coming in at 38% growth. It said it exited December in the 35% growth range and it said that it's expecting five points of deceleration off of that. So think 30% for Azure. GCP came in three points off our expectation coming in at 35% and Alibaba has yet to report but we've shaved a bit off that forecast based on some survey data and you know what, maybe 9% is even still not enough. Now for the year, the big four hyperscalers generated almost 160 billion of revenue, but that was 7 billion lower than what we expected coming into 2022. For 2023, we're expecting 21% growth for a total of 193.3 billion. And while it's, you know, lower, you know, significantly lower than historical expectations it's still four to five times the overall spending forecast that we just shared with you in our predictions post of between 4 and 5% for the overall market. We think AWS is going to come in at around 93 billion this year with Azure closing in at over 71 billion. This is, again, we're talking IaaS here. Now, despite Amazon focusing investors on the fact that AWS's absolute dollar growth is still larger than its competitors. 
By our estimates Azure will come in at more than 75% of AWS's forecasted revenue. That's a significant milestone. AWS's operating margins, by the way, declined significantly this past quarter, dropping from 30% of revenue to 24%, 30% the year earlier to 24%. Now that's still extremely healthy and we've seen wild fluctuations like this before so I don't get too freaked out about that. But I'll say this, Microsoft has a marginal cost advantage relative to AWS because one, it has a captive cloud on which to run its massive software estate. So it can just throw software at its own cloud, and two, software marginal economics. Despite AWS's awesomeness and high degrees of automation, software is just a better business. Now the upshot for AWS is the ecosystem. AWS is essentially in our view positioning very smartly as a platform for data partners like Snowflake and Databricks, security partners like CrowdStrike and Okta and Palo Alto and many others and SaaS companies. You know, Microsoft is more competitive even though AWS does have competitive products. Now of course Amazon's competitive to retail companies so that's another factor but generally speaking for tech players, Amazon is a really thriving ecosystem that is a secret weapon in our view. AWS is happy to spin the meter with its partners even though it sells competitive products, you know, more so in our view than other cloud players. Microsoft, of course, don't forget, is hyping OpenAI and ChatGPT now, we're hearing a lot about that, as we reported last week in our predictions post. OpenAI has shot up in terms of market sentiment in ETR's emerging technology company surveys and people are moving to Azure to get OpenAI and get ChatGPT, that is an interesting lever. Amazon in our view has to have a response. They have lots of AI and they're going to have to make some moves there. Meanwhile, Google is emphasizing itself as an AI first company. 
In fact, Google spent at least five minutes of continuous dialogue, nonstop on its AI chops during its latest earnings call. So that's an area that we're watching very closely as the buzz around large language models continues. All right, let's wrap up with some assumptions for 2023. We think SaaS players are going to continue to be sticky. They're going to be somewhat insulated from all these downdrafts because they're so tied in and customers, you know they make the commitment up front, you've got the lock in. Now having said that, we do expect some backlash over time on the onerous and generally customer unfriendly pricing models of most large SaaS companies. But that's going to play out over a longer period of time. Now for cloud generally and the hyperscalers specifically we do expect decelerating growth rates into Q3 but the amplitude of the demand swings from this rubber band economy, we expect to continue to compress and become more predictable throughout the year. Estimates are coming down, CEOs we think are going to be more cautious when the market snaps back, more cautious about hiring and spending, and as such we expect perhaps a more orderly return to growth which we think will slightly accelerate in Q4 as comps get easier. Now of course the big risk to these scenarios is the economy, the FED, consumer spending, inflation, supply chain, energy prices, wars, geopolitics, China relations, you know, all the usual stuff. But as always with our partners at ETR and the Cube community, we're here for you. We have the data and we'll be the first to report when we see a change at the margin. Okay, that's a wrap for today. I want to thank Alex Morrison who's on production and manages the podcast, Ken Schiffman as well out of our Boston studio getting this up on LinkedIn Live. Thank you for that. Kristen Martin also and Cheryl Knight help get the word out on social media and in our newsletters. 
And Rob Hof is our Editor-in-Chief over at siliconangle.com. He does some great editing for us. Thank you all. Remember all these episodes are available as podcasts. Wherever you listen, just search Breaking Analysis podcast. I publish each week on wikibon.com and at siliconangle.com, where you can see all the data. And if you want to get in touch, all you have to do is email me at david.vellante@siliconangle.com or DM me @dvellante. If you've got something interesting, I'll respond. If you don't, it's either 'cause I'm swamped or it's just not tickling me. You can comment on our LinkedIn post as well. And please check out ETR.ai for the best survey data in the enterprise tech business. This is Dave Vellante for theCUBE Insights powered by ETR. Thanks for watching and we'll see you next time on Breaking Analysis. (gentle upbeat music)

Published Date : Feb 4 2023

SUMMARY :

From the Cube Studios and how long the pain is likely to last.

ENTITIES

Entity	Category	Confidence
Alex Morrison	PERSON	0.99+
AWS	ORGANIZATION	0.99+
Alibaba	ORGANIZATION	0.99+
Cheryl Knight	PERSON	0.99+
Kristen Martin	PERSON	0.99+
Dave Vellante	PERSON	0.99+
Ken Schiffman	PERSON	0.99+
January 2021	DATE	0.99+
Microsoft	ORGANIZATION	0.99+
Google	ORGANIZATION	0.99+
Rob Hof	PERSON	0.99+
2.7%	QUANTITY	0.99+
January	DATE	0.99+
Amazon	ORGANIZATION	0.99+
December	DATE	0.99+
January of 2021	DATE	0.99+
five	QUANTITY	0.99+
January 2023	DATE	0.99+
Snowflake	ORGANIZATION	0.99+
Palo Alto	LOCATION	0.99+
1.2 billion	QUANTITY	0.99+
20%	QUANTITY	0.99+
IBM	ORGANIZATION	0.99+
Databricks	ORGANIZATION	0.99+
29%	QUANTITY	0.99+
30%	QUANTITY	0.99+
six factors	QUANTITY	0.99+
second point	QUANTITY	0.99+
24%	QUANTITY	0.99+
2022	DATE	0.99+
david.vellante@siliconangle.com	OTHER	0.99+
X-axis	ORGANIZATION	0.99+
2023	DATE	0.99+
28%	QUANTITY	0.99+
193.3 billion	QUANTITY	0.99+
ETR	ORGANIZATION	0.99+
38%	QUANTITY	0.99+
7 billion	QUANTITY	0.99+
21%	QUANTITY	0.99+
Earth	LOCATION	0.99+
25%	QUANTITY	0.99+
Mongo	ORGANIZATION	0.99+
Oracle	ORGANIZATION	0.99+
Atlas	ORGANIZATION	0.99+
two industries	QUANTITY	0.99+
last week	DATE	0.99+
six years	QUANTITY	0.99+
first point	QUANTITY	0.99+
Red Hats	ORGANIZATION	0.99+
35%	QUANTITY	0.99+
four	QUANTITY	0.99+
159 respondents	QUANTITY	0.99+
Okta	ORGANIZATION	0.99+

Jon Turow, Madrona Venture Group | CloudNativeSecurityCon 23

(upbeat music) >> Hello and welcome back to theCUBE. We're here in Palo Alto, California. I'm your host, John Furrier, with a special guest here in the studio. As part of our CloudNativeSecurityCon coverage we had an opportunity to bring in Jon Turow, who is a partner at Madrona Venture Group, formerly with AWS, to talk about machine learning, foundational models, and how the future of AI is going to be impacted by some of the innovation around what's going on in the industry. ChatGPT has taken the world by storm. A million downloads, fastest to the million downloads there. Before, some were saying it's just a gimmick. Others are saying it's a game changer. Jon's here to break it down, and great to have you on. Thanks for coming in. >> Thanks John. Glad to be here. >> Thanks for coming on. So first of all, I'm glad you're here. First of all, because two things. One, you were formerly with AWS, got a lot of experience running projects at AWS. Now a partner at Madrona, a great firm doing great deals, and they had this future of modern applications kind of thesis. Now you are putting out some content recently around foundational models. You're deep into computer vision. You were the IoT general manager at AWS, among other things, Greengrass. So you know a lot about data. You know a lot about some of this automation, some of the edge stuff. You've been in the middle of all these kind of areas that now seem to be the next wave coming. So I wanted to ask you what your thoughts are of how the machine learning and this new automation wave is coming in, as these AI tools are coming out. Is it a platform? Is it going to be smarter? What feeds AI? What's your take on this whole foundational big movement into AI? What's your general reaction to all this? >> So, thanks, John, again for having me here. Really excited to talk about these things. AI has been coming for a long time. It's been kind of the next big thing. Always just over the horizon for quite some time. 
And we've seen really compelling applications in generations before and until now. Amazon and AWS have introduced a lot of them. My firm, Madrona Venture Group, has invested in some of those early players as well. But what we're seeing now is something categorically different. That's really exciting and feels like a durable change. And I can try and explain what that is. We have these really large models that are useful in a general way. They can be applied to a lot of different tasks beyond the specific task that the designers envisioned. That makes them more flexible, that makes them more useful for building applications than what we've seen before. And so, we can talk about the depths of it, but in a nutshell, that's why I think people are really excited. >> And I think one of the things that you wrote about that jumped out at me is that this seems to be this moment where there's been multiple decades of nerds and computer scientists and programmers and data thinkers around waiting for AI to blossom. And it's like they're scratching that itch. Every year is going to be the year, and it's like the bottleneck's always been compute power. And we've seen other areas, genome sequencing, all kinds of high computation things that required high-performance computing. But now there's no real bottleneck to compute. You got cloud. And so you're starting to see the emergence of a massive acceleration of where AI's been and where it needs to be going. Now, it's almost like it's got a reboot. It's almost a renaissance in the AI community with a whole nother set of macro environmental things happening. Cloud, younger generation, applications proliferating from mobile to cloud native. It's the perfect storm for this kind of moment to switch over. Am I overreading that? Is that right? >> You're right. And it's been cooking for a cycle or two. And let me try and explain why that is. 
We have cloud, and AWS launched in, whatever it was, 2006, and offered more compute to more people than really was possible before. Initially that was about taking existing applications and running them more easily at a bigger scale. But in that period of time what's also become possible is new kinds of computation that really weren't practical or even possible without that vast amount of compute. And so one result that came of that is something called the transformer AI model architecture. And Google came out with that, published a paper in 2017. And what that says is, with a transformer model you can actually train an arbitrarily large amount of data into a model, and see what happens. That's what Google demonstrated in 2017. The "what happens" is the really exciting part, because when you do that, what you start to see, when models exceed a certain size that we had never really seen before, is that all of a sudden they get what we call emergent capabilities of complex reasoning and reasoning outside a domain and reasoning with data. The kinds of things that people describe as spooky when they play with something like ChatGPT. That's the underlying term. We don't as an industry quite know why it happens or how it happens, but we can measure that it does. So cloud enables new kinds of math and science. New kinds of math and science allow new kinds of experimentation. And that experimentation has led to this new generation of models. >> So one of the debates we had on theCUBE at our Supercloud event last month was, what are the barriers to entry for, say, OpenAI, for instance? Obviously, I weighed in aggressively and said, "The barriers for getting into cloud are high because of all the CapEx." And Howie Xu, formerly VMware, now at Zscaler, he's an AI machine learning guy. He was like, "Well, you can spend $100 million and replicate it." I saw a quote that set up for 180,000 I can get this other package. What are the barriers to entry? Is ChatGPT or OpenAI, does it have sustainability? 
Is it easy to get into? What is the market like for AI? I mean, because a lot of entrepreneurs are jumping in. I mean, I just read a story today. San Francisco's got more inbound migration because of the AI action happening, Seattle's booming, Boston with MIT's been working on neural networks for generations. That's what we've found the answer. Get off the neural network, Boston jump on the AI bus. So there's total excitement for this. People are enthusiastic around this area. >> You can think of an iPhone versus Android tension that's happening today. In the iPhone world, there are proprietary models from OpenAI, who you might consider as the leader. There's Cohere, there's AI21, there's Anthropic, Google's going to have their own, and a few others. These are proprietary models that developers can build on top of and get started really quickly. They're measured to have the highest accuracy and the highest performance today. That's the proprietary side. On the other side, there is an open source part of the world. These are a proliferation of model architectures that developers and practitioners can take off the shelf and train themselves. Typically found on Hugging Face. What people seem to think is that the accuracy and performance of the open source models is something like 18 to 20 months behind the accuracy and performance of the proprietary models. But on the other hand, there's infinite flexibility for teams that are capable enough. So you're going to see teams choose sides based on whether they want speed or flexibility. >> That's interesting. And that brings up a point. I was talking to a startup and the debate was, do you abstract away from the hardware and be software-defined or software-led on the AI side and let the hardware side just extremely accelerate on its own, 'cause it's a flywheel? So again, back to proprietary, that's with hardware kind of bundled in, bolted on. Is it an accelerator or is it bolted on or is it part of it? 
So to me, I think that the big struggle in understanding this is which one will end up being right. I mean, is it a Betamax versus VHS kind of thing going on? Or iPhone, Android? I mean, iPhone makes a lot of sense if you're Apple, but is there an Apple moment in machine learning? >> In proprietary models, there does seem to be a jump ball. There's going to be a virtuous flywheel that emerges. For example, all this excitement about ChatGPT. What's really exciting about it is it's really easy to use. The technology isn't so different from what we've seen before, even from OpenAI. You mentioned a million users in a short period of time, all providing training data for OpenAI that makes their underlying models, their next generation, even better. So it's not unreasonable to guess that there's going to be power laws that emerge on the proprietary side. What I think history has shown is that with iPhone, Android, Windows, Linux, there seems to be gravity towards this yin and yang. And my guess, and what other people seem to think is going to be the case, is that we're going to continue to see these two poles of AI. >> So let's get into the relationship with data, because I've been immersing myself in ChatGPT, fascinated by the ease of use, yes, but also the fidelity of how you query it. And it felt like when I was writing SQL back in the eighties and nineties, when SQL was emerging. You had to be really a guru at SQL to get the answers you wanted. It seems like the querying into ChatGPT is a good thing if you know how to talk to it. Labeling whether your input is and it does a great job if you feed it right. If you ask a generic question like on Google, it's like a Google search. It gives you great format, sounds credible, but the facts are kind of wrong. >> That's right. >> That's where general consensus is coming on. So what does that mean? That means people are on one hand saying, "Ah, it's bullshit 'cause it's wrong." 
But I look at it, I'm like, "Wow, that's compelling." 'Cause if you feed it the right data, so now we're in the data modeling here, so the role of data's going to be critical. Is there a data operating system emerging? Because if this thing continues to go the way it's going, you can almost imagine as you would look at companies to invest in. Who's going to be right on this? What's going to scale? What's sustainable? What could build a durable company? It might not look like what people think it is. I mean, I remember when Google started everyone thought it was the worst search engine because it wasn't a portal. But it was the best organic search on the planet and became successful. So I'm trying to figure out like, okay, how do you read this? How do you read the tea leaves? >> Yeah. There are a few different ways that companies can differentiate themselves. Teams with galactic capabilities can take an open source model and then change the architecture and retrain and go down to the silicon. They can do things that might not have been possible for other teams to do. There's a company that we're proud to be investors in called RunwayML that provides video accelerated, sorry, AI accelerated video editing capabilities. They were used in "Everything Everywhere All at Once" and some others. In order to build RunwayML, they needed a vision of what the future was going to look like and they needed to make deep contributions to the science that was going to enable all that. But not every team has those capabilities, maybe nor should they. So as far as how other teams are going to differentiate, there's a couple of things that they can do. One is called prompt engineering, where they shape on behalf of their own users exactly how the prompt gets fed to the underlying model. It's not clear whether that's going to be a durable problem or whether, like with Google, we consumers are going to start to get more intuitive about this. That's one. 
The second is what's called information retrieval. How can I get information about the world outside, information from a database or a data store or whatever service, into these models so they can reason about them? And the third is, this is going to sound funny, but attribution. Just like you would do in a news report or an academic paper. If you can state where your facts are coming from, the downstream consumer or the human being who has to use that information actually is going to be able to make better sense of it and rely better on it. So that's prompt engineering, that's retrieval, and that's attribution. >> So that brings me to my next point I want to dig in on, which is the foundational model stack that you published. And I'll start by saying that with ChatGPT, if you take out the naysayers who are like throwing cold water on it about being a gimmick or whatever, and then you got the other side, I would call the alpha nerds who are like, they can see, "Wow, this is amazing." This is truly NextGen. This isn't yesterday's chatbot nonsense. They're like, they're all over it. It's that everybody's using it right now in every vertical. I heard someone using it for security logs. I heard a data center hardware vendor using it for pushing out appsec review updates. I mean, I've heard corner cases. We're using it for theCUBE to put our metadata in. So there's a horizontal use case of value. So to me that tells me there's a market there. So when you have horizontal scalability in the use case you're going to have a stack. So you publish this stack and it has an application at the top, applications like Jasper out there. You're seeing ChatGPT. But you go after the bottom, you got silicon, cloud, foundational model operations, the foundational models themselves, tooling, sources, actions. Where'd you get this from? How'd you put this together? Did you just work backwards from the startups or was there a thesis behind this? 
Could you share your thoughts behind this foundational model stack? >> Sure. Well, I'm a recovering product manager, and what I think about as a product manager is who is my customer and what problem does he want to solve. And so to put myself in the mindset of an application developer and a founder, who is actually my customer as a partner at Madrona, I think about what technology and resources she needs to be really powerful, to be able to take a brilliant idea and actually bring that to life. And if you spend time with that community, which I do, and I've met with hundreds of founders now who are trying to do exactly this, you can see that the stack is emerging. In fact, we first drew it not in January 2023, but in October 2022. And if you look at the difference between the October '22 and January '23 stacks, you're going to see that holes in the stack that we identified in October around tooling and around foundation model ops and the rest are organically starting to get filled because of how much demand there is from the developers at the top of the stack. >> If you look at the young generation coming out and even some of the analysts, I was just reading an analyst report on who's following the whole data stacks area, Databricks, Snowflake, there's a variety of analytics, realtime AI, data's hot. There's a lot of engineers coming out that were either data scientists or, I would call them, data platform engineering folks who are becoming very key resources in this area. What's the skillset emerging and what's the mindset of that entrepreneur that sees the opportunity? How do these startups come together? Is there a pattern in the formation? Is there a pattern in the competency or proficiency around the talent behind these ventures? >> Yes. I would say there's two groups. The first is a very distinct pattern, John. For the past 10 years or a little more we've seen a pattern of democratization of ML where more and more people had access to this powerful science and technology. 
And since about 2017, with the rise of the transformer architecture in these foundation models, that pattern has reversed. All of a sudden what had become broader access is now shrinking to a pretty small group of scientists who can actually train and manipulate the architectures of these models themselves. So that's one. And what that means is the teams who can do that have a huge ability to make the future happen in ways that other people don't have access to yet. That's one. The second is there is a broader population of people who by definition have even more collective imagination, 'cause there's even more people, who see what should be possible and can use things like the proprietary models, like the OpenAI models that are available off the shelf, and try to create something that maybe nobody has seen before. And when they do that, Jasper AI is a great example of that. Jasper AI is a company that creates marketing copy automatically with generative models such as GPT-3. They do that and it's really useful and it's almost fun for a marketer to use that. But there are going to be questions of how they can defend that against someone else who has access to the same technology. It's a different population of founders who have to find other sources of differentiation without being able to go all the way down to the silicon and the science. >> Yeah, and it's going to be also, opportunity recognition is one thing. Building a viable venture, product market fit. You got competition. And so when things get crowded you got to have some differentiation. I think that's going to be the key. And that's where I was trying to figure out, and I think data with scale, I think are big ones. Where's the vulnerability in the stack in terms of gaps? Where's the white space? I shouldn't say vulnerability. I should say where's the opportunity, where's the white space in the stack that you see opportunities for entrepreneurs to attack? >> I would say there's two. 
At the application level, there is almost infinite opportunity, John, because almost every kind of application is about to be reimagined or disrupted with a new generation that takes advantage of this really powerful new technology. And so if there is a kind of application in almost any vertical, it's hard to rule something out. Almost any vertical that a founder wishes she had created the original app in, well, now it's her time. So that's one. The second is, if you look at the tooling layer that we discussed, tooling is a really powerful way that you can provide more flexibility to app developers to get more differentiation for themselves. And the tooling layer is still forming. This is the interface between the models themselves and the applications. Tools that help bring in data, as you mentioned, connect to external actions, bring context across multiple calls, chain together multiple models. These kinds of things, there's huge opportunity there. >> Well, Jon, I really appreciate you coming in. I had a couple more questions, but I will take a minute to read some of your bio for the audience and we'll get into, I won't embarrass you, but I want to set the context. You said you were a recovering product manager, 10 plus years at AWS. Obviously, recovering from AWS, which is a whole nother dimension of recovering. In all seriousness, I talked to Andy Jassy around that time and Dr. Matt Wood, and it was about that time when AI was just getting on the radar when they started. So you guys started seeing the wave coming in early on. So I remember at that time as Amazon was starting to grow significantly, and even just stock price and overall growth. From a tech perspective, it was pretty clear what was coming, so you were there when this tsunami hit. >> Jon: That's right. >> And you had a front row seat building tech. You led the product teams for Computer Vision AI, Textract, AI intelligence for document processing, Rekognition for image and video analysis. 
You wrote the business product plan for AWS IoT and Greengrass, which we've covered a lot in theCUBE, which extends out to the whole edge thing. So you know a lot about AI/ML, edge computing, IoT, messaging, which I call the law of small numbers that at scale become big. This is a big new thing. So as a former AWS leader who's been there and now at Madrona, what's your investment thesis as you start to peruse the landscape and talk to entrepreneurs as you got the stack? What's the big picture? What are you looking for? What's the thesis? How do you see this next five years emerging? >> Five years is a really long time given some of this science is only six months out. I'll start with some, no pun intended, some foundational things. And we can talk about some implications of the technology. The basics are the same as they've always been. We want what I like to call customers with their hair on fire. So they have problems so urgent they'll buy half a product. The joke is if your hair is on fire you might want a bucket of cold water, but you'll take a tennis racket and you'll beat yourself over the head to put the fire out. You want those customers 'cause they'll meet you more than halfway. And when you find them, you can obsess about them and you can get better every day. So we want customers with their hair on fire. We want founders who have empathy for those customers, understand what is going to be required to serve them really well, and have what I like to call founder-market fit to be able to build the products that those customers are going to need. >> And because that's a good strategy from an emerging, not yet fully baked out requirements definition. >> Jon: That's right. >> Enough where directionally they're leaning in, more than in, they're part of the product development process. >> That's right. 
And when you're doing early stage development, which is where I personally spend a lot of my time, at the seed and A and a little bit beyond that stage, often that's going to be what you have to go on, because the future is going to be so complex that you can't see the curves beyond it. But if you have customers with their hair on fire and talented founders who have the capability to serve those customers, that's got me interested. >> So if I'm an entrepreneur, I walk in and say, "I have customers that have their hair on fire." What kind of checks do you write? What's kind of the average you're seeing for seed and series? Probably seed, seed rounds and series As. >> It can depend. I have seen seed rounds of double digit million dollars. I have seen seed rounds much smaller than that. It really depends on what is going to be the right thing for these founders to prove out the hypothesis that they're testing that says, "Look, we have this customer with her hair on fire. We think we can build at least a tennis racket that she can use to start beating herself over the head and put the fire out. And then we're going to have something really interesting that we can scale up from there and we can make the future happen." >> So it sounds like your advice to founders is go out and find some customers, show them a product, don't obsess over full completion, get some sort of vibe on fit and go from there. >> Yeah, and I think by the time founders come to me they may not have a product, they may not have a deck, but if they have a customer with her hair on fire, then I'm really interested. >> Well, I always love the professional services angle on these markets. You go in and you get some business and you understand it. Walk away if you don't like it, but you see the hair on fire, then you go in product mode. >> That's right. >> All right, Jon, thank you for coming on theCUBE. Really appreciate you stopping by the studio and good luck on your investments. Great to see you. 
>> You too. >> Thanks for coming on. >> Thank you, Jon. >> CUBE coverage here at Palo Alto. I'm John Furrier, your host. More coverage with CUBE Conversations after this break. (upbeat music)

Published Date : Feb 2 2023

SUMMARY :

and great to have you on. that now seem to be the next wave coming. It's been kind of the next big thing. is that this seems to be this moment and offered more compute to more people What's the barriers to entry? is that the accuracy and the debate was, do you that there's going to be power laws but also the fidelity of how you query it. going to be critical. exactly how the prompt to get So that brings me to my next point and actually bring that to life. and even some of the analysts, But there are going to be questions Yeah, and it's going to be and the applications. the radar when they started. and talk to entrepreneurs the head to put the fire out. And because that's a good of the product development process. that you can't see the curves beyond it. What kind of checks do you write? and put the fire out. to founders is go out time founders come to me and you understand it. stopping by the studio More coverage with CUBE

ENTITIES

Entity	Category	Confidence
Amazon	ORGANIZATION	0.99+
Jon	PERSON	0.99+
AWS	ORGANIZATION	0.99+
John	PERSON	0.99+
John Furrier	PERSON	0.99+
Andy Jassy	PERSON	0.99+
2017	DATE	0.99+
January 2023	DATE	0.99+
Jon Turow	PERSON	0.99+
October	DATE	0.99+
18	QUANTITY	0.99+
MIT	ORGANIZATION	0.99+
$100 million	QUANTITY	0.99+
Palo Alto	LOCATION	0.99+
10 plus years	QUANTITY	0.99+
iPhone	COMMERCIAL_ITEM	0.99+
Google	ORGANIZATION	0.99+
two	QUANTITY	0.99+
October 2022	DATE	0.99+
hundreds	QUANTITY	0.99+
Madrona	ORGANIZATION	0.99+
Apple	ORGANIZATION	0.99+
Madrona Venture Partners	ORGANIZATION	0.99+
January '23	DATE	0.99+
two groups	QUANTITY	0.99+
Matt Wood	PERSON	0.99+
Madrona Venture Group	ORGANIZATION	0.99+
180,000	QUANTITY	0.99+
October '22	DATE	0.99+
Jasper	TITLE	0.99+
Palo Alto, California	LOCATION	0.99+
six months	QUANTITY	0.99+
2006	DATE	0.99+
million downloads	QUANTITY	0.99+
Five years	QUANTITY	0.99+
SQL	TITLE	0.99+
last month	DATE	0.99+
two poles	QUANTITY	0.99+
first	QUANTITY	0.99+
Howie Xu	PERSON	0.99+
VMware	ORGANIZATION	0.99+
third	QUANTITY	0.99+
20 months	QUANTITY	0.99+
Greengrass	ORGANIZATION	0.99+
Madrona Venture Group	ORGANIZATION	0.98+
second	QUANTITY	0.98+
One	QUANTITY	0.98+
Supercloud	EVENT	0.98+
RunwayML	TITLE	0.98+
San Francisco	LOCATION	0.98+
ZScaler	ORGANIZATION	0.98+
yesterday	DATE	0.98+
one	QUANTITY	0.98+
First	QUANTITY	0.97+
CapEx	ORGANIZATION	0.97+
eighties	DATE	0.97+
ChatGPT	TITLE	0.96+
Dr.	PERSON	0.96+
