Bob Muglia, George Gilbert & Tristan Handy | How Supercloud will Support a new Class of Data Apps
(upbeat music) >> Hello, everybody. This is Dave Vellante. Welcome back to Supercloud2, where we're exploring the intersection of data analytics and the future of cloud. In this segment, we're going to look at how the Supercloud will support a new class of applications, not just work that runs on multiple clouds, but rather a new breed of apps that can orchestrate things in the real world. Think Uber for many types of businesses. These applications, they're not about codifying forms or business processes. They're about orchestrating people, places, and things in a business ecosystem. And I'm pleased to welcome my colleague and friend, George Gilbert, former Gartner Analyst, Wikibon market analyst, former equities analyst as my co-host. And we're thrilled to have Tristan Handy, who's the founder and CEO of DBT Labs and Bob Muglia, who's the former President of Microsoft's Enterprise business and former CEO of Snowflake. Welcome all, gentlemen. Thank you for coming on the program. >> Good to be here. >> Thanks for having us. >> Hey, look, I'm going to start actually with the SuperCloud because both Tristan and Bob, you've read the definition. Thank you for doing that. And Bob, you have some really good input, some thoughts on maybe some of the drawbacks and how we can advance this. So what are your thoughts in reading that definition around SuperCloud? >> Well, I thought first of all that you did a very good job of laying out all of the characteristics of it and helping to define it overall. But I do think it can be tightened a bit, and I think it's helpful to do it in as short a way as possible. And so in the last day I've spent a little time thinking about how to take it and write a crisp definition. And here's my go at it. This is one day old, so gimme a break if it's going to change. And of course we have to follow the industry, and so that, and whatever the industry decides, but let's give this a try. So in the way I think you're defining it, what I would say is a SuperCloud is a platform that provides programmatically consistent services hosted on heterogeneous cloud providers. >> Boom. Nice. Okay, great. I'm going to go back and read the script on that one and tighten that up a bit. Thank you for spending the time thinking about that. Tristan, would you add anything to that or what are your thoughts on the whole SuperCloud concept? >> So as I read through this, I fully realize that we need a word for this thing because I have experienced the inability to talk about it as well. But for many of us who have been living in the Confluent, Snowflake, you know, this world of like new infrastructure, this seems fairly uncontroversial. Like I read through this, and I'm just like, yeah, this is like the world I've been living in for years now. And I noticed that you called out Snowflake for being an example of this, but I think that there are like many folks, myself included, for whom this world like fully exists today. >> Yeah, I think that's a fair, I dunno if it's criticism, but people observe, well, what's the big deal here? It's just kind of what we're living in today. It reminds me of, you know, Tim Berners-Lee saying, well, this is what the internet was supposed to be. It was supposed to be Web 2.0, so maybe this is what multi-cloud was supposed to be. Let's turn our attention to apps. Bob first and then go to Tristan. Bob, what are data apps to you? When people talk about data products, is that what they mean? Are we talking about something more, different? What are data apps to you?
>> Well, to understand data apps, it's useful to contrast them to something, and I just use the simple term people apps. I know that's a little bit awkward, but it's clear. And almost everything we work with, almost every application that we're familiar with, be it email or Salesforce or any consumer app, those are applications that are targeted at responding to people. You know, in contrast, a data application reacts to changes in data and uses some set of analytic services to autonomously take action. So where applications that we're familiar with respond to people, data apps respond to changes in data. And they both do something, but they do it for different reasons. >> Got it. You know, George, you and I were talking about, you know, it comes back to SuperCloud, broad definition, narrow definition. Tristan, how do you see it? Do you see it the same way? Do you have a different take on data apps? >> Oh, geez. This is like a conversation that I don't know has an end. It's like been, I write a substack, and there's like this little community of people who all write substack. We argue with each other about these kinds of things. Like, you know, as many different takes on this question as you can find, but the way that I think about it is that data products are atomic units of functionality that are fundamentally data driven in nature. So a data product can be as simple as an interactive dashboard that is like actually had design thinking put into it and serves a particular user group and has like actually gone through kind of a product development life cycle. And then a data app or data application is a kind of cohesive end-to-end experience that often encompasses like many different data products. So from my perspective there, this is very, very related to the way that these things are produced, the kinds of experiences that they're provided, that like data innovates every product that we've been building in, you know, software engineering for, you know, as long as there have been computers. >> You know, Zhamak Dehghani oftentimes uses the, you know, she doesn't name Spotify, but I think it's Spotify as that kind of example she uses. But I wonder if we can maybe try to take some examples. If you take, like George, if you take a CRM system today, you're inputting leads, you got opportunities, it's driven by humans, they're really inputting the data, and then you got this system that kind of orchestrates the business process, like runs a forecast. But in this data driven future, are we talking about the app itself pulling data in and automatically looking at data from the transaction systems, the call center, the supply chain and then actually building a plan? George, is that how you see it? >> I go back to the example of Uber, may not be the most sophisticated data app that we build now, but it was like one of the first where you do have users interacting with their devices as riders trying to call a car or driver. But the app then looks at the location of all the drivers in proximity, and it matches a driver to a rider. It calculates an ETA to the rider. It calculates an ETA then to the destination, and it calculates a price. Those are all activities that are done sort of autonomously that don't require a human to type something into a form. The application is using changes in data to calculate an analytic product and then to operationalize that, to assign the driver to, you know, calculate a price. Those are, that's an example of what I would think of as a data app.
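As a minimal illustration of the kind of data app described above, here is a toy sketch in Python (not Uber's actual system; every name, number, and formula is hypothetical) of an application that is triggered by a change in data (a new ride request) and then autonomously matches a driver, estimates ETAs, and computes a price, with no human typing into a form:

```python
# Toy sketch of a "data app": the work starts when the data changes,
# not when a person submits a form. All names and numbers are hypothetical.
from dataclasses import dataclass
import math

@dataclass
class Driver:
    driver_id: str
    lat: float
    lon: float

@dataclass
class RideRequest:
    rider_id: str
    lat: float
    lon: float
    dest_lat: float
    dest_lon: float

def distance_km(lat1, lon1, lat2, lon2):
    # Rough planar approximation; good enough for an illustration.
    return math.hypot(lat1 - lat2, lon1 - lon2) * 111.0

def handle_data_change(drivers, request, speed_kmh=30.0, base_fare=2.5, per_km=1.2):
    """React to a new ride request: match the nearest driver, estimate ETAs,
    and compute a price, all autonomously."""
    nearest = min(drivers, key=lambda d: distance_km(d.lat, d.lon, request.lat, request.lon))
    pickup_km = distance_km(nearest.lat, nearest.lon, request.lat, request.lon)
    trip_km = distance_km(request.lat, request.lon, request.dest_lat, request.dest_lon)
    return {
        "assigned_driver": nearest.driver_id,
        "eta_to_rider_min": round(60.0 * pickup_km / speed_kmh, 1),
        "eta_to_destination_min": round(60.0 * (pickup_km + trip_km) / speed_kmh, 1),
        "price": round(base_fare + per_km * trip_km, 2),
    }

drivers = [Driver("d1", 37.77, -122.42), Driver("d2", 37.80, -122.27)]
request = RideRequest("r1", 37.78, -122.41, 37.76, -122.39)
print(handle_data_change(drivers, request))
```

The point of the sketch is the trigger: the application reacts to a change in data and uses a small analytic calculation to take action, which is the contrast Bob draws with people apps.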
And my question then I guess for Tristan is if we don't have all the pieces in place for sort of mainstream companies to build those sorts of apps easily yet, like how would we get started? What's the role of a semantic layer in making that easier for mainstream companies to build? And how do we get started, you know, say with metrics? How does that, how does that take us down that path? >> So what we've seen in the past, I dunno, decade or so, is that one of the most successful business models in infrastructure is taking hard things and rolling 'em up behind APIs. You take messaging, you take payments, and you all of a sudden increase the capability of kind of your median application developer. And you say, you know, previously you were spending all your time being focused on how do you accept credit cards, how do you send SMS payments, and now you can focus on your business logic, and just create the thing. One of, interestingly, one of the things that we still don't know how to API-ify is concepts that live inside of your data warehouse, inside of your data lake. These are core concepts that, you know, you would imagine that the business would be able to create applications around very easily, but in fact that's not the case. It's actually quite challenging to, and involves a lot of data engineering pipeline and all this work to make these available. And so if you really want to make it very easy to create some of these data experiences for users, you need to have an ability to describe these metrics and then to turn them into APIs to make them accessible to application developers who have literally no idea how they're calculated behind the scenes, and they don't need to. >> So how rich can that API layer grow if you start with metric definitions that you've defined? And DBT has, you know, the metric, the dimensions, the time grain, things like that, that's a well scoped sort of API that people can work within. How much can you extend that to say non-calculated business rules or governance information like data reliability rules, things like that, or even, you know, features for an AIML feature store. In other words, it starts, you started pragmatically, but how far can you grow? >> Bob is waiting with bated breath to answer this question. I'm, just really quickly, I think that we as a company and DBT as a product tend to be very pragmatic. We try to release the simplest possible version of a thing, get it out there, and see if people use it. But the idea that, the concept of a metric is really just a first landing pad. The really, there is a physical manifestation of the data and then there's a logical manifestation of the data. And what we're trying to do here is make it very easy to access the logical manifestation of the data, and metric is a way to look at that. Maybe an entity, a customer, a user is another way to look at that. And I'm sure that there will be more kind of logical structures as well. >> So, Bob, chime in on this. You know, what's your thoughts on the right architecture behind this, and how do we get there? >> Yeah, well first of all, I think one of the ways we get there is by what companies like DBT Labs and Tristan is doing, which is incrementally taking and building on the modern data stack and extending that to add a semantic layer that describes the data. Now the way I tend to think about this is a fairly major shift in the way we think about writing applications, which is today a code first approach to moving to a world that is model driven. 
And I think that's what the big change will be is that where today we think about data, we think about writing code, and we use that to produce APIs as Tristan said, which encapsulates those things together in some form of services that are useful for organizations. And that idea of that encapsulation is never going to go away. It's very, that concept of an API is incredibly useful and will exist well into the future. But what I think will happen is that in the next 10 years, we're going to move to a world where organizations are defining models first of their data, but then ultimately of their business process, their entire business process. Now the concept of a model driven world is a very old concept. I mean, I first started thinking about this and playing around with some early model driven tools, probably before Tristan was born in the early 1980s. And those tools didn't work because the semantics associated with executing the model were too complex to be written in anything other than a procedural language. We're now reaching a time where that is changing, and you see it everywhere. You see it first of all in the world of machine learning and machine learning models, which are taking over more and more of what applications are doing. And I think that's an incredibly important step. And learned models are an important part of what people will do. But if you look at the world today, I will claim that we've always been modeling. Modeling has existed in computers since there have been integrated circuits and any form of computers. But what we do is what I would call implicit modeling, which means that it's the model is written on a whiteboard. It's in a bunch of Slack messages. It's on a set of napkins in conversations that happen and during Zoom. That's where the model gets defined today. It's implicit. There is one in the system. It is hard coded inside application logic that exists across many applications with humans being the glue that connects those models together. And really there is no central place you can go to understand the full attributes of the business, all of the business rules, all of the business logic, the business data. That's going to change in the next 10 years. And we'll start to have a world where we can define models about what we're doing. Now in the short run, the most important models to build are data models and to describe all of the attributes of the data and their relationships. And that's work that DBT Labs is doing. A number of other companies are doing that. We're taking steps along that way with catalogs. People are trying to build more complete ontologies associated with that. The underlying infrastructure is still super, super nascent. But what I think we'll see is this infrastructure that exists today that's building learned models in the form of machine learning programs. You know, some of these incredible machine learning programs in foundation models like GPT and DALL-E and all of the things that are happening in these global scale models, but also all of that needs to get applied to the domains that are appropriate for a business. And I think we'll see the infrastructure developing for that, that can take this concept of learned models and put it together with more explicitly defined models. And this is where the concept of knowledge graphs come in and then the technology that underlies that to actually implement and execute that, which I believe are relational knowledge graphs. >> Oh, oh wow. There's a lot to unpack there. 
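As a concrete illustration of the metrics-behind-APIs idea Tristan describes above, here is a minimal Python sketch. It is not dbt's actual semantic-layer API; the metric, table, and function names are all hypothetical. The idea is that the metric is declared once, logically, and an application developer asks for it by name with no idea how it is calculated behind the scenes:

```python
# Hypothetical sketch of a semantic layer: a logical metric definition
# compiled into SQL on demand. Not dbt's actual API.
METRICS = {
    "monthly_recurring_revenue": {
        "table": "analytics.subscriptions",
        "expression": "SUM(plan_amount)",
        "dimensions": ["region", "plan_tier"],
        "time_column": "billed_at",
    }
}

def compile_metric_query(metric_name, dimensions=(), grain="month"):
    """Turn a logical metric request into SQL. A real semantic layer would also
    handle governance, caching, and access control here."""
    m = METRICS[metric_name]
    for d in dimensions:
        if d not in m["dimensions"]:
            raise ValueError(f"{d} is not a declared dimension of {metric_name}")
    select_dims = [f"DATE_TRUNC('{grain}', {m['time_column']}) AS period", *dimensions]
    return (
        f"SELECT {', '.join(select_dims)}, {m['expression']} AS {metric_name}\n"
        f"FROM {m['table']}\n"
        f"GROUP BY {', '.join(str(i + 1) for i in range(len(select_dims)))}"
    )

# The application developer requests the metric by name, dimension, and time grain.
print(compile_metric_query("monthly_recurring_revenue", dimensions=["region"], grain="month"))
```

The physical manifestation of the data stays in the warehouse; what gets exposed is the logical manifestation Tristan describes, which is also a first step toward the explicit, model-driven definitions Bob is pointing at.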
>> So let me ask the Columbo question, Tristan, we've been making fun of your youth. We're just, we're just jealous. Columbo, I'll explain it offline maybe. >> I watch Columbo. >> Okay. All right, good. So but today if you think about the application stack and the data stack, which is largely an analytics pipeline. They're separate. Do they, those worlds, do they have to come together in order to achieve Bob's vision? When I talk to practitioners about that, they're like, well, I don't want to complexify the application stack cause the data stack today is so, you know, hard to manage. But do those worlds have to come together? And you know, through that model, I guess abstraction or translation that Bob was just describing, how do you guys think about that? Who wants to take that? >> I think it's inevitable that data and AI are going to become closer together. I think that the infrastructure there has been moving in that direction for a long time. Whether you want to use the Lakehouse portmanteau or not. There's also, there's a next generation of data tech that is still in the like early stage of being developed. There's a company that I love that is essentially Cross Cloud Lambda, and it's just a wonderful abstraction for computing. So I think that, you know, people have been predicting that these worlds are going to come together for a while. A16Z wrote a great post on this back in I think 2020, predicting this, and I've been predicting this since 2020. But what's not clear is the timeline, but I think that this is still just as inevitable as it's been. >> Who's that that does Cross Cloud? >> Let me follow up on. >> Who's that, Tristan, that does Cross Cloud Lambda? Can you name names? >> Oh, they're called Modal Labs. >> Modal Labs, yeah, of course. All right, go ahead, George. >> Let me ask about this vision of trying to put the semantics or the code that represents the business with the data. It gets us to a world that's sort of more data centric, where data's not locked inside or behind the APIs of different applications so that we don't have silos. But at the same time, Bob, I've heard you talk about building the semantics gradually on top of, into a knowledge graph that maybe grows out of a data catalog. And the vision of getting to that point, essentially the enterprise's metadata and then the semantics you're going to add onto it are really stored in something that's separate from the underlying operational and analytic data. So at the same time then why couldn't we gradually build semantics beyond the metric definitions that DBT has today? In other words, you build more and more of the semantics in some layer that DBT defines and that sits above the data management layer, but any requests for data have to go through the DBT layer. Is that a workable alternative? Or where, what type of limitations would you face? >> Well, I think that it is the way the world will evolve is to start with the modern data stack and, you know, which is operational applications going through a data pipeline into some form of data lake, data warehouse, the Lakehouse, whatever you want to call it. And then, you know, this wide variety of analytics services that are built together. To the point that Tristan made about machine learning and data coming together, you see that in every major data cloud provider. Snowflake certainly now supports Python and Java. Databricks is of course building their data warehouse.
Certainly Google, Microsoft and Amazon are doing very, very similar things in terms of building complete solutions that bring together an analytics stack that typically supports languages like Python together with the data stack and the data warehouse. I mean, all of those things are going to evolve, and they're not going to go away because that infrastructure is relatively new. It's just being deployed by companies, and it solves the problem of working with petabytes of data if you need to work with petabytes of data, and nothing will do that for a long time. What's missing is a layer that understands and can model the semantics of all of this. And if you need to, if you want to model all, if you want to talk about all the semantics of even data, you need to think about all of the relationships. You need to think about how these things connect together. And unfortunately, there really is no platform today. None of our existing platforms are ultimately sufficient for this. It was interesting, I was just talking to a customer yesterday, you know, a large financial organization that is building out these semantic layers. They're further along than many companies are. And you know, I asked what they're building it on, and you know, it's not surprising they're using a, they're using combinations of some form of search together with, you know, textual based search together with a document oriented database. In this case it was Cosmos. And that really is kind of the state of the art right now. And yet those products were not built for this. They don't really, they can't manage the complicated relationships that are required. They can't issue the queries that are required. And so a new generation of database needs to be developed. And fortunately, you know, that is happening. The world is developing a new set of relational algorithms that will be able to work with hundreds of different relations. If you look at a SQL database like Snowflake or BigQuery, you know, you get tens of different joins coming together, and that query is going to take a really long time. Well, fortunately, technology is evolving, and it's possible with new join algorithms, worst-case optimal join algorithms, they're called, where you can join hundreds of different relations together and run semantic queries that you simply couldn't run. Now that technology is nascent, but it's really important, and I think that will be a requirement to have this semantically reach its full potential. In the meantime, Tristan can do a lot of great things by building up on what he's got today and solve some problems that are very real. But in the long run I think we'll see a new set of databases to support these models. >> So Tristan, you got to respond to that, right? You got to, so take the example of Snowflake. We know it doesn't deal well with complex joins, but they're, they've got big aspirations. They're building an ecosystem to really solve some of these problems. Tristan, you guys are part of that ecosystem, and others, but please, your thoughts on what Bob just shared. >> Bob, I'm curious if, I would have no idea what you were talking about except that you introduced me to somebody who gave me a demo of a thing and do you not want to go there right now? >> No, I can talk about it. I mean, we can talk about it.
Look, the company I've been working with is Relational AI, and they're doing this work to actually first of all work across the industry with academics and research, you know, across many, many different, over 20 different research institutions across the world to develop this new set of algorithms. They're all fully published, just like SQL, the underlying algorithms that are used by SQL databases are. If you look today, every single SQL database uses a similar set of relational algorithms underneath that. And those algorithms actually go back to system R and what IBM developed in the 1970s. We're just, there's an opportunity for us to build something new that allows you to take, for example, instead of taking data and grouping it together in tables, treat all data as individual relations, you know, a key and a set of values and then be able to perform purely relational operations on it. If you go back to what, to Codd, and what he wrote, he defined two things. He defined a relational calculus and relational algebra. And essentially SQL is a query language that is translated by the query processor into relational algebra. But however, the calculus of SQL is not even close to the full semantics of the relational mathematics. And it's possible to have systems that can do everything and that can store all of the attributes of the data model or ultimately the business model in a form that is much more natural to work with. >> So here's like my short answer to this. I think that we're dealing in different time scales. I think that there is actually a tremendous amount of work to do in the semantic layer using the kind of technology that we have on the ground today. And I think that there's, I don't know, let's say five years of like really solid work that there is to do for the entire industry, if not more. But the wonderful thing about DBT is that it's independent of what the compute substrate is beneath it. And so if we develop new platforms, new capabilities to describe semantic models in more fine grain detail, more procedural, then we're going to support that too. And so I'm excited about all of it. >> Yeah, so interpreting that short answer, you're basically saying, cause Bob was just kind of pointing to you as incremental, but you're saying, yeah, okay, we're applying it for incremental use cases today, but we can accommodate a much broader set of examples in the future. Is that correct, Tristan? >> I think you're using the word incremental as if it's not good, but I think that incremental is great. We have always been about applying incremental improvement on top of what exists today, but allowing practitioners to like use different workflows to actually make use of that technology. So yeah, yeah, we are a very incremental company. We're going to continue being that way. >> Well, I think Bob was using incremental as a pejorative. I mean, I, but to your point, a lot. >> No, I don't think so. I want to stop that. No, I don't think it's pejorative at all. I think incremental, incremental is usually the most successful path. >> Yes, of course. >> In my experience. >> We agree, we agree on that. >> Having tried many, many moonshot things in my Microsoft days, I can tell you that being incremental is a good thing. And I'm a very big believer that that's the way the world's going to go. I just think that there is a need for us to build something new and that ultimately that will be the solution. 
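As a toy illustration of two of the ideas Bob raises here (holding data as individual relations rather than wide tables, and joining several relations one attribute at a time, in the spirit but not the letter of worst-case optimal join algorithms), here is a minimal Python sketch. It is not Relational AI's engine, and the relations are invented:

```python
# Toy illustration: data kept as individual binary relations, and a three-way
# join evaluated one variable at a time. Not Relational AI's engine.
follows = {("ann", "bob"), ("bob", "cal"), ("ann", "cal"), ("cal", "ann")}   # R(a, b)
endorses = {("bob", "cal"), ("cal", "ann"), ("cal", "dan")}                  # S(b, c)
works_with = {("ann", "cal"), ("ann", "dan"), ("bob", "ann")}                # T(a, c)

def triangle_join(R, S, T):
    """Find all (a, b, c) with R(a, b), S(b, c), T(a, c), binding one variable
    at a time instead of materializing a full pairwise join first."""
    results = []
    for a in {x for x, _ in R} & {x for x, _ in T}:            # bind a
        for b in {y for x, y in R if x == a}:                  # bind b given a
            for c in {z for y, z in S if y == b}:              # bind c given b
                if (a, c) in T:                                # check the last relation
                    results.append((a, b, c))
    return results

print(triangle_join(follows, endorses, works_with))
```

The intuition is that binding one variable at a time lets every relation constrain the search early, instead of materializing a large pairwise intermediate result that the last relation then has to filter; the worst-case optimal join literature makes that idea rigorous for queries over many relations.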
Now you can argue whether it's two years, three years, five years, or 10 years, but I'd be shocked if it didn't happen in 10 years. >> Yeah, so we all agree that incremental is less disruptive. Boom, but Tristan, you're, I think I'm inferring that you believe you have the architecture to accommodate Bob's vision, and then Bob, and I'm inferring from Bob's comments that maybe you don't think that's the case, but please. >> No, no, no. I think that, so Bob, let me put words into your mouth and you tell me if you disagree, DBT is completely useless in a world where a large scale cloud data warehouse doesn't exist. We were not able to bring the power of Python to our users until these platforms started supporting Python. Like DBT is a layer on top of large scale computing platforms. And to the extent that those platforms extend their functionality to bring more capabilities, we will also service those capabilities. >> Let me try and bridge the two. >> Yeah, yeah, so Bob, Bob, Bob, do you concur with what Tristan just said? >> Absolutely, I mean there's nothing to argue with in what Tristan just said. >> I wanted. >> And it's what he's doing. It'll continue to, I believe he'll continue to do it, and I think it's a very good thing for the industry. You know, I'm just simply saying that on top of that, I would like to provide Tristan and all of those who are following similar paths to him with a new type of database that can actually solve these problems in a much more architected way. And when I talk about Cosmos with something like Mongo or Cosmos together with Elastic, you're using Elastic as the join engine, okay. That's the purpose of it. It becomes a poor man's join engine. And I kind of go, I know there's a better answer than that. I know there is, but that's kind of where we are state of the art right now. >> George, we got to wrap it. So give us the last word here. Go ahead, George. >> Okay, I just, I think there's a way to tie together what Tristan and Bob are both talking about, and I want them to validate it, which is for five years we're going to be adding or some number of years more and more semantics to the operational and analytic data that we have, starting with metric definitions. My question is for Bob, as DBT accumulates more and more of those semantics for different enterprises, can that layer not run on top of a relational knowledge graph? And what would we lose by not having, by having the knowledge graph store sort of the joins, all the complex relationships among the data, but having the semantics in the DBT layer? >> Well, I think this, okay, I think first of all that DBT will be an environment where many of these semantics are defined. The question we're asking is how are they stored and how are they processed? And what I predict will happen is that over time, as companies like DBT begin to build more and more richness into their semantic layer, they will begin to experience challenges that customers want to run queries, they want to ask questions, they want to use this for things where the underlying infrastructure becomes an obstacle. I mean, this has happened in always in the history, right? I mean, you see major advances in computer science when the data model changes. And I think we're on the verge of a very significant change in the way data is stored and structured, or at least metadata is stored and structured. Again, I'm not saying that anytime in the next 10 years, SQL is going to go away. 
In fact, more SQL will be written in the future than has been written in the past. And those platforms will mature to become the engines, the slicer dicers of data. I mean that's what they are today. They're incredibly powerful at working with large amounts of data, and that infrastructure is maturing very rapidly. What is not maturing is the infrastructure to handle all of the metadata and the semantics that that requires. And that's where I say knowledge graphs are what I believe will be the solution to that. >> But Tristan, bring us home here. It sounds like, let me put pause at this, is that whatever happens in the future, we're going to leverage the vast system that has become cloud that we're talking about a supercloud, sort of where data lives irrespective of physical location. We're going to have to tap that data. It's not necessarily going to be in one place, but give us your final thoughts, please. >> 100% agree. I think that the data is going to live everywhere. It is the responsibility for both the metadata systems and the data processing engines themselves to make sure that we can join data across cloud providers, that we can join data across different physical regions and that we as practitioners are going to kind of start forgetting about details like that. And we're going to start thinking more about how we want to arrange our teams, how does the tooling that we use support our team structures? And that's when data mesh I think really starts to get very, very critical as a concept. >> Guys, great conversation. It was really awesome to have you. I can't thank you enough for spending time with us. Really appreciate it. >> Thanks a lot. >> All right. This is Dave Vellante for George Gilbert, John Furrier, and the entire Cube community. Keep it right there for more content. You're watching SuperCloud2. (upbeat music)
Action Item Quick Take | George Gilbert - Feb 2018
(upbeat music) >> Hi, this is Peter Burris with another Wikibon Action Item Quick Take. George Gilbert, everybody's talking about AI, ML, DL, as though, like always, the stack's going to be highly disaggregated. What's really happening? >> Well, the key part for going really mainstream is that we're going to have these embedded in the fabric of applications, high volume applications. And right now, because it's so bespoke, it can only really be justified in strategic applications. But the key scarce resource of the data scientists, we can look at them as the new class of developer, but they're very different from the old class of developer. I mean, they need entirely different schooling and training and tools. So, the closest we can get to the completely bespoke apps are at the big tech companies for the most part, or tech-centric companies like ad tech and fintech. But beyond that, when you're trying to get these out to more mainstream, but still sophisticated, customers, we have platforms like C3 or IBM's Watson IoT, where they're templates, but you work intensively between the vendor and the customer. It's going to be a while before we see them as widespread components of legacy enterprise apps, because those apps actually are keeping the vendors and the customers busy trying to move them to the cloud. They were heavily customized and you can't really embed the machine learning apps in those while they're sort of, so customized, because they need a certain amount of data, and they evolve very quickly, whereas the systems of record are very, very rigid. >> All right, once again, thank you very much George. This has been Peter Burris with another Wikibon Action Item Quick Take. (upbeat music)
Seth Myers, Demandbase | George Gilbert at HQ
>> This is George Gilbert, we're on the ground at Demandbase, the B2B CRM company, based on AI, one of uh, a very special company that's got some really unique technology. We have the privilege to be with Seth Myers today, Senior Data Scientist and resident wizard, and who's going to take us on a journey through some of the technology Demandbase is built on, and some of the technology coming down the road. So Seth, welcome. >> Thank you very much for having me. >> So, we talked earlier with Aman Naimat, Senior VP of Technology, and we talked about some of the functionality in Demandbase, and how it's very flexible, and reactive, and adaptive in helping guide, or react to a customer's journey, through the buying process. Tell us about what that journey might look like, how it's different, and the touchpoints, and the participants, and then how your technology rationalizes that, because we know, old CRM packages were really just lists of contact points. So this is something very different. How's it work? >> Yeah, absolutely, so at the highest level, each customer's going to be different, each customer's going to make decisions and look at different marketing collateral, and respond to different marketing collateral in different ways, you know, as the companies get bigger, and their products they're offering become more sophisticated, that's certainly the case, and also, sales cycles take a long time. You're engaged with an opportunity over many months, and so there's a lot of touchpoints, there's a lot of planning that has to be done, so that actually offers a huge opportunity to be solved with AI, especially in light of recent developments in this thing called reinforcement learning. So reinforcement learning is basically machine learning that can think strategically, they can actually plan ahead in a series of decisions, and it's actually technology behind AlphaGo which is the Google technology that beat the best Go players in the world. And what we basically do is we say, "Okay, if we understand "you're a customer, we understand the company you work at, "we understand the things they've been researching elsewhere "on third party sites, then we can actually start to predict "about content they will be likely to engage with." But more importantly, we can start to predict content they're more likely to engage with next, and after that, and after that, and after that, and so what our technology does is it looks at all possible paths that your potential customer can take, all the different content you could ever suggest to them, all the different routes they will take, and it looks at ones that they're likely to follow, but also ones they're likely to turn them into an opportunity. And so we basically, in the same way Google Maps considers all possible routes to get you from your office to home, we do the same, and we choose the one that's most likely to convert the opportunity, the same way Google chooses the quickest road home. >> Okay, this is really, that's a great example, because people can picture that, but how do you, how do you know what's the best path, is it based on learning from previous journeys from customers? >> Yes. >> And then, if you make a wrong guess, you sort of penalize the engine and say, "Pick the next best, "what you thought was the next best path." 
>> Absolutely, so the way, the nuts and bolts of how it works is we start working with our clients, and they have all this data of different customers, and how they've engaged with different pieces of content throughout their journey, and so the machine learning model, what it's really doing at any moment in time, given any customer in any stage of the opportunity that they find themselves in, it says, what piece of content are they likely to engage with next, and that's based on historical training data, if you will. And then once we make that decision on a step-by-step basis, then we kind of extrapolate, and we basically say, "Okay, if we showed them this page, or if they engage with "this material, what would that do, what situation would "we find them in at the next step, and then what would "we recommend from there, and then from there, "and then from there," and so it's really kind of learning the right move to make at each time, and then extrapolating that all the way to the opportunity being closed. >> The picture that's in my mind is like, the Deep Blue, I think it was chess, where it would map out all the potential moves. >> Very similar, yeah. >> To the end game. >> Very similar idea. >> So, what about if you're trying to engage with a customer across different channels, and it's not just web content? How is that done? >> Well, that's something that we're very excited about, and that's something that we're currently really starting to devote resources to. Right now, we already have a product live that's focused on web content specifically, but yeah, we're working on kind of a multi-channel type solution, and we're all pretty excited about it. >> Okay so, obviously you can't talk too much about it. Can you tell us what channels that might touch? >> I might have to play my cards a little close to my chest on this one, but I'll just say we're excited. >> Alright. Well I guess that means I'll have to come back. >> Please, please. >> So, um, tell us about the personalized conversations. Is the conversation just another way of saying, this is how we're personalizing the journey? Or is there more to it than that? >> Yeah, it really is about personalizing the journey, right? Like you know, a lot of our clients now have a lot of sophisticated marketing collateral, and a lot of time and energy has gone into developing content that different people find engaging, that kind of positions products towards pain points, and all that stuff, and so really there's so much low-hanging fruit by just organizing and leveraging all of this material, and actually forming the conversation through a series of journeys through that material. >> Okay, so, Aman was telling us earlier that we have so many sort of algorithms, they're all open source, or they're all published, and they're only as good as the data you can apply them to. So, tell us, where do companies, startups, you know, not the Googles, Microsofts, Amazons, where do they get their proprietary information? Is it that you have algorithms that now are so advanced that you can refine raw information into proprietary information that others don't have? >> Really I think it comes down to, our competitive advantage I think is largely in the source of our data, and so, yes, you can build more and more sophisticated algorithms, but again, you're starting with a public data set, you'll be able to derive some insights, but there will always be a path to those datasets for, say, a competitor. 
For example, we're currently tracking about 700 billion web interactions a year, and then we're also able to attribute those web interactions to companies, meaning the employees at those companies involved in those web interactions, and so that's able to give us an insight that no amount of public data or processing would ever really be able to achieve. >> How do you, Aman started to talk to us about how, like there were DNS, reverse DNS registries. >> Reverse IP lookups, yes. >> Yeah, so how are those, if they're individuals within companies, and then the companies themselves, how do you identify them reliably? >> Right, so reverse IP lookup is, we've been doing this for years now, and so we've kind of developed a multi-source solution, so reverse IP lookups is a big one. Also machine learning, you can look at traffic coming from an IP address, and you can start to make some very informed decisions about what the IP address is actually doing, who they are, and so if you're looking at, at the account level, which is what we're tracking at, there's a lot of information to be gleaned from that kind of information. >> Sort of the way, and this may be a weird-sounding analogy, but the way a virus or some piece of malware has a signature in terms of its behavior, you find signatures in terms of users associated with an IP address. >> And we certainly don't de-anonymize individual users, but if we're looking at things at the account level, then you know, the bigger the data, the more signal you can infer, and so if we're looking at a company-wide usage of an IP address, then you can start to make some very educated guesses as to who that company is, the things that they're researching, what they're in market for, that type of thing. >> And how do you find out, if they're not coming to your site, and they're not coming to one of your customer's sites, how do you find out what they're touching? >> Right, I mean, I can't really go into too much detail, but a lot of it comes from working with publishers, and a lot of this data is just raw, and it's only because we can identify the companies behind these IP addresses, that we're able to actually turn these web interactions into insights about specific companies. >> George: Sort of like how advertisers or publishers would track visitors across many, many sites, by having agreements. >> Yes. Along those lines, yeah. >> Okay. So, tell us a little more about natural language processing, I think where most people have assumed or have become familiar with it is with the B2C capabilities, with the big internet giants, where they're trying to understand all language. You have a more well-scoped problem, tell us how that changes your approach. >> So a lot of really exciting things are happening in natural language processing in general, and the research, and right now in general, it's being measured against this yardstick of, can it understand languages as good as a human can, obviously we're not there yet, but that doesn't necessarily mean you can't derive a lot of meaningful insights from it, and the way we're able to do that is, instead of trying to understand all of human language, let's understand very specific language associated with the things that we're trying to learn. 
So obviously we're a B2B marketing company, so it's very important to us to understand what companies are investing in other companies, what companies are buying from other companies, what companies are suing other companies, and so if we said, okay, we only want to be able to infer a competitive relationship between two businesses in an actual document, that becomes a much more solvable and manageable problem, as opposed to, let's understand all of human language. And so we actually started off with these kind of open source solutions, with some of these proprietary solutions that we paid for, and they didn't work because their scope was this broad, and so we said, okay, we can do better by just focusing in on the types of insights we're trying to learn, and then work backwards from them. >> So tell us, how much of the algorithms that we would call building blocks for what you're doing, and others, how much of those are all published or open source, and then how much is your secret sauce? Because we talk about data being a key part of the secret sauce, what about the algorithms? >> I mean yeah, you can treat the algorithms as tools, but you know, a bag of tools a product does not make, right? So our secret sauce becomes how we use these tools, how we deploy them, and the datasets we put them again. So as mentioned before, we're not trying to understand all of human language, actually the exact opposite. So we actually have a single machine learning algorithm that all it does is it learns to recognize when Amazon, the company, is being mentioned in a document. So if you see the word Amazon, is it talking about the river, is it talking about the company? So we have a classifier that all it does is it fires whenever Amazon is being mentioned in a document. And that's a much easier problem to solve than understanding, than Siri basically. >> Okay. I still get rather irritated with Siri. So let's talk about, um, broadly this topic that sort of everyone lays claim to as their great higher calling, which is democratizing machine learning and AI, and opening it up to a much greater audience. Help set some context, just the way you did by saying, "Hey, if we narrow the scope of a problem, "it's easier to solve." What are some of the different approaches people are taking to that problem, and what are their sweet spots? >> Right, so the the talk of the data science community, talking machinery right now, is some of the work that's coming out of DeepMind, which is a subsidiary of Google, they just built AlphaGo, which solved the strategy game that we thought we were decades away from actually solving, and their approach of restricting the problem to a game, with well-defined rules, with a limited scope, I think that's how they're able to propel the field forward so significantly. They started off by playing Atari games, then they moved to long term strategy games, and now they're doing video games, like video strategy games, and I think the idea of, again, narrowing the scope to well-defined rules and well-defined limited settings is how they're actually able to advance the field. >> Let me ask just about playing the video games. I can't remember Star... >> Starcraft. >> Starcraft. Would you call that, like, where the video game is a model, and you're training a model against that other model, so it's almost like they're interacting with each other. 
>> Right, so it really comes down, you can think of it as pulling levers, so you have a very complex machine, and there's certain levers you can pull, and the machine will respond in different ways. If you're trying to, for example, build a robot that can walk amongst a factory and pick out boxes, like how you move each joint, what you look around, all the different things you can see and sense, those are all levers to pull, and that gets very complicated very quickly, but if you narrow it down to, okay, there's certain places on the screen I can click, there's certain things I can do, there's certain inputs I can provide in the video game, you basically limit the number of levers, and then optimizing and learning how to work those levers is a much more scoped and reasonable problem, as opposed to learn everything all at once. >> Okay, that's interesting, now, let me switch gears a little bit. We've done a lot of work at WikiBound about IOT and increasingly edge-based intelligence, because you can't go back to the cloud for your analytics for everything, but one of the things that's becoming apparent is, it's not just the training that might go on in a cloud, but there might be simulations, and then the sort of low-latency response is based on a model that's at the edge. Help elaborate where that applies and how that works. >> Well in general, when you're working with machine learning, in almost every situation, training the model is, that's really the data-intensive process that requires a lot of extensive computation, and that's something that makes sense to have localized in a single location which you can leverage resources and you can optimize it. Then you can say, alright, now that I have this model that understands the problem that's trained, it becomes a much simpler endeavor to basically put that as close to the device as possible. And so that really is how they're able to say, okay, let's take this really complicated billion-parameter neural network that took days and weeks to train, and let's actually derive insights at the level, right at the device level. Recent technology though, like I mentioned deep learning, that in itself, just the actual deploying the technology creates new challenges as well, to the point that actually Google invented a new type of chip to just run... >> The tensor processing. >> Yeah, the TPU. The tensor processing unit, just to handle what is now a machine learning algorithm so sophisticated that even deploying it after it's been trained is still a challenge. >> Is there a difference in the hardware that you need for training vs. inferencing? >> So they initially deployed the TPU just for the sake of inference. In general, the way it actually works is that, when you're building a neural network, there is a type of mathematical operation to do a whole bunch, and it's based on the idea of working with matrices and it's like that, that's still absolutely the case with training as well as inference, where actually, querying the model, but so if you can solve that one mathematical operation, then you can deploy it everywhere. >> Okay. So, one of our CTOs was talking about how, in his view, what's going to happen in the cloud is richer and richer simulations, and as you say, the querying the model, getting an answer in realtime or near realtime, is out on the edge. What exactly is the role of the simulation? Is that just a model that understands time, and not just time, but many multiple parameters that it's playing with? 
>> Right, so simulations are particularly important in taking us back to reinforcement learning, where you basically have many decisions to make before you actually see some sort of desirable or undesirable outcome, and so, for example, the way AlphaGo trained itself is basically by running simulations of the game being played against itself, and really what that simulations are doing is allowing the artificial intelligence to explore the entire possibilities of all games. >> Sort of like WarGames, if you remember that movie. >> Yes, with uh... >> Matthew Broderick, and it actually showed all the war game scenarios on the screen, and then figured out, you couldn't really win. >> Right, yes, it's a similar idea where they, for example in Go, there's more board configurations than there are atoms in the observable universe, and so the way Deep Blue won chess is basically, more or less explore the vast majority of chess moves, that's really not the same option, you can't really play that same strategy with AlphaGo, and so, this constant simulation is how they explore the meaningful game configurations that it needed to win. >> So in other words, they were scoped down, so the problem space was smaller. >> Right, and in fact, basically one of the reasons, like AlphaGo was really kind of two different artificial intelligences working together, one that decided which solutions to explore, like which possibilities it should pursue more, and which ones not to, to ignore, and then the second piece was, okay, given the certain board configuration, what's the likely outcome? And so those two working in concert, one that narrows and focuses, and one that comes up with the answer, given that focus, is how it was actually able to work so well. >> Okay. Seth, on that note, that was a very, very enlightening 20 minutes. >> Okay. I'm glad to hear that. >> We'll have to come back and get an update from you soon. >> Alright, absolutely. >> This is George Gilbert, I'm with Seth Myers, Senior Data Scientist at Demandbase, a company I expect we'll be hearing a lot more about, and we're on the ground, and we'll be back shortly.
Aman Naimat, Demandbase, Chapter 2 | George Gilbert at HQ
>> And we're back, this is George Gilbert from Wikibon, and I'm here with Aman Naimat at Demandbase, the pioneers in the next gen AI generation of CRM. So Aman, let's continue where we left off. So we're talking about natural language processing, and I think most people are familiar with it more on the B to C technology, where the big internet providers have sort of accumulated a lot of voice data and have learned how to process it and convert it into text. So tell us how B to B NLP is different, to use a lot of acronyms. In other words, how you're using it to build up a map of relationships between businesses. >> Right, yeah, we call it the demand graph. So it's an interesting question, because firstly, it turns out that, while very different, B to B is also, the language is quite boring. It doesn't evolve as fast as consumer concepts. And so it makes the problem much more approachable from a language understanding point of view. So natural language processing or natural language understanding is all about how machines can understand and store and take action on language. So while we were working on this four or five years ago, and that's my background as well, it turned out the problem was simpler, because human language is very rich, and natural language processing converting voice to text is trivial compared to understanding meaning of things and words, which is much more difficult. Or even the sense of the word, apparently in English each word has six meanings, right? We call them word senses. So the problem was only simpler because B to B language doesn't tend to evolve as fast as regular language, because terms stick in an industry. The challenge with B to B and why it was different is that each industry or sub-industry has a very specific language and jargon and acronyms. So to really understand that industry, you need to come from that industry. So if you go back to the CRM example of what happened 10, 20 years ago, you would have a sales person that would come from that industry if you wanted to sell into it. And that still happens in some traditional companies, right? So the idea was to be able to replicate the knowledge that they would have as if they came from that industry. So it's the language, the vocabularies, and then ultimately have a way of storing and taking action on it. It's very analogous to what Google had done with Knowledge Graph. >> Alright, so two questions I guess. First is, it sounds almost like a translation problem, in the sense that you have some base language primitives, like partner, supplier, competitor, customer. But that the language in each industry is different, and so you have to map those down to those sort of primitives. So tell us the process. You don't have on staff people who translate from every industry. >> I mean that was the whole, writing logical rules or expressions for language, which use conventional good old fashioned AI. >> You mean this was the rules-based knowledge engineering? >> That's right. And that clearly did not succeed, because it is impossible to do it. >> The old quip which was, one researcher said, "Every time I fired a rules engineer, "my accuracy score would go up." (chuckles) >> That's right, and now the problem is because language is evolving, and the context is so different. So even pharmaceutical companies in the US or in the Bay Area would use different language than pharma in Europe or in Switzerland. And so it's just impossible to be able to quantify the variations. >> George: To do it manually. 
>> To do it manually, it's impossible. It's certainly not possible for a small startup. And we did try having it be generated. In the early days we used to have crowdsource workers validate the machine. But it turned out that they couldn't do it either, because they didn't understand the pharmaceutical language either, right? So in the end, the only way to do that was to have some sort of model and some seed data to be able to validate it, or to hire experts and to have small samples of data to validate. So going back to the graph, right, it turns out that when we have seen sophisticated AI work, you know, towards complex problems, so for example predicting your next connection on LinkedIn, or your next friend, or what ads should you see on Facebook, they have used network-based data, social graph data, or in the case of Google, it's the Knowledge Graph, of how things are connected. And somehow machine learning and AI systems based on network data tend to be more powerful and more intuitive than other types of models. >> So OK, when you say model, help us with an example of, you're representing a business and who it's connected to and its place in the world. >> So the demand graph is basically as Demandbase, who are our customers, who are their partners, who are their suppliers, who are their competitors. And utilizing that network of companies in a manner that we have network of friends on LinkedIn or Facebook. And it turns out that businesses are extremely social in nature. In fact, we found out that the connections between companies have more signal, and are more predictive of acquisition or predicting the next customer, than even the Facebook social graph. So it's much easier to utilize the business graph, the B to B business graph, to predict the next customer, than to say, predict your next friend on Facebook. >> OK, so that's a perfect analogy. So tell us about the raw material you churn through on the web, and then how you learn what that terminology might be. You've boot-strapped a little bit, now you have all this data, and you have to make sense out of new terms, and then you build this graph of who this business is related to. >> That's right, and the hardest part is to be able to handle rumors and to be able to handle jokes, like, "Isn't it time for Microsoft to just buy Salesforce?" Question mark, smiley face. You know, so it's a challenging problem. But we were lucky that business language and business press is definitely more boring than, you know, people talking about movies. >> George: Or Reddit. >> Or Reddit, right. So the way we work is we process the entire business internet, or the entire internet. And initially we used to crawl it ourselves, but soon realized that Common Crawl, which is an open source foundation that has crawled the internet and put at least a large chunk of it, and that really enabled us to stop the crawling. And we read the entire internet and look at, ultimately we're interested in businesses, 'cause that's the world we are, in business, B to B marketing and B to B sales. We look at wherever there's a company mentioned or a business person or business title mentioned, and then ignore everything else. 'Cause if it doesn't have a company or a business person, we don't care. Right, so, or a business product. So we read the entire internet, and try to then infer that this is, Amazon is mentioned in it, then we figure out, is it Amazon the company, or is it Amazon the river? So that's problem number one. So we call it the entity linking problem. 
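As a toy illustration of the entity-linking step just mentioned, deciding whether a mention of "Amazon" means the company or the river, a minimal sketch might compare the surrounding words to context terms for each candidate. The candidate entities and context terms below are invented, not drawn from any real system.

```python
# Hypothetical candidates and context vocabularies for disambiguation.
KNOWN_ENTITIES = {
    "Amazon (company)": {"retail", "aws", "e-commerce", "cloud", "seller"},
    "Amazon (river)": {"river", "rainforest", "brazil", "basin", "wildlife"},
}

def link_entity(mention, sentence):
    # Pick the candidate whose context terms overlap most with the sentence.
    words = set(sentence.lower().split())
    best, best_overlap = None, 0
    for entity, context_terms in KNOWN_ENTITIES.items():
        overlap = len(words & context_terms)
        if overlap > best_overlap:
            best, best_overlap = entity, overlap
    return best or f"{mention} (unresolved)"

print(link_entity("Amazon", "Amazon reported strong cloud revenue from AWS"))
print(link_entity("Amazon", "The Amazon river basin supports rainforest wildlife"))
```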
And then we try to understand and piece together the various expressions of relationships between companies expressed in text. It could be a press release, it could be a competitive analysis, it could be announcement of a new product. It could be a supply chain relationship. It could be a rumor. And then it also turns out the internet's very noisy, so we look at corroboration across multiple disparate sources-- >> George: Interesting, to decide-- >> Is it true? >> George: To signal is it real. >> Right, yeah, 'cause there's a lot of fake news out there. (George laughs) So we look at corroboration and the sources to be able to infer if we can have confidence in this. >> I can imagine this could be applied to-- >> A lot of other problems. >> Political issues. So OK, you've got all these sources, give us some specific examples of feeds, of sources, and then help us understand. 'Cause I don't think we've heard a lot about the notion of boot-strapping, and it sounds like you're generalizing, which is not something that most of us are familiar with who have a surface-level familiarity with machine learning. >> I think there was a lot of research like, not to credit Google too much, but... Boot-strapping methods were used by Sergei I think was the first person, and then he gave up 'cause they founded Google and they moved on. And since then in 2003, 2004, there was a lot of research around this topic. You know, and it's in the genre of unsupervised machine learning models. And in the real world, because there's less labeled data, we tend to find that to be an extremely effective method, to learn language and obviously now with deep learning, it's also being utilized more, unsupervised methods. But the idea is really to, and this was around five years ago when we started building this graph, and I obviously don't know how the Google Knowledge Graph is built, but I can assume it's a similar technique. We don't tend to talk about how commercial products work that much. But the idea is basically to generalize models or learn from a small seed, so let's say I put in seed like Nike and Adidas, and say they compete, right? And then if you look at the entire internet and look at all the expressions of how Nike and Adidas are expressed together in language, it could be, you know, "I think "Nike shoes are better than Adidas." >> Ah, so it's not just that you find an opinion that they're better than, but you find all the expressions that explain that they're different and they're competition. >> That's right. But we also find cases where somebody's saying, "I bought Nike and Adidas," or, "Nike and Adidas shoes are sold here." So we have to be able to be smart enough to discern when it's something else and not competition. >> OK, so you've told us how this graph gets built out. So the suppliers, the partners, the customers, the competitors, now you've got this foundation-- >> And people and products as well. >> OK, people, products. You've got this really rich foundation. Now you build and application on top of it. Tell us about CRM with that foundation. >> Yeah, I mean we have the demand graph, in which we tie in also things around basic data that you could find from graphics and intent that we've also built. But it also turns out that the knowledge graph itself, our initial intuition was that we'll just expose this to end users, and they'll be able to figure it out. But it was just too complicated. 
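Stepping back to the seed-based bootstrapping Aman describes, a minimal sketch of the loop (seed pairs, learned patterns, new candidate pairs) might look like the following. The corpus, the seed pair, and the patterns are invented for illustration.

```python
import re

# Start from one pair known to compete, learn the textual patterns they
# co-occur in, then reuse those patterns to propose new pairs.
corpus = [
    "I think Nike shoes are better than Adidas this season",
    "Puma shoes are better than Reebok in most reviews",
    "Nike and Adidas shoes are sold here",
]
seed_pairs = {("Nike", "Adidas")}

def learn_patterns(sentences, pairs):
    patterns = set()
    for a, b in pairs:
        for s in sentences:
            match = re.search(re.escape(a) + r"\s+(.*?)\s+" + re.escape(b), s)
            if match:
                # Keep the words between the two entities as a crude pattern.
                patterns.add(match.group(1))
    return patterns

def apply_patterns(sentences, patterns):
    candidates = set()
    for s in sentences:
        for p in patterns:
            m = re.search(r"(\w+)\s+" + re.escape(p) + r"\s+(\w+)", s)
            if m:
                candidates.add((m.group(1), m.group(2)))
    return candidates

patterns = learn_patterns(corpus, seed_pairs)
new_pairs = apply_patterns(corpus, patterns)
print(patterns)   # e.g. {'shoes are better than', 'and'}
print(new_pairs)  # includes ('Puma', 'Reebok'), found via the learned pattern
```

The weak "and" pattern learned from "Nike and Adidas shoes are sold here" is exactly the kind of noisy co-occurrence the conversation flags, which is why corroboration across many independent sources matters before a relationship goes into the graph.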
It really needed another level of machinery and AI on top to take advantage of the graph, and to be able to build prescriptive actions. And action could be, or to solve a business problem. A problem could be, I'm an IOT startup, I'm looking for manufacturing companies who will buy my product. Or it could be, I am a venture capital firm, I want to understand what other venture capital firms are investing in. Or, hey, I'm Tesla, and I'm looking for a new supplier for the new Tesla screen. Or you know, things of that nature. So then we apply and build specific models, more machine learning, or layers of machine learning, to then solve specific business problems. Like the reinforcement learning to understand next best action. >> And are these models associated with one of your customers? >> No, they're general purpose, they're packaged applications. >> OK, tell us more, so what was the base level technology that you started with in terms of the being able to manage a customer conversation, a marketing conversation, and then how did that get richer over time? >> Yeah, I mean we take our proprietary data sets that we've accumulated over the years and manufactured over the years, and then co-mingle with customer data, which we keep private, 'cause they own the data. And the technology is generic, but you're right, the model being generated by the machine is specific to every customer. So obviously the next best action model for a pharmaceutical company is based on doctors visiting, and is this person an oncologist, or what they're researching online. And that model is very different than a model for Demandbase for example, or Salesforce. >> Is it that the algorithm's different, or it's trained on different data? >> It's trained on different data. It's the same code, I mean we only have 20, 30 data scientists, so we're obviously not going to build custom code for... So the idea is it's the same model, but the same meta model is trained on different data. So public data, but also customers' private data. >> And how much does the customer, let's say your customer's Tesla, how much of it is them running some of their data through this boot-strapping process, versus how much of it is, your model is set up and it just automatically once you've boot-strapped it, it automatically starts learning from the interactions with the Tesla, with Tesla itself from all the different partners and customers? >> Right, I think you know, we have found, most startups are just learning over small data sets, which are customer-centric. What we have found is real magic happens when you take private data and combine it with large amounts of public data. So at Demandbase, we have massive amounts of public and proprietary data. And then we plug in, and we have to tell you that our client is Tesla, so it understands the localized graph, and knows the Tesla ecosystem, and that's based on public data sets and our proprietary data. Then we also bring in your private slice whenever possible. >> George: Private...? >> Slice of data. So we have code that can plug into your web site, and then start understanding interactions that your customers are having. And then based on that, we're able to train our models. As much as possible, we try to automate the data capture process, so in essence using a sensor or using a pixel on your web site, and then we take that private stream of data and include it in our graph and merge it in, and that's where we find... Our data by itself is not as powerful as our data mixed with your private data. 
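A minimal sketch of that "same code, different data" setup, assuming invented features, labels, and customer names rather than Demandbase's actual schema, could look like this: one shared training routine, one fitted model per customer, each trained on public rows co-mingled with that customer's private slice.

```python
from sklearn.linear_model import LogisticRegression

def train_customer_model(public_rows, private_rows):
    rows = public_rows + private_rows            # co-mingle public + private
    X = [r["features"] for r in rows]
    y = [r["converted"] for r in rows]
    return LogisticRegression().fit(X, y)

public = [
    {"features": [1, 0], "converted": 1},
    {"features": [0, 1], "converted": 0},
]
private_slices = {
    "tesla": [{"features": [1, 1], "converted": 1}],
    "pharma_co": [{"features": [0, 0], "converted": 0}],
}

# Same code, different data: each customer gets its own fitted model.
models = {name: train_customer_model(public, rows)
          for name, rows in private_slices.items()}
print(models["tesla"].predict([[1, 0]]))
```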
>> So I guess one way to think about it would be, there's a skeletal graph, and that may be sounding too minimalistic, there's a graph. But let's say you take Tesla as the example, you tell them what data you need from them, and that trains the meta models, and then it fleshes out the graph of the Tesla ecosystem. >> Right, whatever data we couldn't get or infer, from the outside. And we have a lot of proprietary data, where we see online traffic, business traffic, what people are reading, who's interested in what, for hundreds of millions of people. We have developed that technology. So we know a lot without actually getting people's private slice. But you know, whenever possible, we want the maximum impact. >> So... >> It's actually simple, and let's divorce the words graphs for a second. It's really about, let's say that I know you, right, and there's some information you can tell me about you. But imagine if I google your name, and I read every document about you, every video you have produced, every blog you have written, then I have the best of both knowledge, right, your private data from maybe your social graph on Facebook, and then your public data. And then if I knew, you know... If I partnered with Forbes and they told me you logged in and read something on Forbes, then they'll get me that data, so now I really have a deep understanding of what you're interested in, who you are, what's your language, you know, what are you interested in. It's that, sort of simplified, but similar, at a much larger scale. >> Alright, let's take a pause at this point and then we'll come back with part three. >> Excellent.
Aman Naimat, Demandbase, Chapter 1 | George Gilbert at HQ
>> Hi, this is George Gilbert. We have an extra-special guest today on our CUBEcast, Aman Naimat, Senior Vice President and CTO of Demandbase started with a five-person startup, Spiderbook. Almost like a reverse IPO, Demandbase bought Spiderbook, but it sounds like Spiderbook took over Demandbase. So Aman, welcome. >> Thank you, excited to be here. Always good to see you. >> So, um, Demandbase is a Next Gen CRM program. Let's talk about, just to set some context. >> Yes. >> For those who aren't intimately familiar with traditional CRM, what problems do they solve? And how did they start, and how did they evolve? >> Right, that's a really good question. So, for the audience, CRM really started as a contact manager, right? And it was replicating what a salesperson did in their own private notebook, writing contact phone numbers in an electronic version of it, right? So you had products that were really built for salespeople on an individual basis. But it slowly evolved, particularly with Siebel, into more of a different twist. It evolved into more of a management tool or reporting tool because Tom Siebel was himself a sales manager, ran a sales team at Oracle. And so, it actually turned from an individual-focused product to an organization management reporting product. And I've been building this stuff since I was 19. And so, it's interesting that, you know, the products today, we're going, actually pivoting back into products that help salespeople or help individual marketers and add value and not just focus on management reporting. >> That's an interesting perspective. So it's more now empowering as opposed to, sort of, reporting. >> Right, and I think some of it is cultural influence. You know, over the last decade, we have seen consumer apps actually take a much more, sort of predominant position rather than in the traditional, earlier in the 80s and 90s, the advanced applications were corporate applications, your large computers and companies. But over the last year, as consumer technology has taken off, and actually, I would argue has advanced more than even enterprise technology, so in essence, that's influencing the business. >> So, even ERP was a system of record, which is the state of the enterprise. And this is much more an organizational productivity tool. >> Right. >> So, tell us now, the mental leap, the conceptual leap that Demandbase made in terms of trying to solve a different problem. >> Right, so, you know, Demandbase started on the premise or around marketing automation and marketing application which was around identifying who you are. As we move towards more digital transaction and Web was becoming the predominant way of doing business, as people say that's 70 to 80 percent of all businesses start using online digital research, there was no way to know it, right? The majority of the Internet is this dark, unknown place. You don't know who's on your website, right? >> You're referring to the anonymity. >> Exactly. >> And not knowing who is interacting with you until very late. >> Exactly, and you can't do anything intelligent if you don't know somebody, right? So if you didn't know me, you couldn't really ask. What will you do? You'll ask me stupid questions around the weather. And really, as humans, I can only communicate if you know somebody. So the sort of innovation behind Demandbase was, and it still continues to be to actually bring around and identify who you're talking to, be it online on your website and now even off your website. 
And that allows you to have a much more sort of personalized conversation. Because ultimately in marketing, and perhaps even in sales, it comes down to having a personal conversation. So that's really what it is, which is, if you could have a billion people who could talk to every person coming to your website in a personalized manner, that would be fantastic. But that's just not possible. >> So, how do you identify a person before they even get to a vendor's website so that you can start on a personalized level? >> Right, so Demandbase has been building this for a long time, but really, it's a hard problem. And it's harder now than ever before because of security and privacy, lots of hackers out there. People are actually trying to hide, or at least prevent this from leaking out. So, eight, nine years ago, we could buy registries or reverse DNS. But now with ISPs, we are behind probably Comcast or Level 3. So how do you even know who this IP address is even registered to? So about eight years ago, we started mapping IP addresses, 'cause that's how you browse the Internet, to the companies that people work at, right? But it turned out that was no longer effective. So we have built over the last eight years proprietary methods that know how companies relate to the IP addresses that they have. But we have gone to doing partnerships. So when you log into certain websites, we partner with them to identify you if you self-identify at Forbes.com, for example. So when you log in, we do a deal. And we have hundreds of partners and data providers. But now, the state of the art where we are is we are now looking at behavioral signals to identify who you are. >> In other words, not just touch points with partners where they collect an identity. >> Right. >> You have a signature of behavior. >> That's right. >> It's really interesting that humans are very unique. And based on what they're reading online and what they're reading about, you can actually identify a person, and certainly identify enough things about them to know that this is an executive at Tesla who's interested in IOT manufacturing. >> Ah, so you don't need to resolve down to the name level. >> No. >> You need to know sort of the profile. >> Persona, exactly. >> The persona. >> The persona, and that's enough for marketing. So if I knew that this is a C-level supply chain executive from Tesla who lives in Palo Alto and has interests in these areas or problems, that's enough for Siemens to then have an intelligent conversation with this person, even if they're anonymous on their website or if they call on the phone or anything else. >> So, okay, tell us the next step. Once you have a persona, is it Demandbase that helps them put together a personalized? >> Profile. >> Profile, and lead it through the conversation? >> Yeah, so earlier, well, not earlier, but very recently, rebuilding this technology was just a very hard problem. To identify now hundreds of millions of people, I think around 700 million are businesspeople globally, which is the majority of the business world. But we realize that in AI, making recommendations or giving you data in advanced analytics is just not good enough, because you need a way to actually take action and have a personalized conversation, because there are 100 thousand people on your website. Making recommendations, it's just overwhelming for humans to get that much data. So the better sort of idea now that we're working on is just take the action.
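Going back to the identification step for a moment, a toy version of the two signals described, IP-to-company mapping plus reading-behavior personas, might look like the sketch below. The IP ranges, company names, and topic rules are all invented; the real resolution methods are proprietary and far more involved.

```python
from ipaddress import ip_address, ip_network

# Invented lookup tables standing in for proprietary IP and persona data.
IP_RANGES = {
    ip_network("198.51.100.0/24"): "Tesla",
    ip_network("203.0.113.0/24"): "Acme Pharma",
}
PERSONA_RULES = {
    "supply chain executive": {"logistics", "suppliers", "manufacturing"},
    "IT buyer": {"cloud", "security", "integration"},
}

def identify(ip, topics_read):
    # Resolve the company from the visitor's IP address.
    company = next(
        (name for net, name in IP_RANGES.items() if ip_address(ip) in net),
        "unknown company",
    )
    # Pick the persona whose interest profile overlaps most with what was read.
    topics = set(topics_read)
    persona = max(PERSONA_RULES, key=lambda p: len(topics & PERSONA_RULES[p]))
    return company, persona

print(identify("198.51.100.23", ["suppliers", "manufacturing", "iot"]))
```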
So if somebody from Tesla visits your website, and they are an executive who will buy your product, take them to the right application. If they go back and leave your website, then display them the right message in a personalized ad. So it's all about taking actions. And then obviously, whenever possible, guiding humans towards a personalized conversation that will maximize your relationship. >> So, it sounds like sometimes it's anticipating and recommending a next best action. >> Yeah. >> And sometimes, it's your program taking the next best action. >> That's right, because it's just not possible to scale people to take actions. I mean, we have 30, 40 sales reps in Demandbase. We can't handle the volume. And it's difficult to create that personalized letter, right? So we make recommendations, but we've found that it's just too overwhelming. >> Ah, so in other words, when you're talking about recommendations, you're talking about recommendations for Demandbase for? >> Or our clients, employees, or salespeople, right? >> Okay. >> But whenever possible, we are looking to now build systems that in essence are in autopilot mode, and they take the action. They drive themselves. >> Give us some examples of the actions. >> That's right, so some actions could be if you know that a qualified person came to your website, notify the salesperson and open a chat window saying, "This is an executive. "This is similar to a person who will buy "a product from you. "They're looking for this thing. "Do you want to connect with a salesperson?" And obviously, only the people that will buy from you. Or, the action could be, send them an email automatically based on something they will be interested in, and in essence, have a conversation. Right? So it's all about conversation. An ad or an email or a person are just ways of having a conversation, different channels. >> So, it sounds like there was an intermediate marketing automation generation. >> Right. >> After traditional CRM which was reporting. >> Right, that's true. >> Where it was basically, it didn't work until you registered on the website. >> That's right. >> And then, they could email you. They could call you. The inside sales reps. >> That's right. >> You know, if you took a demo, >> That's right. >> you had to put an idea in there. >> And that's still, you know, so when Demandbase came around, that was the predominant between the CRM we were talking about. >> George: Right. >> There was a gap. There was a generation which started to be marketing. It was all about form fills. >> George: Yeah. >> And it was all about nurturing, but I think that's just spam. And today, their effectiveness is close to nothing. >> Because it's basically email or outbound calls. >> Yeah, it's email spam. Do you know we all have email boxes filled with this stuff? And why doesn't it work? Because, not only because it's becoming ineffective and that's one reason. Because they don't know me, right? And it boils down to if the email was really good and it related to what you're looking for or who you are, then it will be effective. But spam, or generic email is just not effective. So it's to some extent, we lost the intimacy. And with the new generation of what we call account-based marketing, we are trying to build intimacy at scale. >> Okay, so tell us more. Tell us first the philosophy behind account-based marketing and then the mechanics of how you do it. >> Sure, really, account-based marketing is nothing new. 
So if you walk into a corporation, they have these really sophisticated salespeople who understand their clients, and they focus on one-on-one, and it's very effective. So if you had Google as a client or Tesla as a client, and you are Siemens, you have two people working and keeping that relationship working 'cause you make millions of dollars. But that's not a scalable model. It's certainly not scalable for startups here to work with or to scale your organization, be more effective. So really, the idea behind account-based marketing is to scale that same efficacy, that same personalized conversation but at higher volume, right? And maximize, and the only way to really do that is using artificial intelligence. Because in essence, we are trying to replicate human behavior, human knowledge at scale. Right? And to be able to harvest and know what somebody who knows about pharma would know. >> So give me an example of, let's stay in pharma for a sec. >> Sure. >> And what are the decision points where based on what a customer does or responds to, you determine the next step or Demandbase determines what next step to take? >> Right. >> What are some of those options? Like a decision tree maybe? >> You can think of it, it's quite faddish in our industry now. It's reinforcement learning which is what Google used in the Go system. >> George: Yeah, AlphaGo. >> AlphaGo, right, and we were inspired by that. And in essence, what we are trying to do is predict not only what will keep you going but where you will win. So we give rewards at each point. And the ultimate goal is to convert you to a customer. So it looks at all your possible futures, and then it figures out in what possible futures you will be a customer. And then it works backwards to figure out where it should take you next. >> Wow, okay, so this is very different from >> They play six months ahead. So it's a planning system. >> Okay. >> Cause your sales cycles are six months ahead. >> So help us understand the difference between the traditional statistical machine learning that is a little more mainstream now. >> Sure. >> Then the deep learning, the neural nets, and then reinforcement learning. >> Right. >> Where are the sweet spots? What are the sweet spots for the problems they solve? >> Yeah, I mean, you know, there's a lot of fad and things out there. In my opinion, you can achieve a lot and solve real-world problems with simpler machine learning algorithms. In fact, for the data science team that I run, I always say, "Start with like the most simplest algorithm." Because if the data is there and you have the intuition, you can get to a 60% F-score or quality with the most naive implementation. >> George: 60% meaning? >> Like accuracy of the model. >> Confidence. >> Confidence. Sure, how good the model is, how precise it is. >> Okay. >> And sure, then you can make it better by using more advanced algorithms. The reinforcement learning, the interesting thing is that its ability to plan ahead. Most machine learning can only make a decision. They are classifiers of sorts, right? They say, is this good or bad? Or, is this blue? Or, is this a cat or not? They're mostly Boolean in nature or you can simulate that in multi-class classifiers. But reinforcement learning allows you to sort of plan ahead. And in CRM or as humans, we're always planning ahead. 
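A stripped-down way to picture that planning behavior is a tiny Markov decision process over a sales journey: a reward only at conversion, and value iteration working backwards to score the next best action. Every state, action, and probability in the sketch below is invented for illustration and is not Demandbase's model.

```python
STATES = ["anonymous", "engaged", "evaluating", "customer"]
ACTIONS = {
    "anonymous": {"show_ad": [("engaged", 0.3), ("anonymous", 0.7)]},
    "engaged": {
        "send_invite": [("evaluating", 0.4), ("engaged", 0.6)],
        "send_email": [("evaluating", 0.2), ("engaged", 0.8)],
    },
    "evaluating": {"sales_call": [("customer", 0.5), ("engaged", 0.5)]},
    "customer": {},  # terminal: the prospect converted
}
REWARD = {"customer": 1.0}
GAMMA = 0.9  # discount: sooner conversions are worth more

def value_iteration(iterations=100):
    V = {s: REWARD.get(s, 0.0) for s in STATES}
    for _ in range(iterations):
        for s in STATES:
            if not ACTIONS[s]:
                continue  # terminal state keeps its fixed reward
            V[s] = max(
                sum(p * GAMMA * V[nxt] for nxt, p in outcomes)
                for outcomes in ACTIONS[s].values()
            )
    return V

def next_best_action(state, V):
    # Score each action by the discounted value of where it tends to lead.
    return max(
        ACTIONS[state],
        key=lambda a: sum(p * GAMMA * V[nxt] for nxt, p in ACTIONS[state][a]),
    )

V = value_iteration()
print(next_best_action("engaged", V))  # favors the path most likely to convert
```

Running it, the "engaged" state favors the action whose outcomes lead toward conversion fastest, which is the next-best-action behavior being described, just without the billions of rows or the six-month horizon.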
You know, a really good salesperson knows that for this stage opportunity or this person in pharma, I need to invite them to the dinner 'cause their friends are coming and they know that last year when they did that, then in the future, that person converted. Right, if they go to the next stage and they, so it plans ahead the possible futures and figures out what to do next. >> So, for those who are familiar with the term AB testing. >> Sure. >> And who are familiar with the notion that most machine learning models have to be trained on data where the answer exists, and they test it out, train it on one set of data >> Sure. >> Where they know the answers, then they hold some back and test it and see if it works. So, how does reinforcement learning change that? >> I mean, it's still testing on supervised models to know. It can be used to derive. You still need data to understand what the reward function would be. Right? And you still need to have historical data to understand what you should give it. And sure, have humans influence it as well, right? At some point, we always need data. Right? If you don't have the data, you're nowhere. And if you don't have, but it also turns out that most of the times, there is a way to either derive the data from some unsupervised method or have a proxy for the data that you really need. >> So pick a key feature in Demandbase and then where you can derive the data you need to make a decision, just as an example. >> Yeah, that's a really good question. We derive datas all the time, right? So, let me use something quite, quite interesting that I wish more companies and people used is the Internet data, right? The Internet today is the largest source of human knowledge, and it actually know more than you could imagine. And even simple queries, so we use the Bing API a lot. And to know, so one of the simple problems we ran into many years ago, and that's when we realized how we should be using Internet data which in academia has been used but not as used as it should be. So you know, you can buy APIs from Bing. And I wish Google would give their API, but they don't. So, that's our next best choice. We wanted to understand who people are. So there's their common names, right? So, George Gilbert is a common name or Alan Fletcher who's my co-founder. And, you know, is that a common name? And if you search that, just that name, you get that name in various contexts. Or co-occurring with other words, you can see that there are many Alan Fletchers, right? Or if you get, versus if you type in my name, Aman Naimat, you will always find the same kind of context. So you will know it's one person or it's a unique name. >> So, it sounds to me that reinforcement learning is online learning where you're using context. It's not perfectly labeled data. >> Right. I think there is no perfectly labeled data. So there's a misunderstanding of data scientists coming out of perfectly labeled data courses from Stanford, or whatever machine learning program. And we realized very quickly that the world doesn't have any perfect labeled data. We think we are going to crowdsource that data. And it turns out, we've tried it multiple times, and after a year, we realized that it's just a waste of time. You can't get, you know, 20 cents or 25 cents per item worker somewhere in wherever to hat and label data of any quality to you. So, it's much more effective to, and we were a startup, so we didn't have money like Google to pay. And even if you had the money, it generally never works out. 
We find it more effective to bootstrap or reuse unsupervised models to actually create data. >> Help us. Elaborate on that, the unsupervised and the bootstrapping where maybe it's sort of like a lawnmower where you give it that first. >> That's right. >> You know, tug. >> I mean, we've used it extensively. So let me give you an example. Let's say you wanted to create a list of cities, right? Or a list of authors, which is the classic example, actually, from a paper written by Sergey Brin. I think he was trying to figure out the names of all authors in the world, and this was 1998. And basically if you search on Google, the term "has written the book," just the term "has written the book," these are called patterns, or Hearst patterns, I think. Then you can imagine that it's also always preceded by a name of a person who's an author. So, "George Gilbert has written the book," and then the name of the book, right? Or "William Shakespeare has written the book X." And you seed it with William Shakespeare, and you get some books. Or you put Shakespeare and you get some authors, right? And then, you use it to learn other patterns that also co-occurred between William Shakespeare and the book. >> George: Ah. >> And then you learn more patterns and you use it to extract more authors. >> And in the case of Demandbase, that's how you go from learning, starting bootstrapping within, say, pharma terminology. >> Yes. >> And learning the rest of pharma terminology. >> And then, using generic terminology to enter an industry, and then learning terminology where we ourselves don't yet understand what it means. For example, I always used this example where if we read a sentence like "Takeda has in-licensed a molecule from Roche," it may mean nothing to us, but it means that they're partnered and bought a product, in pharma lingo. So we use it to learn new language. And it's a common technique. We use it extensively, both. So it comes down to, while we do use highly sophisticated algorithms for some problems, I think most problems can be solved with simple models and thinking through how to apply domain expertise and data intuition and having the data to do it. >> Okay, let's pause on that point and come back to it. >> Sure. >> Because that sounds like a rich vein to explore. So this is George Gilbert on the ground at Demandbase. We'll be right back in a few minutes.
Aman Naimat, Demandbase, Chapter 3 | George Gilbert at HQ
>> This is George Gilbert from Wikibon. We're back on the ground with Aman Naimat at Demandbase. >> Hey. >> And, we're having a really interesting conversation About building next-gen enterprise applications. >> It's getting really deep. (laughing) >> So, so let's look ahead a little bit. >> Sure. >> We've talked in some detail about the foundation technologies. >> Right. >> And you told me before that we have so much technology, you know, still to work with. >> Yeah. >> That is unexploited. That we don't need, you know, a whole lot of breakthroughs, but we should focus on customer needs that are unmet. >> Yeah. >> Let's talk about some problems yet to be solved, but that are customer facing with, as you have told me, existing technology. >> Right, can solve. >> Yes. >> Absolutely, I mean, there's a lot of focus in Silicon Valley about, like, scaling machine learning and investing in, you know, GPUs and what have you. But I think there's enough technology there. So where's the gap? The really gap is in understanding how to build AI applications, and how to monetize it, because it is quite different than building traditional applications. It has different characteristics. You know, so it's much more experimental in nature. Although, you know, with lean engineering, we've moved towards iterative to (mumbles) development, for example. Like, for example, 90% of the time, I, you know, after 20 years of building software, I'm quite confident I can build software. It turns out, in the world of data science and AI driven, or AI applications, you can't have that much confidence. It's a lot more like discovering molecules in pharma. So you have to experiment more often, and methods have to be discovered, there's more discovery and less engineering in the early stages. >> Is the discovery centered on do you have the right data? >> Yeah, or are you measuring the right thing, right? If you thought you were going to maximize, work the model to maximize revenue, but really, maybe the end function should be increasing engagement with the customer. So, often, we don't know the end objective function, or incorrectly guess the right or wrong objective function. The only way to do that is to be able to build and end-to-end system in days, and then iterate through the different models in hours and days as quickly as possible with the end goal and customer in mind. >> This is really fascinating because we were, some of the research we're doing is on the really primitive capabilities of the, sort of, analytic data pipeline. >> Yes. >> Where, you know, all the work that has to do with coming up with the features. >> Yeah. >> And then plugging that into a model, and then managing the model's life cycle. That those, that whole process is so fragmented. >> Yeah. >> And it's, you know, chewing gum and bailing wire. >> Sure. >> And I imagine that that's slows dramatically that experimentation process. >> I mean, it slows it down, but it's also mindset, right? >> Okay. >> So, now that we have built, you know, we probably have a hundred machine learning models now, Demandbase, that I've contributed to the build with our data scientists, and in the end we've found out that you can actually do something in a day or two with extremely small amount of data over, using Python and SKLearn, today, very quickly, that will give you, and then, you know, build some simple UI that a human can evaluate and give feedback or whatever action you're trying to get to. 
And get to that as quickly as possible, rather than worrying about the pipelines, rather than worry about everything else, because in 80% of the cases, it will fail anyways. Or you will realize that either you don't have the right data, or nobody wants it, or it can never be done, or you need to approach it completely differently, from a completely different objective function. >> Let me parse what you've said in a different way. >> Sure. >> And see if I understand it. Traditional model building is based, not on sampling, but on the full data set. >> That's right. >> And what you're saying, in terms of experimentation. >> Start doing that, yes. >> Is to go back to samples. >> That's right. Go back to, there's a misunderstanding that we need, you know, while Demandbase processes close to a trillion rows of data today, we found that almost all big data, AI solutions can be initially proven with very small amounts of data, and a small amount of features. And if they don't work there, if you cannot take a hundred rows of data and have a human look at some rows and make a judgment, then it's not possible, most likely, with one billion, or with ten billion. So, if you cannot make it work, now there are exceptions to this, but in 90% of the cases, if the solution doesn't work at, you know, a few thousand or a million rows of data, it's not going to work at scale either. Now the problem is that all the easy, you know, libraries and open-source stuff that's out there, it's all designed to be workable in small amounts of data. So, what we don't want to do is build this whole massive infrastructure, which is getting easier, and worrying about data pipelines and putting it all together, only to realize that this is not going to work. Or, more often, it doesn't solve any problem. >> So, if I were to sort of boil that down into product terms. >> Yeah. >> The notion that you could have something like Spark running on your laptop. >> Yeah. >> And scaling out to a big cluster. >> Yeah, just run it on a laptop. >> That, yeah. >> In fact, you don't even need Spark. >> Or, I was going to say, not even Spark. >> No. >> Just use Python. >> Just scikit-learn is much better for something like this. >> It's almost like, this is, so it's back to Visual Basic. You know, you're not going to build a production app in >> I wouldn't go that far. >> Well >> It's a prototype. >> No I meant for the prototype GUI app you do in Visual Basic, and then, you know, when you're going to build a production one, you use Microsoft Foundation Class. >> Because most often, right, more often, you don't have the right data, you have the wrong objective function, or your customer is not happy with the results or wants to modify. And that's true for conventional business applications, the old school whatever internet applications. But it is more true here because the data is much more noisy, the problems are much more complex, and ultimately you need to be able to take real world action, and so build something that can take the real world action, be it for a very narrow problem or use case. And get to it, even without any model. And the first model that I recommend, or I do, or my data scientists do, is to just do it yourself by hand. Just label the data and say, as if, let's pretend that this was the right answer, and we can take this action and the workflow works, like, did something good happen? You know, will it be something that will satisfy some problem? And if that's not true, then why build it? And you can do that manually, right?
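In that spirit, a day-one prototype of the kind being described might be nothing more than a handful of hand-labeled rows, scikit-learn, and a human eyeballing a few predictions before any pipeline or infrastructure work. The tiny dataset and labels below are invented purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of invented, hand-labeled rows: 1 = looks like a buyer, 0 = not.
texts = [
    "supply chain executive researching IOT sensors",
    "CFO downloading pricing and procurement whitepaper",
    "student reading a blog post about electric cars",
    "intern browsing the careers page",
]
labels = [1, 1, 0, 0]

# The whole "pipeline" is one line; no infrastructure, no feature store.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Eyeball a few predictions by hand instead of building a full evaluation
# harness; if nothing useful shows up here, it is unlikely to at a billion rows.
for t in ["procurement manager comparing IOT vendors",
          "student looking for internships"]:
    print(t, "->", model.predict([t])[0])
```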
So I guess it's no different than any other entrepreneurial endeavor. But it's more true in data science projects, firstly, because they're more likely to be wrong than I think we have learned now how to build good software. >> Imperative software. >> The imperative software. And data science is called data science for a reason. It's much more experimental, right? Like, in science, you don't know. A negative experiment is a fine experiment. >> This is actually, of all that we've been talking about, it might sound the most abstract, but it's also the most profound because what you're saying is this elaborate process and the technology to support it, you know, this whole pipeline, that it's like you only do that once you've proven the prototype. >> That's right. And get the prototype in a day. >> You don't want that elaborate structure and process when you're testing something out. >> No, yeah, exactly. And, you know, like when we build our own machine learning models, obviously coming out of academia, you know, there was a class project that it took us a year or six months to really design the best models, and test it, and prove it out intrinsic, intrinsic testing, and we knew it was working. But what we should really have done, what should we do now is we build models, we do experiments daily. And get to, in essence, the patient with our molecule every day, so, you know, we have the advantage given that we entail the marketing, that we can get to test our molecules or drugs on a daily basis. And we have enough data to test it, and we have enough customers, thankfully, to test it. And some of them are collaborating with us. So, we get to end solution on a daily basis. >> So, now I understand why you said, we don't need these radical algorithmic breakthroughs or, you know, new super, turbo-charged processors. So, with this approach of really fast prototyping, what are some of the unmet needs in, you know, it's just a matter of cycling through these experiments? >> Yeah, so I think one of the biggest unmet need today, we're able to understand language, we're able to predict who should you do business with and what should you talk about, but I think natural language generation, or creating a personalized email, really personalized and really beautifully written, is still something that we haven't quite, you know, have a full grasp on. And to be able to communicate at human level personalization, to be able to talk, you know, we can generate ads today, but that's not really, you know, language, right? It is language, but not as sophisticated as what we're talking here. Or to be able to generate text or have a bot speak to you, right? We can have a bot, we can now understand and respond in text, but really speak to you fluently with context about you is definitely an area we're heavily investing in, or looking to invest in in the near future. >> And with existing technology. >> With existing technology. I think, we think if you can narrow it down, we can generate emails that are much better than what are salesperson would write. In fact, we already have a product that can personalize a website, automatically, using AI, reinforcement learning, all the data we have. And it can rewrite a website to be customized for each visitor, personalized to each visitor, >> Give us an example of what. >> So, you know, for example if you go to Siemens or SAP and you come from pharma, it will take you and surface different content about pharmaceuticals. 
And, you know, in essence, at some point you can generate a whole page that's personalized, so if somebody comes from pharma, a CFO versus an IT person, it will change the entire page content, right? To that, to, in essence, the entire buyer journey could be personalized. Because, you know, today buying in B2B, it's quite jarring, it's filled with spam, it's, you know, it's not a pleasant experience. It's not a concierge level experience. And really, in an ideal world, you want B2B or marketing to be personalized. You want it to be like you're being, you know, guided through, if you need something, you can ask a question and you have a personalized assistant talking to you about it. >> So that there's, the journey is not coded in. >> It isn't, yeah. >> The journey, or the conversation response reacts to the >> To the customer. >> To the customer. >> Right, and B2B buyers want, you know, they want something like that. They don't have time to waste on it. Who wants to be lost on a website? >> Right. >> You know, you go to any Fortune 500 company's website and you, it's a mess. >> Okay, so let's back up to Demandbase in the Bay Area software ecosystem. >> Sure. >> So, SalesForce is a big company. >> Yes. >> Marketing is one of their pillars. >> Yes. >> Tell us, what is it about this next gen technology that is so, we touched on this before, but so anathema to the way traditional software companies build their products? >> Yeah, I mean, SalesForce is a very close partner, they're a customer, we work with them very closely. I think they're also an investor, a small investor, for Demandbase. We have a deep relationship with them. And I, myself, come from the traditional software background, you know, I've been building CRM, so I'll talk about myself, because I've seen how different it is and, you know, I had to sort of transition at a very early stage from a human centric CRM to a data driven CRM, or a human driven versus data driven. And it's, you have to think about things differently. So, one difference is that, when you look at data in human driven CRM, you trust it implicitly because somebody in your org put it in. You may challenge it, it's old, it's stale, but there's no fear that it's a machine recommending you and driving you. And it requires the interfaces to be much different. You have to think about how do you build trust between the person, you know, who's being driven in a Tesla, also, a similar problem. And, you know, how do you give them the controls so they can turn off the autopilot, right? And how do you, you know, take feedback from humans to improve the models? So, it's a different way that the human interface even becomes more different, and simpler. The other interesting thing is that if you look at traditional applications, they're quite complicated. They have all these fields because, you know, you just enter all this data and you type it in. But the way you interact with our application is that we already know everything, or a lot. So, why bother asking you? We already know where you are, who you are, what you should do, so we are in essence guiding you, more of a, using the Tesla autopilot example, it already knows where you are. It knows you're sitting in the car and it knows that you need to brake because, you know, you're going to crash, so it'll just brake by itself. So, you know, the interface is. >> That's really an interesting analogy. Tesla is a data driven piece of software. >> It is. >> Whereas, you know, my old BMW or whatever is a human driven piece of software.
>> And there's some things in the middle. So, I recently, I mean, looking at cars, I just had a baby, and Volvo is something in the middle. Where, if you're going to have an accident or somebody comes close, it blinks. So, it's like advanced analytics, right? Which is analogous to that. Tesla just stops if you're going to have an accident. And that's the right idea, because if I'm going to have an accident, you don't want to rely on me to look at some light, what if I'm talking on the phone or looking at my kid? You know, some blinking light over there. Which is why advanced analytics hasn't been as successful as it should be. >> Because the hand off between the data driven and the human driven is a very difficult hand off. >> It's a very difficult hand off. And whenever possible, the right answer for us today is if you know everything, and you can take the action, like if you're going to have an accident just stop. Or, if you need to go, go, right? So if you come out in the morning, you know, and you go to work at 9 am, it should just put itself out, like, you know, why wait for human to, you know, get rid of all the monotonous problems that we ourselves have, right? >> That's a great example. On that note, let's break and this is George Gilbert. I'm with, and having a great conversation with Aman Naimat, Senior VP and CTO of Demandbase, and we will be back shortly with a member of the data science team. >> Thank you, George.
Wikibon Conversation with John Furrier and George Gilbert
(upbeat electronic music) >> Hello, everyone. Welcome to the Cube Studios in Palo Alto, California. I'm John Furrier, the co-host of the Cube and co-founder of SiliconANGLE Media Inc. I'm here with George Gilbert for a Wikibon conversation on the state of big data. George Gilbert is the analyst at Wikibon covering big data. George, great to see you. Looking good. (laughing) >> Good to see you, John. >> So George, you're obviously covering big data. Everyone knows you. You always ask the tough questions, you're always drilling down, going under the hood, and really inspecting all the trends, and also looking at the technology. What are you working on these days as the big data analyst? What's the hot thing that you're covering? >> OK, so, what's really interesting is we've got this emerging class of applications. The name that we've used so far is modern operational analytic applications. Operational in the sense that they help drive business operations, but analytical in the sense that the analytics either inform or drive transactions, or anticipate and inform interactions with people. That's the core of this class of apps. And then there are some sort of big challenges that customers are having in trying to build, and deploy, and operate these things. That's what I want to go through. >> George, you know, this is a great piece. I can't wait to (mumbling) some of these questions and ask you some pointed questions. But I would agree with you that to me, the number one thing I see customers either fumbling with or accelerating value with is how to operationalize some of the data in a way they've never done before. So you start to see disciplines come together. You're starting to see people with a notion of digital business being something that's not a department, it's not a marketing department. Data is everywhere, it's horizontally scalable, and the smart executives are really looking at new operational tactics to do that. With that, let me kick off the first question to you. People are trying to balance the cloud, on-premises, and the edge, OK. And that's classic, you're seeing that now. I've got a data center, I have to go to the cloud, a hybrid cloud. And now the edge of the network. We were just talking about blockchain today, there's this huge problem. They've got to balance that, but they've got to balance it versus leveraging specialized services. How do you respond to that? What is your reaction? What is your presentation? >> OK, so let's turn it into something really concrete that everyone can relate to, and then I'll generalize it. The concrete version is, for a number of years, everyone associated Hadoop with big data. And Hadoop, you tried to stand up as a cluster on your own premises, for the most part. You had EMR, but sort of the big company activity outside, even including the big tech companies, was to stand up a Hadoop cluster as a pilot and start building a data lake. Then see what you could do with sort of huge amounts of data that you couldn't normally sort of collect and analyze. The operational challenges of standing up that sort of cluster were rather overwhelming, and I'll explain that later, so sort of park that thought. Because of that complexity, more and more customers, all but the most sophisticated, are saying we need a cloud strategy for that. But once you start taking Hadoop into the cloud, the components of this big data analytic system, you have tons more alternatives.
So whereas in Cloudera's version of Hadoop you had Impala as your MPP SQL database, on Amazon you've got Amazon Redshift, you've got Snowflake, you've got dozens of MPP SQL databases. And so the whole playing field shifts. And not only that, Amazon has instrumented their, in that particular case, their application to be more of a managed service, so there's a whole lot less for admins to do. And you take that on sort of, if you look at the slides, you take every step in that pipeline. And when you put it on a different cloud, it's got different competitors. And even if you take the same step in a pipeline, let's say Spark on HDFS to do your ETL, and your analysis, and your shaping of data, and even some of the machine learning, you put that on Azure and on Amazon, it's actually on a different storage foundation. So even if you're using the same component, it's different. There's a lot of complexity and a lot of trade offs that you've got to make. >> Is that a problem for customers? >> Yes, because all of a sudden, they have to evaluate what those trade offs are. They have to evaluate the trade off around specialization: do I use the best-of-breed thing on one platform? And if I do, it's not compatible with what I might be running on prem. >> That'll slow a lot of things down. I can tell you right now, people want to have the same code base on all environments, and then just have the same seamless operational model. OK, that's a great point, George. Thanks for sharing that. The second point here is harmonizing and simplifying management across hybrid clouds. Again, back to your point. You set that up beautifully. Great example, open source innovation hits a roadblock. And the roadblock is incompatible components in multiple clouds. That's a problem. It's a management nightmare. How does harmonization across hybrid clouds work? >> You couldn't have asked it better. Let me put it up in terms of an X Y chart where on the x-axis, you have the components of an analytic pipeline: ingest, process, analyze, predict, serve. But then on the y-axis, this is for an admin, not a developer, these are just some of the tasks they have to worry about: data governance, performance monitoring, scheduling and orchestration, availability and recovery, that whole list. Now, if you have a different product for each step in that pipeline, and each product has a different way of handling all those admin tasks, you're basically taking all the unique activities on the y-axis, multiplying them by all the unique products on the x-axis, and you have overwhelming complexity, even if these are managed services in the cloud. Here now you've got several trade offs. Do I use the specialized products that you would call best of breed? Do I try and do end to end integration so I get simplification across the pipeline? Or do I use products that I had on-prem, like you were saying, so that I have seamless compatibility? Or do I use the cloud vendors'? That's a tough trade off. There's another similar one for developers. Again, on the y-axis, for all the things that a developer would have to deal with, not all of them, just a sample: the data model and the data itself, how to address it, the programming model, the persistence. So on that y-axis, you multiply all those different things you have to master for each product. And then on the x-axis, all the different products in the pipeline. And you have that same trade off, again. >> Complexity is off the charts. >> Right.
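To make the trade off concrete, here is a minimal PySpark sketch of the point above: the same ETL step can run unchanged, but pointing it at S3 versus Azure storage swaps the storage foundation underneath. The paths and column names are hypothetical, and the cloud-specific connectors and credentials are assumed to be configured on the cluster.

```python
# Minimal sketch: the same Spark ETL step on two different storage foundations.
# Paths and columns are hypothetical; the S3A and ABFS connectors plus
# credentials are assumed to be configured on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cross-cloud-etl").getOrCreate()

# Same logical step, different cloud storage underneath.
SOURCE = "s3a://raw-events/clickstream/"                               # on AWS
# SOURCE = "abfss://raw@lakeacct.dfs.core.windows.net/clickstream/"    # on Azure

events = spark.read.json(SOURCE)

daily = (events
         .withColumn("day", F.to_date("event_ts"))
         .groupBy("day", "account_id")
         .agg(F.count("*").alias("events")))

daily.write.mode("overwrite").parquet("s3a://curated/daily_activity/")
```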
And you can trade end to end integration to simplify the complexity, but we don't really have products that are fully fleshed out and mature that stretch from one end of the pipeline to the other, so that's a challenge. Alright. Let's talk about another way of looking at management. This was looking at the administrators and the developers. Now, we're getting better and better software for monitoring performance and operations, and trying to diagnose root cause when something goes wrong and then remediate it. There are two real approaches. One is you go really deep, but on a narrow part of your application and infrastructure landscape. And that narrow part might be, you know, your analytic pipeline, your big data. The broad approach is to get end to end visibility across the edge with your IoT devices, across on-prem, perhaps even across multiple clouds. That's the breadth approach, end to end visibility. Now, there's a trade off here too as in all technology choices. When you go deep, you have bounded visibility, but that bounded visibility allows you to understand exactly what is in that set of services, how they fit together, how they work. Because the vendor, knowing that they're only giving you management of your big data pipeline, they can train their models, their machine learning models, so that whenever something goes wrong, they know exactly what caused it and they can filter out all the false positives, the scattered errors that can confuse administrators. Whereas if you want breadth, you want to see your entire landscape end to end so that you can do capacity planning and see if there was an error way upstream, something might be triggered way downstream or a bunch of things downstream. So the best way to understand this is how much knowledge you have of how all the pieces work together, and how much knowledge you have of how the software pieces fit together. >> This is actually an interesting point. So if I kind of connect the dots for you here, the bounded root cause analysis is where we see a lot of the machine learning, that's where the automation is. >> George: Yeah. >> The unbounded, the breadth, that's where the data volume is. But they can work together, that's what you're saying. >> Yes. And actually, I hadn't even gotten to that, so thanks for teeing it up. >> John: Did I jump ahead on that one? (laughing) >> No, no, you teed it up. (laughing) Because ultimately-- >> Well a lot of people want to know where it's going to be automated away. All the undifferentiated labor at scale can be automated. >> Well, when you talk about them working together. So for the deep, depth-first approach, there's a small company called Unravel Data that has modeled eight million jobs or workloads, big data workloads, from high tech companies, so they know how all that fits together and they can tell you when something goes wrong exactly what went wrong and how to remediate it. Then take something like Rocana or Splunk, they look end to end. The interesting thing that you brought up is at some point, that end to end product is going to be like a data warehouse and the depth products are going to sit on top of it. So you'll have all the contextual data of your end to end landscape, but you'll have the deep knowledge of how things work and what goes wrong sitting on it. >> So just before we jump to the machine learning question which I want to ask you, what you're saying is the industry is evolving to almost looking like a data warehouse model, but in a completely different way. >> Yeah.
Think of it as, another cue. (laughing) >> John: That's what I do, George. I help you out with the cues. (laughing) No, but I mean the data warehouse, everyone knows what that was. A huge industry, created a lot of value, but then the world got rocked by unstructured data. And then their bounded, if you will, view has got democratized. So creative destruction happened which is another word for new entrants came in and incumbents got rattled. But now it's kind of going back to what looks like a data warheouse, but it's completely distributed around. >> Yes. And I was going to do one of my movie references, but-- >> No, don't do it. Save us the judge. >> If you look at this starting in the upper right, that's the data lake where you're collecting all the data and it's for search, it's exploratory. As you get more structure, you get to the descriptive place where you can build dashboards to monitor what's going on. And you get really deep, that's when you have the machine learning. >> Well, the machine learning is hitting the low hanging fruit, and that's where I want to get to next to move it along. Sourcing machine learning capability, let's discuss that. >> OK, alright. Just to set contacts before we get there, notice that when you do end to end visibility, you're really seeing across a broad landscape. And when I'm showing my public cloud big data, that would be depth first just for that component. But you would do breadth first, you could do like a Rocana or a Splunk that then sees across everything. The point I wanted to make was when you said we're reverting back to data warehouses and revisiting that dream again, the management applications started out as saying we know how to look inside machine data and tell you what's going on with your landscape. It turns out that machine data and business operations data, your application data, are really becoming one and the same. So what used to be a transaction, there was one transaction. And that, when you summarized them, that went into the data warehouse. Then we had with systems of engagement, you had about 100 interaction events that you tracked or sort of stored for everything business transaction. And then when we went out to the big data world, it's so resource intensive that we actually had 1,000 to 10,000 infrastructure events for every business transaction. So that's why the data volumes have grown so much and why we had to go back first to data lake, and then curate it to the warehouse. >> Classic innovation story, great. Machine learning. Sourcing machine learning capabilities 'cause that's where the rubber starts hitting the road. You're starting to see clear skies when it comes to where machine learning is starting fit in. Sourcing machine learning capabilities. >> You know, even though we sort of didn't really rehearse this, you're helping cue me on perfectly. Let me make the assertion that with machine learning, we have the same shortage of really trained data scientists that we had when we were trying to stand up Hadoop clusters and do big data analytics. We did not have enough administrators because these were open source components built from essentially different projects, and putting them all together required a huge amount of skills. Data science requires, really, knowledge of algorithms that even really sophisticated programmers will tell you, "Jeez, now I need a PhD "to really understand how this stuff works." 
So the shortage, that means we're not going to get a lot of hand-built machine learning applications for a while. >> John: In a lot of libraries out there right now, you see TensorFlow from Google. Big traction with that application. >> George: But for PhDs, for PhDs. My contention is-- >> John: Well developers too, you could argue developers, but I'm just putting it out there. >> George: I will get to that, actually. A slide just on that. Let me do this one first because my contention is the first big application, widespread application of machine learning, is going to be the depth first management because it comes with a model built in of how all the big data workloads, services, and infrastructure fit together and work together. And if you look at how the machine learning model operates, when it knows something goes wrong, let's say an analytic job takes 17 hours and then just falls over and crashes, the model can actually look at the data layout and say we have way too much on one node, and it can change the settings and change the layout or the data because it knows how all the stuff works. The point about this is the vendor. In this particular example, Unravel Data, they built into their model an understanding of how to keep a big data workload running as opposed to telling the customer, "You have to program it." So that fits into the question you were just asking which is where do you get this talent. When you were talking about like TensorFlow, and Cafe, and Torch, and MXnet, those are all like assembly language. Yes, those are the most powerful places you could go to program machine learning. But the number of people is inversely proportional to the power of those. >> John: Yeah, those are like really unique specialty people. High, you know, the top guys. >> George: Lab coats, rocket scientists. >> John: Well yeah, just high end tier one coders, tier one brains coding away, AI gurus. This is not your working developer. >> George: But if you go up two levels. So go up one level is Amazon machine learning, Spark machine learning. Go up another level, and I'm using Amazon as an example here. Amazon has a vision service called Recognition. They have a speech generation service, Natural Language. Those are developer ready. And when I say developer ready, I mean developer just uses an API, you know, passes in the data that comes out. He doesn't have to know how the model works. >> John: It's kind of like what DevOps was for cloud at the end of the day. This slide is completely accurate in my opinion. And we're at the early days and you're starting to see the platforms develop. It's the classic abstraction layer. Whoever can extract away the complexity as AI and machine learning grows is going to be the winning platform, no doubt about it. Amazon is showing some good moves there. >> George: And you know how they abstracted away. In traditional programming, it was just building higher and higher APIs, more accessible. In machine learning, you can't do that. You have to actually train the models which means you need data. So if you look at the big cloud vendors right now. So Google, Microsoft, Amazon, and IBM. Most of them, the first three, they have a lot of data from their B to C businesses. So you know, people talking to Echo, people talking to Google Assistant or Siri. That's where they get enough of their speech. >> John: So data equals power? >> George: Yes. >> By having data, you have the ingredients. 
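For reference, the "developer ready" level being described, where a developer just calls an API and passes in the data without knowing how the model works, looks roughly like this against the Amazon vision service mentioned above (Amazon Rekognition); the bucket and object names are hypothetical, and AWS credentials are assumed to be configured.

```python
# Sketch of a "developer ready" ML service call: no model building or training,
# just pass in an image and read back labels. Bucket/key names are hypothetical,
# and AWS credentials are assumed to be configured in the environment.

import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "demo-images", "Name": "storefront.jpg"}},
    MaxLabels=10,
    MinConfidence=80.0,
)

for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")
```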
And the more data that you have, the more data that you know about, the more data that has information around it, the more effective it can be to train machine learning algorithms. >> Yes. >> And the benefit comes back to the people who have the data. >> Yes. And so even though your capabilities get narrower, 'cause you could do anything on TensorFlow. >> John: Well, that's why Facebook is getting killed right now just to kind of change tangents. They have all this data and people are very unhappy, they just released that the Russians were targeting anti-semitic advertising, they enabled that. So it's hard to be a data platform and still provide user utility. This is what's going on. Whoever has the data has the power. It was a Frankenstein moment for Facebook. So there's that out there for everyone. How do companies do the right thing? >> And there's also the issue of customer intellectual property protection. As consumers, we're like you can take our voice, you can take all our speech to Siri or to Echo or whatever and get better at recognizing speech because we've given up control of that 'cause we want those services for free. >> Whoever can shift the data value to the users. >> George: To the developers. >> Or to the developers, or communities, better said, will win. >> OK. >> In my opinion, that's my opinion. >> For the most part, Amazon, Microsoft, and Google have similar data assets. For the most part, so far. IBM has something different which is they work closely with their industry customers and they build progressively. They're working with Mercedes, they're working with BMW. They'll work on the connected car, you know, the autonomous car, and they build out those models slowly. >> So George, this slide is really really interesting and I think this should be a roadmap for all customers to look at to try to peg where they are in the machine learning journey. But then the question comes in. They do the blocking and tackling, they have the foundational low level stuff done, they're building the models, they're understanding the mission, they have the right organizational mindset and personnel. Now, they want to orchestrate it and implement it into action. That's the final question. How do you orchestrate the distributed machine learning feedback and the data coherency? How do you get this thing scaling? How do these machines and the training happen so you have the breadth, and then you could bring the machine learning up the curve into the dashboard? >> OK. We've saved the best for last. It's not easy. When I show the chevrons, that's the analytic data pipeline. And imagine in the serve and predict at the very end, let's take an IOT app, a very sophisticated one. which would be an autonomous car. And it doesn't actually have to be an autonomous one, you could just be collected a lot of information off the car to do a better job insuring it, the insurance company. But the key then is you're collecting data on a fleet of cars, right? You're collecting data off each one, but you're also collecting then the fleet. And that, in the cloud, is where you keep improving your model of how the car works. You run simulations to figure out not just how to design better ones in the future, but how to tune and optimize the ones that are on the road now. That's number three. And then in four, you push that feedback back out to the cars on the road. 
And you have to manage, and this is tricky, you have to make sure that the models that you trained in step three are coherent, or the same, when you take out the fleet data and then you put the model for a particular instance of a car back out on the highway. >> George, this is a great example, and I think this slide really represents the modern analytical operational role in digital business. You can't look further than Tesla, this is essentially Tesla, and now all cars as a great example 'cause it's complex, it's an internet (mumbling) device, it's on the edge of the network, it's mobility, it's using 5G. It encapsulates everything that you are presenting, so I think this is example, is a great one, of the modern operational analytic applications that supports digital business. Thanks for joining this Wikibon conversaion. >> Thank you, John. >> George Gilbert, the analyst at Wikibon covering big data and the modern operational analytical system supporting digital business. It's data driven. The people with the data can train the machines that have the power. That's the mandate, that's the action item. I'm John Furrier with George Gilbert. Thanks for watching. (upbeat electronic music)
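A minimal sketch of the coherency check described in steps three and four above, confirming that the model pushed to an individual car matches the fleet model trained in the cloud before it is enabled; the metadata fields and hashing scheme are illustrative assumptions, not any vendor's actual protocol.

```python
# Illustrative sketch: verify that the model pushed to a vehicle matches the
# fleet model trained in the cloud before its recommendations are enabled.
# The metadata fields and hashing scheme are assumptions for illustration.

import hashlib
import json

def model_fingerprint(weights: bytes, metadata: dict) -> str:
    """Hash the serialized weights plus training metadata into one version string."""
    digest = hashlib.sha256()
    digest.update(weights)
    digest.update(json.dumps(metadata, sort_keys=True).encode())
    return digest.hexdigest()

# Cloud side: fingerprint computed after fleet-level training (step three).
fleet_weights = b"...serialized model weights..."
fleet_meta = {"fleet_snapshot": "snapshot-0042", "training_run": 42}
fleet_version = model_fingerprint(fleet_weights, fleet_meta)

# Vehicle side: recompute the fingerprint over what actually arrived (step four).
received_weights, received_meta = fleet_weights, fleet_meta  # simulated transfer
received_version = model_fingerprint(received_weights, received_meta)

if received_version == fleet_version:
    print("Model is coherent with the fleet model; safe to enable.")
else:
    print("Version mismatch; keep the previous model and request a re-sync.")
```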
Breaking Analysis: Databricks faces critical strategic decisions…here’s why
>> From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante. >> Spark became a top level Apache project in 2014, and then shortly thereafter, burst onto the big data scene. Spark, along with the cloud, transformed and in many ways, disrupted the big data market. Databricks optimized its tech stack for Spark and took advantage of the cloud to really cleverly deliver a managed service that has become a leading AI and data platform among data scientists and data engineers. However, emerging customer data requirements are shifting into a direction that will cause modern data platform players generally and Databricks, specifically, we think, to make some key directional decisions and perhaps even reinvent themselves. Hello and welcome to this week's wikibon theCUBE Insights, powered by ETR. In this Breaking Analysis, we're going to do a deep dive into Databricks. We'll explore its current impressive market momentum. We're going to use some ETR survey data to show that, and then we'll lay out how customer data requirements are changing and what the ideal data platform will look like in the midterm future. We'll then evaluate core elements of the Databricks portfolio against that vision, and then we'll close with some strategic decisions that we think the company faces. And to do so, we welcome in our good friend, George Gilbert, former equities analyst, market analyst, and current Principal at TechAlpha Partners. George, good to see you. Thanks for coming on. >> Good to see you, Dave. >> All right, let me set this up. We're going to start by taking a look at where Databricks sits in the market in terms of how customers perceive the company and what it's momentum looks like. And this chart that we're showing here is data from ETS, the emerging technology survey of private companies. The N is 1,421. What we did is we cut the data on three sectors, analytics, database-data warehouse, and AI/ML. The vertical axis is a measure of customer sentiment, which evaluates an IT decision maker's awareness of the firm and the likelihood of engaging and/or purchase intent. The horizontal axis shows mindshare in the dataset, and we've highlighted Databricks, which has been a consistent high performer in this survey over the last several quarters. And as we, by the way, just as aside as we previously reported, OpenAI, which burst onto the scene this past quarter, leads all names, but Databricks is still prominent. You can see that the ETR shows some open source tools for reference, but as far as firms go, Databricks is very impressively positioned. Now, let's see how they stack up to some mainstream cohorts in the data space, against some bigger companies and sometimes public companies. This chart shows net score on the vertical axis, which is a measure of spending momentum and pervasiveness in the data set is on the horizontal axis. You can see that chart insert in the upper right, that informs how the dots are plotted, and net score against shared N. And that red dotted line at 40% indicates a highly elevated net score, anything above that we think is really, really impressive. And here we're just comparing Databricks with Snowflake, Cloudera, and Oracle. And that squiggly line leading to Databricks shows their path since 2021 by quarter. And you can see it's performing extremely well, maintaining an elevated net score and net range. 
Now it's comparable in the vertical axis to Snowflake, and it consistently is moving to the right and gaining share. Now, why did we choose to show Cloudera and Oracle? The reason is that Cloudera got the whole big data era started and was disrupted by Spark, and of course the cloud, Spark, and Databricks. And Oracle, in many ways, was the target of early big data players like Cloudera. Take a listen to Cloudera's CEO at the time, Mike Olson. This is back in 2010, first year of theCUBE, play the clip. >> Look, back in the day, if you had a data problem, if you needed to run business analytics, you wrote the biggest check you could to Sun Microsystems, and you bought a great big, single box, central server, and any money that was left over, you handed to Oracle for database licenses and you installed that database on that box, and that was where you went for data. That was your temple of information. >> Okay? So Mike Olson implied that monolithic model was too expensive and inflexible, and Cloudera set out to fix that. But the best laid plans, as they say, George, what do you make of the data that we just shared? >> So where Databricks has really come up out of sort of Cloudera's tailpipe was they took big data processing, made it coherent, made it a managed service so it could run in the cloud. So it relieved customers of the operational burden. Where they're really strong, and where their traditional meat and potatoes or bread and butter is, is the predictive and prescriptive analytics, that is, building and training and serving machine learning models. They've tried to move into traditional business intelligence, the more traditional descriptive and diagnostic analytics, but they're less mature there. So what that means is, the reason you see Databricks and Snowflake kind of side by side is there are many, many accounts that have both Snowflake for business intelligence and Databricks for AI machine learning. Where Snowflake, I'm sorry, where Databricks also did really well was in core data engineering, refining the data, the old ETL process, which kind of turned into ELT, where you loaded into the analytic repository in raw form and refine it. And so people have really used both, and each is trying to get into the other. >> Yeah, absolutely. We've reported on this quite a bit. Snowflake, kind of moving into the domain of Databricks and vice versa. And the last bit of ETR evidence that we want to share in terms of the company's momentum comes from ETR's Round Tables. They're run by Erik Bradley and by Daren Brabham, a former Gartner analyst and, George, your colleague back at Gartner. And what we're going to show here is some direct quotes of IT pros in those Round Tables. There's a data science head and a CIO as well. I'll just make a few call outs here, we won't spend too much time on it, but starting at the top, like all of us, we can't talk about Databricks without mentioning Snowflake. Those two get us excited. The second comment zeroes in on the flexibility and the robustness of Databricks from a data warehouse perspective. And then the last point is, despite competition from cloud players, Databricks has reinvented itself a couple of times over the years. And George, we're going to lay out today a scenario that perhaps calls for Databricks to do that once again. >> Their big opportunity and their big challenge for every tech company, it's managing a technology transition. The transition that we're talking about is something that's been bubbling up, but it's really epochal.
First time in 60 years, we're moving from an application-centric view of the world to a data-centric view, because decisions are becoming more important than automating processes. So let me let you sort of develop. >> Yeah, so let's talk about that here. We going to put up some bullets on precisely that point and the changing sort of customer environment. So you got IT stacks are shifting is George just said, from application centric silos to data centric stacks where the priority is shifting from automating processes to automating decision. You know how look at RPA and there's still a lot of automation going on, but from the focus of that application centricity and the data locked into those apps, that's changing. Data has historically been on the outskirts in silos, but organizations, you think of Amazon, think Uber, Airbnb, they're putting data at the core, and logic is increasingly being embedded in the data instead of the reverse. In other words, today, the data's locked inside the app, which is why you need to extract that data is sticking it to a data warehouse. The point, George, is we're putting forth this new vision for how data is going to be used. And you've used this Uber example to underscore the future state. Please explain? >> Okay, so this is hopefully an example everyone can relate to. The idea is first, you're automating things that are happening in the real world and decisions that make those things happen autonomously without humans in the loop all the time. So to use the Uber example on your phone, you call a car, you call a driver. Automatically, the Uber app then looks at what drivers are in the vicinity, what drivers are free, matches one, calculates an ETA to you, calculates a price, calculates an ETA to your destination, and then directs the driver once they're there. The point of this is that that cannot happen in an application-centric world very easily because all these little apps, the drivers, the riders, the routes, the fares, those call on data locked up in many different apps, but they have to sit on a layer that makes it all coherent. >> But George, so if Uber's doing this, doesn't this tech already exist? Isn't there a tech platform that does this already? >> Yes, and the mission of the entire tech industry is to build services that make it possible to compose and operate similar platforms and tools, but with the skills of mainstream developers in mainstream corporations, not the rocket scientists at Uber and Amazon. >> Okay, so we're talking about horizontally scaling across the industry, and actually giving a lot more organizations access to this technology. So by way of review, let's summarize the trend that's going on today in terms of the modern data stack that is propelling the likes of Databricks and Snowflake, which we just showed you in the ETR data and is really is a tailwind form. So the trend is toward this common repository for analytic data, that could be multiple virtual data warehouses inside of Snowflake, but you're in that Snowflake environment or Lakehouses from Databricks or multiple data lakes. And we've talked about what JP Morgan Chase is doing with the data mesh and gluing data lakes together, you've got various public clouds playing in this game, and then the data is annotated to have a common meaning. In other words, there's a semantic layer that enables applications to talk to the data elements and know that they have common and coherent meaning. 
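A minimal sketch of the Uber-style decision George describes, matching a free driver to a rider and quoting an ETA and fare from data that carries the same meaning across drivers, riders, and routes; the coordinates, speed, and pricing constants are made-up assumptions.

```python
# Illustrative sketch of the Uber-style decision: match the nearest free driver,
# then quote an ETA and fare. All data values and pricing constants are made up.

from math import dist

drivers = [
    {"id": "d1", "loc": (37.78, -122.41), "free": True},
    {"id": "d2", "loc": (37.76, -122.43), "free": False},
    {"id": "d3", "loc": (37.79, -122.39), "free": True},
]

AVG_SPEED_KM_PER_MIN = 0.5      # assumed average city speed
BASE_FARE, PER_KM = 2.50, 1.75  # assumed pricing

def km(a, b):
    # Rough conversion: one degree of lat/long ~ 111 km (good enough for a sketch).
    return dist(a, b) * 111

def match_and_quote(rider_loc, dest_loc):
    free = [d for d in drivers if d["free"]]
    if not free:
        return None
    driver = min(free, key=lambda d: km(d["loc"], rider_loc))
    pickup_eta = km(driver["loc"], rider_loc) / AVG_SPEED_KM_PER_MIN
    trip_km = km(rider_loc, dest_loc)
    return {"driver": driver["id"],
            "pickup_eta_min": round(pickup_eta, 1),
            "trip_eta_min": round(trip_km / AVG_SPEED_KM_PER_MIN, 1),
            "fare": round(BASE_FARE + PER_KM * trip_km, 2)}

print(match_and_quote((37.77, -122.42), (37.80, -122.40)))
```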
So George, the good news is this approach is more effective than the legacy monolithic models that Mike Olson was talking about, so what's the problem with this in your view? >> So today's data platforms added immense value 'cause they connected the data that was previously locked up in these monolithic apps or on all these different microservices, and that supported traditional BI and AI/ML use cases. But now if we want to build apps like Uber or Amazon.com, where they've got essentially an autonomously running supply chain and e-commerce app where humans only care for and feed it, but the thing itself is figuring out what to buy, when to buy, where to deploy it, when to ship it, then we need a semantic layer on top of the data. So that, as you were saying, the data that's coming from all those different apps is integrated, not just connected, it means the same thing. And the issue is whenever you add a new layer to a stack to support new applications, there are implications for the already existing layers, like can they support the new layer and its use cases? So for instance, if you add a semantic layer that embeds app logic with the data rather than vice versa, which is what we've been talking about and that's been the case for 60 years, then the new data layer faces challenges, in that the way you manage that data, the way you analyze that data, is not supported by today's tools. >> Okay, so actually Alex, bring me up that last slide if you would, I mean, you're basically saying at the bottom here, today's repositories don't really do joins at scale. The future is you're talking about hundreds or thousands or millions of data connections, and today's systems, we're talking about, I don't know, 6, 8, 10 joins, and that is the fundamental problem you're saying, is a new data era coming and existing systems won't be able to handle it? >> Yeah, one way of thinking about it is that even though we call them relational databases, when we actually want to do lots of joins or when we want to analyze data from lots of different tables, we created a whole new industry for analytic databases where you sort of munge the data together into fewer tables. So you didn't have to do as many joins because the joins are difficult and slow. And when you're going to arbitrarily join thousands, hundreds of thousands or across millions of elements, you need a new type of database. We have them, they're called graph databases, but to query them, you go back to the prerelational era in terms of their usability. >> Okay, so we're going to come back to that and talk about how you get around that problem. But let's first lay out what we think the ideal data platform of the future looks like. And again, we're going to come back to use this Uber example. In this graphic that George put together, awesome. We got three layers. The application layer is where the data products reside. The example here is drivers, rides, maps, routes, ETA, et cetera. The digital version of what we were talking about in the previous slide, people, places and things. The next layer is the data layer, that breaks down the silos and connects the data elements through semantics and everything is coherent. And then the bottom layer, the legacy operational systems, feeds that data layer. George, explain what's different here, the graph database element, you talk about the relational query capabilities, and why can't I just throw memory at solving this problem?
>> Some of the graph databases do throw memory at the problem and maybe without naming names, some of them live entirely in memory. And what you're dealing with is a prerelational in-memory database system where you navigate between elements, and the issue with that is we've had SQL for 50 years, so we don't have to navigate, we can say what we want without how to get it. That's the core of the problem. >> Okay. So if I may, I just want to drill into this a little bit. So you're talking about the expressiveness of a graph. Alex, if you'd bring that back out, the fourth bullet, expressiveness of a graph database with the relational ease of query. Can you explain what you mean by that? >> Yeah, so graphs are great because when you can describe anything with a graph, that's why they're becoming so popular. Expressive means you can represent anything easily. They're conducive to, you might say, in a world where we now want like the metaverse, like with a 3D world, and I don't mean the Facebook metaverse, I mean like the business metaverse when we want to capture data about everything, but we want it in context, we want to build a set of digital twins that represent everything going on in the world. And Uber is a tiny example of that. Uber built a graph to represent all the drivers and riders and maps and routes. But what you need out of a database isn't just a way to store stuff and update stuff. You need to be able to ask questions of it, you need to be able to query it. And if you go back to prerelational days, you had to know how to find your way to the data. It's sort of like when you give directions to someone and they didn't have a GPS system and a mapping system, you had to give them turn by turn directions. Whereas when you have a GPS and a mapping system, which is like the relational thing, you just say where you want to go, and it spits out the turn by turn directions, which let's say, the car might follow or whoever you're directing would follow. But the point is, it's much easier in a relational database to say, "I just want to get these results. You figure out how to get it." The graph database, they have not taken over the world because in some ways, it's taking a 50 year leap backwards. >> Alright, got it. Okay. Let's take a look at how the current Databricks offerings map to that ideal state that we just laid out. So to do that, we put together this chart that looks at the key elements of the Databricks portfolio, the core capability, the weakness, and the threat that may loom. Start with the Delta Lake, that's the storage layer, which is great for files and tables. It's got true separation of compute and storage, I want you to double click on that George, as independent elements, but it's weaker for the type of low latency ingest that we see coming in the future. And some of the threats highlighted here. AWS could add transactional tables to S3, Iceberg adoption is picking up and could accelerate, that could disrupt Databricks. George, add some color here please? >> Okay, so this is the sort of a classic competitive forces where you want to look at, so what are customers demanding? What's competitive pressure? What are substitutes? Even what your suppliers might be pushing. Here, Delta Lake is at its core, a set of transactional tables that sit on an object store. So think of it in a database system, this is the storage engine. So since S3 has been getting stronger for 15 years, you could see a scenario where they add transactional tables. 
We have an open source alternative in Iceberg, which Snowflake and others support. But at the same time, Databricks has built an ecosystem out of tools, their own and others, that read and write to Delta tables, that's what makes the Delta Lake and ecosystem. So they have a catalog, the whole machine learning tool chain talks directly to the data here. That was their great advantage because in the past with Snowflake, you had to pull all the data out of the database before the machine learning tools could work with it, that was a major shortcoming. They fixed that. But the point here is that even before we get to the semantic layer, the core foundation is under threat. >> Yep. Got it. Okay. We got a lot of ground to cover. So we're going to take a look at the Spark Execution Engine next. Think of that as the refinery that runs really efficient batch processing. That's kind of what disrupted the DOOp in a large way, but it's not Python friendly and that's an issue because the data science and the data engineering crowd are moving in that direction, and/or they're using DBT. George, we had Tristan Handy on at Supercloud, really interesting discussion that you and I did. Explain why this is an issue for Databricks? >> So once the data lake was in place, what people did was they refined their data batch, and Spark has always had streaming support and it's gotten better. The underlying storage as we've talked about is an issue. But basically they took raw data, then they refined it into tables that were like customers and products and partners. And then they refined that again into what was like gold artifacts, which might be business intelligence metrics or dashboards, which were collections of metrics. But they were running it on the Spark Execution Engine, which it's a Java-based engine or it's running on a Java-based virtual machine, which means all the data scientists and the data engineers who want to work with Python are really working in sort of oil and water. Like if you get an error in Python, you can't tell whether the problems in Python or where it's in Spark. There's just an impedance mismatch between the two. And then at the same time, the whole world is now gravitating towards DBT because it's a very nice and simple way to compose these data processing pipelines, and people are using either SQL in DBT or Python in DBT, and that kind of is a substitute for doing it all in Spark. So it's under threat even before we get to that semantic layer, it so happens that DBT itself is becoming the authoring environment for the semantic layer with business intelligent metrics. But that's again, this is the second element that's under direct substitution and competitive threat. >> Okay, let's now move down to the third element, which is the Photon. Photon is Databricks' BI Lakehouse, which has integration with the Databricks tooling, which is very rich, it's newer. And it's also not well suited for high concurrency and low latency use cases, which we think are going to increasingly become the norm over time. George, the call out threat here is customers want to connect everything to a semantic layer. Explain your thinking here and why this is a potential threat to Databricks? >> Okay, so two issues here. What you were touching on, which is the high concurrency, low latency, when people are running like thousands of dashboards and data is streaming in, that's a problem because SQL data warehouse, the query engine, something like that matures over five to 10 years. 
It's one of these things, the joke that Andy Jassy makes just in general, he's really talking about Azure, but there's no compression algorithm for experience. The Snowflake guy started more than five years earlier, and for a bunch of reasons, that lead is not something that Databricks can shrink. They'll always be behind. So that's why Snowflake has transactional tables now and we can get into that in another show. But the key point is, so near term, it's struggling to keep up with the use cases that are core to business intelligence, which is highly concurrent, lots of users doing interactive query. But then when you get to a semantic layer, that's when you need to be able to query data that might have thousands or tens of thousands or hundreds of thousands of joins. And that's a SQL query engine, traditional SQL query engine is just not built for that. That's the core problem of traditional relational databases. >> Now this is a quick aside. We always talk about Snowflake and Databricks in sort of the same context. We're not necessarily saying that Snowflake is in a position to tackle all these problems. We'll deal with that separately. So we don't mean to imply that, but we're just sort of laying out some of the things that Snowflake or rather Databricks customers we think, need to be thinking about and having conversations with Databricks about and we hope to have them as well. We'll come back to that in terms of sort of strategic options. But finally, when come back to the table, we have Databricks' AI/ML Tool Chain, which has been an awesome capability for the data science crowd. It's comprehensive, it's a one-stop shop solution, but the kicker here is that it's optimized for supervised model building. And the concern is that foundational models like GPT could cannibalize the current Databricks tooling, but George, can't Databricks, like other software companies, integrate foundation model capabilities into its platform? >> Okay, so the sound bite answer to that is sure, IBM 3270 terminals could call out to a graphical user interface when they're running on the XT terminal, but they're not exactly good citizens in that world. The core issue is Databricks has this wonderful end-to-end tool chain for training, deploying, monitoring, running inference on supervised models. But the paradigm there is the customer builds and trains and deploys each model for each feature or application. In a world of foundation models which are pre-trained and unsupervised, the entire tool chain is different. So it's not like Databricks can junk everything they've done and start over with all their engineers. They have to keep maintaining what they've done in the old world, but they have to build something new that's optimized for the new world. It's a classic technology transition and their mentality appears to be, "Oh, we'll support the new stuff from our old stuff." Which is suboptimal, and as we'll talk about, their biggest patron and the company that put them on the map, Microsoft, really stopped working on their old stuff three years ago so that they could build a new tool chain optimized for this new world. >> Yeah, and so let's sort of close with what we think the options are and decisions that Databricks has for its future architecture. They're smart people. I mean we've had Ali Ghodsi on many times, super impressive. I think they've got to be keenly aware of the limitations, what's going on with foundation models. But at any rate, here in this chart, we lay out sort of three scenarios. 
One is re-architect the platform by incrementally adopting new technologies. And example might be to layer a graph query engine on top of its stack. They could license key technologies like graph database, they could get aggressive on M&A and buy-in, relational knowledge graphs, semantic technologies, vector database technologies. George, as David Floyer always says, "A lot of ways to skin a cat." We've seen companies like, even think about EMC maintained its relevance through M&A for many, many years. George, give us your thought on each of these strategic options? >> Okay, I find this question the most challenging 'cause remember, I used to be an equity research analyst. I worked for Frank Quattrone, we were one of the top tech shops in the banking industry, although this is 20 years ago. But the M&A team was the top team in the industry and everyone wanted them on their side. And I remember going to meetings with these CEOs, where Frank and the bankers would say, "You want us for your M&A work because we can do better." And they really could do better. But in software, it's not like with EMC in hardware because with hardware, it's easier to connect different boxes. With software, the whole point of a software company is to integrate and architect the components so they fit together and reinforce each other, and that makes M&A harder. You can do it, but it takes a long time to fit the pieces together. Let me give you examples. If they put a graph query engine, let's say something like TinkerPop, on top of, I don't even know if it's possible, but let's say they put it on top of Delta Lake, then you have this graph query engine talking to their storage layer, Delta Lake. But if you want to do analysis, you got to put the data in Photon, which is not really ideal for highly connected data. If you license a graph database, then most of your data is in the Delta Lake and how do you sync it with the graph database? If you do sync it, you've got data in two places, which kind of defeats the purpose of having a unified repository. I find this semantic layer option in number three actually more promising, because that's something that you can layer on top of the storage layer that you have already. You just have to figure out then how to have your query engines talk to that. What I'm trying to highlight is, it's easy as an analyst to say, "You can buy this company or license that technology." But the really hard work is making it all work together and that is where the challenge is. >> Yeah, and well look, I thank you for laying that out. We've seen it, certainly Microsoft and Oracle. I guess you might argue that well, Microsoft had a monopoly in its desktop software and was able to throw off cash for a decade plus while it's stock was going sideways. Oracle had won the database wars and had amazing margins and cash flow to be able to do that. Databricks isn't even gone public yet, but I want to close with some of the players to watch. Alex, if you'd bring that back up, number four here. AWS, we talked about some of their options with S3 and it's not just AWS, it's blob storage, object storage. Microsoft, as you sort of alluded to, was an early go-to market channel for Databricks. We didn't address that really. So maybe in the closing comments we can. Google obviously, Snowflake of course, we're going to dissect their options in future Breaking Analysis. Dbt labs, where do they fit? Bob Muglia's company, Relational.ai, why are these players to watch George, in your opinion? 
>> So everyone is trying to assemble and integrate the pieces that would make building data applications, data products easy. And the critical part isn't just assembling a bunch of pieces, which is traditionally what AWS did. It's a Unix ethos, which is we give you the tools, you put 'em together, 'cause you then have the maximum choice and maximum power. So what the hyperscalers are doing is they're taking their key value stores, in the case of AWS it's DynamoDB, in the case of Azure it's Cosmos DB, and each are putting a graph query engine on top of those. So they have a unified storage and graph database engine, like all the data would be collected in the key value store. Then you have a graph database, that's how they're going to be presenting a foundation for building these data apps. dbt Labs is putting a semantic layer on top of data lakes and data warehouses, and as we'll talk about, I'm sure in the future, that makes it easier to swap out the underlying data platform or swap in new ones for specialized use cases. Snowflake, what they're doing, they're so strong in data management and with their transactional tables, what they're trying to do is take in the operational data that used to be in the province of many state stores like MongoDB and say, "If you manage that data with us, it'll be connected to your analytic data without having to send it through a pipeline." And that's hugely valuable. Relational.ai is the wildcard, 'cause what they're trying to do, it's almost like a holy grail, where you're trying to take the expressiveness of connecting all your data in a graph but making it as easy to query as you've always had it in a SQL database, or I should say, in a relational database. And if they do that, it's sort of like, it'll be as easy to program these data apps as a spreadsheet was compared to procedural languages, like BASIC or Pascal. That's the implication of Relational.ai. >> Yeah, and again, we talked before, why can't you just throw this all in memory? We're talking in that example of really getting down to differences in how you lay the data out on disk, in really new database architecture, correct? >> Yes. And that's why it's not clear that you could take a data lake or even a Snowflake, and why you can't put a relational knowledge graph on those. You could potentially put a graph database, but it'll be compromised, because to really do what Relational.ai has done, which is the ease of relational on top of the power of graph, you actually need to change how you're storing your data on disk or even in memory. So you can't, in other words, it's not like, oh, we can add graph support to Snowflake, 'cause if you did that, you'd have to change, or in your data lake, you'd have to change how the data is physically laid out. And then that would break all the tools that talk to that currently. >> What, in your estimation, is the timeframe where this becomes critical for a Databricks and potentially Snowflake and others? I mentioned earlier midterm, are we talking three to five years here? Are we talking end of decade? What's your radar say? >> I think something surprising is going on that's going to sort of come up the tailpipe and take everyone by storm.
All the hype around business intelligence metrics, which is what we used to put in our dashboards where bookings, billings, revenue, customer, those things, those were the key artifacts that used to live in definitions in your BI tools, and DBT has basically created a standard for defining those so they live in your data pipeline or they're defined in their data pipeline and executed in the data warehouse or data lake in a shared way, so that all tools can use them. This sounds like a digression, it's not. All this stuff about data mesh, data fabric, all that's going on is we need a semantic layer and the business intelligence metrics are defining common semantics for your data. And I think we're going to find by the end of this year, that metrics are how we annotate all our analytic data to start adding common semantics to it. And we're going to find this semantic layer, it's not three to five years off, it's going to be staring us in the face by the end of this year. >> Interesting. And of course SVB today was shut down. We're seeing serious tech headwinds, and oftentimes in these sort of downturns or flat turns, which feels like this could be going on for a while, we emerge with a lot of new players and a lot of new technology. George, we got to leave it there. Thank you to George Gilbert for excellent insights and input for today's episode. I want to thank Alex Myerson who's on production and manages the podcast, of course Ken Schiffman as well. Kristin Martin and Cheryl Knight help get the word out on social media and in our newsletters. And Rob Hof is our EIC over at Siliconangle.com, he does some great editing. Remember all these episodes, they're available as podcasts. Wherever you listen, all you got to do is search Breaking Analysis Podcast, we publish each week on wikibon.com and siliconangle.com, or you can email me at David.Vellante@siliconangle.com, or DM me @DVellante. Comment on our LinkedIn post, and please do check out ETR.ai, great survey data, enterprise tech focus, phenomenal. This is Dave Vellante for theCUBE Insights powered by ETR. Thanks for watching, and we'll see you next time on Breaking Analysis.
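To make the metrics discussion above a bit more concrete, here is a minimal, hypothetical sketch of the idea George is describing: metric definitions such as bookings or revenue live in one shared place and get compiled into SQL that runs in the warehouse or data lake, so every downstream tool uses the same definition. The metric names, table, and columns are invented for illustration; this is not dbt's actual specification.

```python
# Hypothetical sketch: business metrics defined once as data, then compiled
# into SQL so any tool can run the same definition in the warehouse or lake.
# Metric names, table, and columns below are illustrative assumptions.

METRICS = {
    "bookings": {"table": "fct_orders", "expr": "SUM(order_amount)", "filter": "status = 'booked'"},
    "revenue":  {"table": "fct_orders", "expr": "SUM(order_amount)", "filter": "status = 'invoiced'"},
}

def compile_metric(name: str, group_by: str = "order_month") -> str:
    """Turn a shared metric definition into a SQL string for the data platform to execute."""
    m = METRICS[name]
    return (
        f"SELECT {group_by}, {m['expr']} AS {name} "
        f"FROM {m['table']} "
        f"WHERE {m['filter']} "
        f"GROUP BY {group_by}"
    )

if __name__ == "__main__":
    # Every BI tool, notebook, or data app that asks for "bookings" gets the same SQL.
    print(compile_metric("bookings"))
```

The point of the sketch is only the shape of the idea: define the metric once in the pipeline, execute it in the platform, and share it everywhere, which is what makes it a common semantic layer rather than a per-tool definition.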
Is Data Mesh the Killer App for Supercloud | Supercloud2
(gentle bright music) >> Okay, welcome back to our "Supercloud 2" event live coverage here at stage performance in Palo Alto syndicating around the world. I'm John Furrier with Dave Vellante. We've got exclusive news and a scoop here for SiliconANGLE and theCUBE. Zhamak Dehghani, creator of data mesh has formed a new company called NextData.com NextData, she's a cube alumni and contributor to our Supercloud initiative, as well as our coverage and breaking analysis with Dave Vellante on data, the killer app for Supercloud. Zhamak, great to see you. Thank you for coming into the studio and congratulations on your newly formed venture and continued success on the data mesh. >> Thank you so much. It's great to be here. Great to see you in person. >> Dave: Yeah, finally. >> John: Wonderful. Your contributions to the data conversation has been well-documented certainly by us and others in the industry. Data mesh taking the world by storm. Some people are debating it, throwing, you know, cold water on it. Some are, I think, it's the next big thing. Tell us about the data mesh super data apps that are emerging out of cloud. >> I mean, data mesh, as you said, it's, you know, the pain point that it surfaced were universal. Everybody said, "Oh, why didn't I think of that?" You know, it was just an obvious next step and people are approaching it, implementing it. I guess the last few years, I've been involved in many of those implementations, and I guess Supercloud is somewhat a prerequisite for it because it's data mesh and building applications using data mesh is about sharing data responsibly across boundaries. And those boundaries include boundaries, organizational boundaries cloud technology boundaries and trust boundaries. >> I want to bring that up because your venture, NextData which is new, just formed. Tell us about that. What wave is that riding? What specifically are you targeting? What's the pain point? >> Zhamak: Absolutely, yes. So next data is the result of, I suppose, the pains that I suffered from implementing a database for many of the organizations. Basically, a lot of organizations that I've worked with, they want decentralized data. So they really embrace this idea of decentralized ownership of the data, but yet they want interconnectivity through standard APIs, yet they want discoverability and governance. So they want to have policies implemented, they want to govern that data, they want to be able to discover that data and yet they want to decentralize it. And we do that with a developer experience that is easy and native to a generalist developer. So we try to find, I guess, the common denominator that solves those problems and enables that developer experience for data sharing. >> John: Since you just announced the news, what's been the reaction? >> Zhamak: I just announced the news right now, so what's the reaction? >> John: But people in the industry that know you, you did a lot of work in the area. What have been some of the feedback on the new venture in terms of the approach, the customers, problem? >> Yeah, so we've been in stealth modes, so we haven't publicly talked about it, but folks that have been close to us in fact have reached out. We already have implementations of our pilot platform with early customers, which is super exciting. And we're going to have multiple of those. Of course, we're a tiny, tiny company. We can have many of those where we are going to have multiple pilots, implementations of our platform in real world. 
These are real global, large-scale organizations that have real-world problems. So we're not going to build our platform in a vacuum. And that's what's happening right now. >> When I think about your role at ThoughtWorks, you had a very wide observation space with a number of clients helping them implement data mesh and other things as well prior to your data mesh initiative. But when I look at data mesh, at least the ones that I've seen, they're very narrow. I think of JPMC, I think of HelloFresh. They're generally, obviously, not surprising. They don't include the big vision of inclusivity across clouds, across different data stores. But it seems like people are having to go through some gymnastics to get to, you know, the organizational reality of decentralizing data, and at least pushing data ownership to the line of business. How are you approaching, or are you approaching, solving that problem? Are you taking a narrow slice? What can you tell us about NextData? >> Zhamak: Sure, yeah, absolutely. Gymnastics, the cute word to describe what the organizations have to go through. And one of those problems is that, you know, the data, as you know, resides on different platforms. It's owned by different people, it's processed by pipelines, and who owns those? So there's this very disparate and disconnected set of technologies that were very useful for when we thought about data and processing as a centralized problem. But when you think about data as a decentralized problem, the cost of integration of these technologies in a cohesive developer experience is what's missing. And we want to focus on that cohesive end-to-end developer experience to share data responsibly in these autonomous units, we call them data products, I guess in data mesh, right? That constitutes computation, the policies that govern that data, discoverability. So I guess, I heard this expression in the last talks, that you can have your cake and eat it too. So we want people to have their cake, which is, you know, data in different places, decentralization, and eat it too, which is interconnected access to it. So we start with standardizing and codifying this idea of a data product container that encapsulates data, computation, APIs to get to it in a technology-agnostic way, in an open way. And then, sit on top and use existing tech, you know, Snowflake, Databricks, whatever exists, you know, the millions of dollars of investments that companies have made, sit on top of those but create this cohesive, integrated experience where data product is a first class primitive. And that's really key here, that the language and the modeling that we use is really native to data mesh: I will make a data product, I'm sharing a data product, and that encapsulates providing metadata about this. I'm providing computation that's constantly changing the data. I'm providing the API for that. So we're trying to kind of codify and create a new developer experience based on that. And developers, both from the provider side and the user side, connected to peer-to-peer data sharing with data product as a primitive first class concept. >> Okay, so the idea would be developers would build applications leveraging those data products, which are discoverable and governed. Now, today you see some companies, you know, take a Snowflake for example. >> Zhamak: Yeah. >> Attempting to do that within their own little walled garden. They even, at one point, used the term, "Mesh." I dunno if they pulled back on that.
And then they sort of became aware of some of your work. But a lot of the things that they're doing within their little insulated environment, you know, support that, that, you know, governance, they're building out an ecosystem. What's different in your vision? >> Exactly. So we realize that, you know, and this is a reality, like you go to organizations, they have a snowflake and half of the organization happily operates on Snowflake. And on the other half, oh, we are on, you know, bare infrastructure on AWS, or we are on Databricks. This is the realities, you know, this Supercloud that's written up here. It's about working across boundaries of technology. So we try to embrace that. And even for our own technology with the way we're building it, we say, "Okay, nobody's going to use next data mesh operating system. People will have different platforms." So you have to build with openness in mind, and in case of Snowflake, I think, you know, they have I'm sure very happy customers as long as customers can be on Snowflake. But once you cross that boundary of platforms then that becomes a problem. And we try to keep that in mind in our solution. >> So, it's worth reviewing that basically, the concept of data mesh is that, whether you're a data lake or a data warehouse, an S3 bucket, an Oracle database as well, they should be inclusive inside of the data. >> We did a session with AWS on the startup showcase, data as code. And remember, I wrote a blog post in 2007 called, "Data's the new developer kit." Back then, they used to call 'em developer kits, if you remember. And that we said at that time, whoever can code data >> Zhamak: Yes. >> Will have a competitive advantage. >> Aren't there machines going to be doing that? Didn't we just hear that? >> Well we have, and you know, Hey Siri, hey Cube. Find me that best video for data mesh. There it is. I mean, this is the point, like what's happening is that, now, data has to be addressable >> Zhamak: Yes. >> For machines and for coding. >> Zhamak: Yes. >> Because as you need to call the data. So the question is, how do you manage the complexity of big things as promiscuous as possible, making it available as well as then governing it because it's a trade off. The more you make open >> Zhamak: Definitely. >> The better the machine learning. >> Zhamak: Yes. >> But yet, the governance issue, so this is the, you need an OS to handle this maybe. >> Yes, well, we call our mental model for our platform is an OS operating system. Operating systems, you know, have shown us how you can kind of abstract what's complex and take care of, you know, a lot of complexities, but yet provide an open and, you know, dynamic enough interface. So we think about it that way. We try to solve the problem of policies live with the data. An enforcement of the policies happens at the most granular level which is, in this concept, the data product. And that would happen whether you read, write, or access a data product. But we can never imagine what are these policies could be. So our thinking is, okay, we should have a open policy framework that can allow organizations write their own policy drivers, and policy definitions, and encode it and encapsulated in this data product container. But I'm not going to fool myself to say that, you know, that's going to solve the problem that you just described. 
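As a rough illustration of the data product container Zhamak describes, here is a minimal sketch of a unit that carries its data transformation, its access API, and the policies that travel with it. Every class, field, and policy below is an assumption made for illustration; it does not reflect NextData's actual platform or APIs.

```python
# Hypothetical sketch of a "data product" as a self-describing unit:
# data + the code that produces it + an output API + the policies that
# travel with it. Nothing here reflects NextData's real implementation.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class DataProduct:
    name: str
    owner_domain: str                                   # e.g. "e-commerce"
    transform: Callable[[List[dict]], List[dict]]       # computation shipped with the data
    policies: List[Callable[[str], bool]] = field(default_factory=list)  # enforced on access
    metadata: Dict[str, str] = field(default_factory=dict)               # for discovery

    def read(self, requester: str, raw_rows: List[dict]) -> List[dict]:
        # Policies are evaluated at the most granular level: the product itself.
        if not all(policy(requester) for policy in self.policies):
            raise PermissionError(f"{requester} is not allowed to read {self.name}")
        return self.transform(raw_rows)

# Example: an orders product that only requesters in the analytics group may read.
orders = DataProduct(
    name="orders",
    owner_domain="e-commerce",
    transform=lambda rows: [r for r in rows if r.get("status") == "complete"],
    policies=[lambda requester: requester.endswith("@analytics")],
    metadata={"description": "Completed orders, refreshed hourly"},
)

print(orders.read("alice@analytics", [{"id": 1, "status": "complete"}]))
```

The design point the sketch tries to capture is that enforcement happens at the data product itself, so governance moves with the data rather than living in a central gatekeeper, regardless of which underlying platform actually stores it.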
I think we are in this, I don't know, if I look into my crystal ball, what I think might happen is that right now, the primitives that we work with to train machine-learning model are still bits and bites in data. They're fields, rows, columns, right? And that creates quite a large surface area, an attack area for, you know, for privacy of the data. So perhaps, one of the trends that we might see is this evolution of data APIs to become more and more computational aware to bring the compute to the data to reduce that surface area so you can really leave the control of the data to the sovereign owners of that data, right? So that data product. So I think the evolution of our data APIs perhaps will become more and more computational. So you describe what you want, and the data owner decides, you know, how to manage the- >> John: That's interesting, Dave, 'cause it's almost like we just talked about ChatGPT in the last segment with you, who's a machine learning, could really been around the industry. It's almost as if you're starting to see reason come into the data, reasoning. It's like you starting to see not just metadata, using the data to reason so that you don't have to expose the raw data. It's almost like a, I won't say curation layer, but an intelligence layer. >> Zhamak: Exactly. >> Can you share your vision on that 'cause that seems to be where the dots are connecting. >> Zhamak: Yes, this is perhaps further into the future because just from where we stand, we have to create still that bridge of familiarity between that future and present. So we are still in that bridge-making mode, however, by just the basic notion of saying, "I'm going to put an API in front of my data, and that API today might be as primitive as a level of indirection as in you tell me what you want, tell me who you are, let me go process that, all the policies and lineage, and insert all of this intelligence that need to happen. And then I will, today, I will still give you a file. But by just defining that API and standardizing it, now we have this amazing extension point that we can say, "Well, the next revision of this API, you not just tell me who you are, but you actually tell me what intelligence you're after. What's a logic that I need to go and now compute on your API?" And you can kind of evolve that, right? Now you have a point of evolution to this very futuristic, I guess, future where you just describe the question that you're asking from the chat. >> Well, this is the Supercloud, Dave. >> I have a question from a fan, I got to get it in. It's George Gilbert. And so, his question is, you're blowing away the way we synchronize data from operational systems to the data stack to applications. So the concern that he has, and he wants your feedback on this, "Is the data product app devs get exposed to more complexity with respect to moving data between data products or maybe it's attributes between data products, how do you respond to that? How do you see, is that a problem or is that something that is overstated, or do you have an answer for that?" >> Zhamak: Absolutely. So I think there's a sweet spot in getting data developers, data product developers closer to the app, but yet not burdening them with the complexity of the application and application logic, and yet reducing their cognitive load by localizing what they need to know about which is that domain where they're operating within. Because what's happening right now? 
what's happening right now is that data engineers, a ton of empathy for them for their high threshold of pain that they can, you know, deal with, they have been centralized, they've put into the data team, and they have been given this unbelievable task of make meaning out of data, put semantic over it, curates it, cleans it, and so on. So what we are saying is that get those folks embedded into the domain closer to the application developers, these are still separately moving units. Your app and your data products are independent but yet tightly closed with each other, tightly coupled with each other based on the context of the domain, so reduce cognitive load by localizing what they need to know about to the domain, get them closer to the application but yet have them them separate from app because app provides a very different service. Transactional data for my e-commerce transaction, data product provides a very different service, longitudinal data for the, you know, variety of this intelligent analysis that I can do on the data. But yet, it's all within the domain of e-commerce or sales or whatnot. >> So a lot of decoupling and coupling create that cohesiveness. >> Zhamak: Absolutely. >> Architecture. So I have to ask you, this is an interesting question 'cause it came up on theCUBE all last year. Back on the old server, data center days and cloud, SRE, Google coined the term, "Site Reliability Engineer" for someone to look over the hundreds of thousands of servers. We asked a question to data engineering community who have been suffering, by the way, agree. Is there an SRE-like role for data? Because in a way, data engineering, that platform engineer, they are like the SRE for data. In other words, managing the large scale to enable automation and cell service. What's your thoughts and reaction to that? >> Zhamak: Yes, exactly. So, maybe we go through that history of how SRE came to be. So we had the first DevOps movement which was, remove the wall between dev and ops and bring them together. So you have one cross-functional units of the organization that's responsible for, you build it you run it, right? So then there is no, I'm going to just shoot my application over the wall for somebody else to manage it. So we did that, and then we said, "Okay, as we decentralized and had this many microservices running around, we had to create a layer that abstracted a lot of the complexity around running now a lot or monitoring, observing and running a lot while giving autonomy to this cross-functional team." And that's where the SRE, a new generation of engineers came to exist. So I think if I just look- >> Hence Borg, hence Kubernetes. >> Hence, hence, exactly. Hence chaos engineering, hence embracing the complexity and messiness, right? And putting engineering discipline to embrace that and yet give a cohesive and high integrity experience of those systems. So I think, if we look at that evolution, perhaps something like that is happening by bringing data and apps closer and make them these domain-oriented data product teams or domain oriented cross-functional teams, full stop, and still have a very advanced maybe at the platform infrastructure level kind of operational team that they're not busy doing two jobs which is taking care of domains and the infrastructure, but they're building infrastructure that is embracing that complexity, interconnectivity of this data process. >> John: So you see similarities. 
>> Absolutely, but I feel like we're probably in a more early days of that movement. >> So it's a data DevOps kind of thing happening where scales happening. It's good things are happening yet. Eh, a little bit fast and loose with some complexities to clean up. >> Yes, yes. This is a different restructure. As you said we, you know, the job of this industry as a whole on architects is decompose, recompose, decompose, recomposing a new way, and now we're like decomposing centralized team, recomposing them as domains and- >> John: So is data mesh the killer app for Supercloud? >> You had to do this for me. >> Dave: Sorry, I couldn't- (John and Dave laughing) >> Zhamak: What do you want me to say, Dave? >> John: Yes. >> Zhamak: Yes of course. >> I mean Supercloud, I think it's, really the terminology's Supercloud, Opencloud. But I think, in spirits of it, this embracing of diversity and giving autonomy for people to make decisions for what's right for them and not yet lock them in. I think just embracing that is baked into how data mesh assume the world would work. >> John: Well thank you so much for coming on Supercloud too, really appreciate it. Data has driven this conversation. Your success of data mesh has really opened up the conversation and exposed the slow moving data industry. >> Dave: Been a great catalyst. (John laughs) >> John: That's now going well. We can move faster, so thanks for coming on. >> Thank you for hosting me. It was wonderful. >> Okay, Supercloud 2 live here in Palo Alto. Our stage performance, I'm John Furrier with Dave Vellante. We're back with more after this short break, Stay with us all day for Supercloud 2. (gentle bright music)
Discussion about Walmart's Approach | Supercloud2
(upbeat electronic music) >> Okay, welcome back to Supercloud 2, live here in Palo Alto. I'm John Furrier, with Dave Vellante. Again, all day wall-to-wall coverage, just had a great interview with Walmart, we've got a Next interview coming up, you're going to hear from Bob Muglia and Tristan Handy, two experts, both experienced entrepreneurs, executives in technology. We're here to break down what just happened with Walmart, and what's coming up with George Gilbert, former colleague, Wikibon analyst, Gartner Analyst, and now independent investor and expert. George, great to see you, I know you're following this space. Like you read about it, remember the first days when Dataverse came out, we were talking about them coming out of Berkeley? >> Dave: Snowflake. >> John: Snowflake. >> Dave: Snowflake In the early days. >> We, collectively, have been chronicling the data movement since 2010, you were part of our team, now you've got your nose to the grindstone, you're seeing the next wave. What's this all about? Walmart building their own super cloud, we got Bob Muglia talking about how these next wave of apps are coming. What are the super apps? What's the super cloud to you? >> Well, this key's off Dave's really interesting questions to Walmart, which was like, how are they building their supercloud? 'Cause it makes a concrete example. But what was most interesting about his description of the Walmart WCMP, I forgot what it stood for. >> Dave: Walmart Cloud Native Platform. >> Walmart, okay. He was describing where the logic could run in these stateless containers, and maybe eventually serverless functions. But that's just it, and that's the paradigm of microservices, where the logic is in this stateless thing, where you can shoot it, or it fails, and you can spin up another one, and you've lost nothing. >> That was their triplet model. >> Yeah, in fact, and that was what they were trying to move to, where these things move fluidly between data centers. >> But there's a but, right? Which is they're all stateless apps in the cloud. >> George: Yeah. >> And all their stateful apps are on-prem and VMs. >> Or the stateful part of the apps are in VMs. >> Okay. >> And so if they really want to lift their super cloud layer off of this different provider's infrastructure, they're going to need a much more advanced software platform that manages data. And that goes to the -- >> Muglia and Handy, that you and I did, that's coming up next. So the big takeaway there, George, was, I'll set it up and you can chime in, a new breed of data apps is emerging, and this highly decentralized infrastructure. And Tristan Handy of DBT Labs has a sort of a solution to begin the journey today, Muglia is working on something that's way out there, describe what you learned from it. >> Okay. So to talk about what the new data apps are, and then the platform to run them, I go back to the using what will probably be seen as one of the first data app examples, was Uber, where you're describing entities in the real world, riders, drivers, routes, city, like a city plan, these are all defined by data. And the data is described in a structure called a knowledge graph, for lack of a, no one's come up with a better term. But that means the tough, the stuff that Jack built, which was all stateless and sits above cloud vendors' infrastructure, it needs an entirely different type of software that's much, much harder to build. 
And the way Bob described it is, you're going to need an entirely new data management infrastructure to handle this. But where, you know, we had this really colorful interview where it was like Rock 'Em Sock 'Em, but they weren't really that much in opposition to each other, because Tristan is going to define this layer, starting with like business intelligence metrics, where you're defining things like bookings, billings, and revenue, in business terms, not in SQL terms -- >> Well, business terms, if I can interrupt, he said the one thing we haven't figured out how to APIify is KPIs that sit inside of a data warehouse, and that's essentially what he's doing. >> George: That's what he's doing, yes. >> Right. And so then you can now expose those APIs, those KPIs, that sit inside of a data warehouse, or a data lake, a data store, whatever, through APIs. >> George: And the difference -- >> So what does that do for you? >> Okay, so all of a sudden, instead of working at technical data terms, where you're dealing with tables and columns and rows, you're dealing instead with business entities, using the Uber example of drivers, riders, routes, you know, ETA prices. But you can define, DBT will be able to define those progressively in richer terms, today they're just doing things like bookings, billings, and revenue. But Bob's point was, today, the data warehouse that actually runs that stuff, whereas DBT defines it, the data warehouse that runs it, you can't do it with relational technology >> Dave: Relational totality, cashing architecture. >> SQL, you can't -- >> SQL caching architectures in memory, you can't do it, you've got to rethink down to the way the data lake is laid out on the disk or cache. Which by the way, Thomas Hazel, who's speaking later, he's the chief scientist and founder at Chaos Search, he says, "I've actually done this," basically leave it in an S3 bucket, and I'm going to query it, you know, with no caching. >> All right, so what I hear you saying then, tell me if I got this right, there are some some things that are inadequate in today's world, that's not compatible with the Supercloud wave. >> Yeah. >> Specifically how you're using storage, and data, and stateful. >> Yes. >> And then the software that makes it run, is that what you're saying? >> George: Yeah. >> There's one other thing you mentioned to me, it's like, when you're using a CRM system, a human is inputting data. >> George: Nothing happens till the human does something. >> Right, nothing happens until that data entry occurs. What you're talking about is a world that self forms, polling data from the transaction system, or the ERP system, and then builds a plan without human intervention. >> Yeah. Something in the real world happens, where the user says, "I want a ride." And then the software goes out and says, "Okay, we got to match a driver to the rider, we got to calculate how long it takes to get there, how long to deliver 'em." That's not driven by a form, other than the first person hitting a button and saying, "I want a ride." All the other stuff happens autonomously, driven by data and analytics. >> But my question was different, Dave, so I want to get specific, because this is where the startups are going to come in, this is the disruption. Snowflake is a data warehouse that's in the cloud, they call it a data cloud, they refactored it, they did it differently, the success, we all know it looks like. 
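To ground the knowledge graph idea in something concrete, here is a toy sketch of the Uber-style entities George mentions, held in an in-memory graph with a multi-hop traversal. A real system would push this into a graph or relational-knowledge-graph engine; the entities and relationships below are invented purely for illustration.

```python
# Toy sketch: business entities (riders, drivers, routes) as a small in-memory
# knowledge graph, with a multi-hop traversal that would otherwise be a chain
# of SQL joins. Entity names and edges are made up for illustration.

from collections import defaultdict, deque

edges = defaultdict(list)

def relate(subject, predicate, obj):
    edges[subject].append((predicate, obj))

# Describe a tiny slice of the "Uber landscape" as data.
relate("rider:ana",  "requested", "trip:42")
relate("trip:42",    "assigned",  "driver:bo")
relate("driver:bo",  "drives_in", "city:sf")
relate("city:sf",    "has_zone",  "zone:mission")

def reachable(start, max_hops=4):
    """Everything connected to `start` within max_hops -- one traversal, not N joins."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for predicate, neighbor in edges[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen

print(reachable("rider:ana"))  # rider -> trip -> driver -> city -> zone
```

The contrast with a warehouse is that each hop here would typically be another join in SQL, which is the scaling problem Muglia is pointing at when he says the engine that runs these definitions has to be different.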
These areas where it's inadequate for the future are areas that'll probably be either disrupted, or refactored. What is that? >> That's what Muglia's contention is, that the DBT can start adding that layer where you define these business entities, they're like mini digital twins, you can define them, but the data warehouse isn't strong enough to actually manage and run them. And Muglia is behind a company that is rethinking the database, really in a fundamental way that hasn't been done in 40 or 50 years. It's the first, in his contention, the first real rethink of database technology in a fundamental way since the rise of the relational database 50 years ago. >> And I think you admit it's a real Hail Mary, I mean it's quite a long shot right? >> George: Yes. >> Huge potential. >> But they're pretty far along. >> Well, we've been talking on theCUBE for 12 years, and what, 10 years going to AWS Reinvent, Dave, that no one database will rule the world, Amazon kind of showed that with them. What's different, is it databases are changing, or you can have multiple databases, or? >> It's a good question. And the reason we've had multiple different types of databases, each one specialized for a different type of workload, but actually what Muglia is behind is a new engine that would essentially, you'll never get rid of the data warehouse, or the equivalent engine in like a Databricks datalake house, but it's a new engine that manages the thing that describes all the data and holds it together, and that's the new application platform. >> George, we have one minute left, I want to get real quick thought, you're an investor, and we know your history, and the folks watching, George's got a deep pedigree in investment data, and we can testify against that. If you're going to invest in a company right now, if you're a customer, I got to make a bet, what does success look like for me, what do I want walking through my door, and what do I want to send out? What companies do I want to look at? What's the kind of of vendor do I want to evaluate? Which ones do I want to send home? >> Well, the first thing a customer really has to do when they're thinking about next gen applications, all the people have told you guys, "we got to get our data in order," getting that data in order means building an integrated view of all your data landscape, which is data coming out of all your applications. It starts with the data model, so, today, you basically extract data from all your operational systems, put it in this one giant, central place, like a warehouse or lake house, but eventually you want this, whether you call it a fabric or a mesh, it's all the data that describes how everything hangs together as in one big knowledge graph. There's different ways to implement that. And that's the most critical thing, 'cause that describes your Uber landscape, your Uber platform. >> That's going to power the digital transformation, which will power the business transformation, which powers the business model, which allows the builders to build -- >> Yes. >> Coders to code. That's Supercloud application. >> Yeah. >> George, great stuff. Next interview you're going to see right here is Bob Muglia and Tristan Handy, they're going to unpack this new wave. Great segment, really worth unpacking and reading between the lines with George, and Dave Vellante, and those two great guests. And then we'll come back here for the studio for more of the live coverage of Supercloud 2. Thanks for watching. (upbeat electronic music)
Breaking Analysis: Enterprise Technology Predictions 2023
(upbeat music beginning) >> From the Cube Studios in Palo Alto and Boston, bringing you data-driven insights from the Cube and ETR, this is "Breaking Analysis" with Dave Vellante. >> Making predictions about the future of enterprise tech is more challenging if you strive to lay down forecasts that are measurable. In other words, if you make a prediction, you should be able to look back a year later and say, with some degree of certainty, whether the prediction came true or not, with evidence to back that up. Hello and welcome to this week's Wikibon Cube Insights, powered by ETR. In this breaking analysis, we aim to do just that, with predictions about the macro IT spending environment, cost optimization, security, lots to talk about there, generative AI, cloud, and of course supercloud, blockchain adoption, data platforms, including commentary on Databricks, snowflake, and other key players, automation, events, and we may even have some bonus predictions around quantum computing, and perhaps some other areas. To make all this happen, we welcome back, for the third year in a row, my colleague and friend Eric Bradley from ETR. Eric, thanks for all you do for the community, and thanks for being part of this program. Again. >> I wouldn't miss it for the world. I always enjoy this one. Dave, good to see you. >> Yeah, so let me bring up this next slide and show you, actually come back to me if you would. I got to show the audience this. These are the inbounds that we got from PR firms starting in October around predictions. They know we do prediction posts. And so they'll send literally thousands and thousands of predictions from hundreds of experts in the industry, technologists, consultants, et cetera. And if you bring up the slide I can show you sort of the pattern that developed here. 40% of these thousands of predictions were from cyber. You had AI and data. If you combine those, it's still not close to cyber. Cost optimization was a big thing. Of course, cloud, some on DevOps, and software. Digital... Digital transformation got, you know, some lip service and SaaS. And then there was other, it's kind of around 2%. So quite remarkable, when you think about the focus on cyber, Eric. >> Yeah, there's two reasons why I think it makes sense, though. One, the cybersecurity companies have a lot of cash, so therefore the PR firms might be working a little bit harder for them than some of their other clients. (laughs) And then secondly, as you know, for multiple years now, when we do our macro survey, we ask, "What's your number one spending priority?" And again, it's security. It just isn't going anywhere. It just stays at the top. So I'm actually not that surprised by that little pie chart there, but I was shocked that SaaS was only 5%. You know, going back 10 years ago, that would've been the only thing anyone was talking about. >> Yeah. So true. All right, let's get into it. First prediction, we always start with kind of tech spending. Number one is tech spending increases between four and 5%. ETR has currently got it at 4.6% coming into 2023. This has been a consistently downward trend all year. We started, you know, much, much higher as we've been reporting. Bottom line is the fed is still in control. They're going to ease up on tightening, is the expectation, they're going to shoot for a soft landing. But you know, my feeling is this slingshot economy is going to continue, and it's going to continue to confound, whether it's supply chains or spending. 
The, the interesting thing about the ETR data, Eric, and I want you to comment on this, the largest companies are the most aggressive to cut. They're laying off, smaller firms are spending faster. They're actually growing at a much larger, faster rate as are companies in EMEA. And that's a surprise. That's outpacing the US and APAC. Chime in on this, Eric. >> Yeah, I was surprised on all of that. First on the higher level spending, we are definitely seeing it coming down, but the interesting thing here is headlines are making it worse. The huge research shop recently said 0% growth. We're coming in at 4.6%. And just so everyone knows, this is not us guessing, we asked 1,525 IT decision-makers what their budget growth will be, and they came in at 4.6%. Now there's a huge disparity, as you mentioned. The Fortune 500, global 2000, barely at 2% growth, but small, it's at 7%. So we're at a situation right now where the smaller companies are still playing a little bit of catch up on digital transformation, and they're spending money. The largest companies that have the most to lose from a recession are being more trepidatious, obviously. So they're playing a "Wait and see." And I hope we don't talk ourselves into a recession. Certainly the headlines and some of their research shops are helping it along. But another interesting comment here is, you know, energy and utilities used to be called an orphan and widow stock group, right? They are spending more than anyone, more than financials insurance, more than retail consumer. So right now it's being driven by mid, small, and energy and utilities. They're all spending like gangbusters, like nothing's happening. And it's the rest of everyone else that's being very cautious. >> Yeah, so very unpredictable right now. All right, let's go to number two. Cost optimization remains a major theme in 2023. We've been reporting on this. You've, we've shown a chart here. What's the primary method that your organization plans to use? You asked this question of those individuals that cited that they were going to reduce their spend and- >> Mhm. >> consolidating redundant vendors, you know, still leads the way, you know, far behind, cloud optimization is second, but it, but cloud continues to outpace legacy on-prem spending, no doubt. Somebody, it was, the guy's name was Alexander Feiglstorfer from Storyblok, sent in a prediction, said "All in one becomes extinct." Now, generally I would say I disagree with that because, you know, as we know over the years, suites tend to win out over, you know, individual, you know, point products. But I think what's going to happen is all in one is going to remain the norm for these larger companies that are cutting back. They want to consolidate redundant vendors, and the smaller companies are going to stick with that best of breed and be more aggressive and try to compete more effectively. What's your take on that? >> Yeah, I'm seeing much more consolidation in vendors, but also consolidation in functionality. We're seeing people building out new functionality, whether it's, we're going to talk about this later, so I don't want to steal too much of our thunder right now, but data and security also, we're seeing a functionality creep. So I think there's further consolidation happening here. I think niche solutions are going to be less likely, and platform solutions are going to be more likely in a spending environment where you want to reduce your vendors. You want to have one bill to pay, not 10. 
Another thing on this slide, real quick if I can before I move on, is we had a bunch of people write in and some of the answer options that aren't on this graph but did get cited a lot, unfortunately, is the obvious reduction in staff, hiring freezes, and delaying hardware, were three of the top write-ins. And another one was offshore outsourcing. So in addition to what we're seeing here, there were a lot of write-in options, and I just thought it would be important to state that, but essentially the cost optimization is by and far the highest one, and it's growing. So it's actually increased in our citations over the last year. >> And yeah, specifically consolidating redundant vendors. And so I actually thank you for bringing that other up, 'cause I had asked you, Eric, is there any evidence that repatriation is going on and we don't see it in the numbers, we don't see it even in the other, there was, I think very little or no mention of cloud repatriation, even though it might be happening in this in a smattering. >> Not a single mention, not one single mention. I went through it for you. Yep. Not one write-in. >> All right, let's move on. Number three, security leads M&A in 2023. Now you might say, "Oh, well that's a layup," but let me set this up Eric, because I didn't really do a great job with the slide. I hid the, what you've done, because you basically took, this is from the emerging technology survey with 1,181 responses from November. And what we did is we took Palo Alto and looked at the overlap in Palo Alto Networks accounts with these vendors that were showing on this chart. And Eric, I'm going to ask you to explain why we put a circle around OneTrust, but let me just set it up, and then have you comment on the slide and take, give us more detail. We're seeing private company valuations are off, you know, 10 to 40%. We saw a sneak, do a down round, but pretty good actually only down 12%. We've seen much higher down rounds. Palo Alto Networks we think is going to get busy. Again, they're an inquisitive company, they've been sort of quiet lately, and we think CrowdStrike, Cisco, Microsoft, Zscaler, we're predicting all of those will make some acquisitions and we're thinking that the targets are somewhere in this mess of security taxonomy. Other thing we're predicting AI meets cyber big time in 2023, we're going to probably going to see some acquisitions of those companies that are leaning into AI. We've seen some of that with Palo Alto. And then, you know, your comment to me, Eric, was "The RSA conference is going to be insane, hopping mad, "crazy this April," (Eric laughing) but give us your take on this data, and why the red circle around OneTrust? Take us back to that slide if you would, Alex. >> Sure. There's a few things here. First, let me explain what we're looking at. So because we separate the public companies and the private companies into two separate surveys, this allows us the ability to cross-reference that data. So what we're doing here is in our public survey, the tesis, everyone who cited some spending with Palo Alto, meaning they're a Palo Alto customer, we then cross-reference that with the private tech companies. Who also are they spending with? So what you're seeing here is an overlap. These companies that we have circled are doing the best in Palo Alto's accounts. Now, Palo Alto went and bought Twistlock a few years ago, which this data slide predicted, to be quite honest. And so I don't know if they necessarily are going to go after Snyk. Snyk, sorry. 
They already have something in that space. What they do need, however, is more on the authentication space. So I'm looking at OneTrust, with a 45% overlap in their overall net sentiment. That is a company that's already existing in their accounts and could be very synergistic to them. BeyondTrust as well, authentication identity. This is something that Palo needs to do to move more down that zero trust path. Now why did I pick Palo first? Because usually they're very inquisitive. They've been a little quiet lately. Secondly, if you look at the backdrop in the markets, the IPO freeze isn't going to last forever. Sooner or later, the IPO markets are going to open up, and some of these private companies are going to tap into public equity. In the meantime, however, cash funding on the private side is drying up. If they need another round, they're not going to get it, and they're certainly not going to get it at the valuations they were getting. So we're seeing valuations maybe come down where they're a touch more attractive, and Palo knows this isn't going to last forever. Cisco knows that, CrowdStrike, Zscaler, all these companies that are trying to make a push to become that vendor that you're consolidating in, around, they have a chance now, they have a window where they need to go make some acquisitions. And that's why I believe leading up to RSA, we're going to see some movement. I think it's going to pretty, a really exciting time in security right now. >> Awesome. Thank you. Great explanation. All right, let's go on the next one. Number four is, it relates to security. Let's stay there. Zero trust moves from hype to reality in 2023. Now again, you might say, "Oh yeah, that's a layup." A lot of these inbounds that we got are very, you know, kind of self-serving, but we always try to put some meat in the bone. So first thing we do is we pull out some commentary from, Eric, your roundtable, your insights roundtable. And we have a CISO from a global hospitality firm says, "For me that's the highest priority." He's talking about zero trust because it's the best ROI, it's the most forward-looking, and it enables a lot of the business transformation activities that we want to do. CISOs tell me that they actually can drive forward transformation projects that have zero trust, and because they can accelerate them, because they don't have to go through the hurdle of, you know, getting, making sure that it's secure. Second comment, zero trust closes that last mile where once you're authenticated, they open up the resource to you in a zero trust way. That's a CISO of a, and a managing director of a cyber risk services enterprise. Your thoughts on this? >> I can be here all day, so I'm going to try to be quick on this one. This is not a fluff piece on this one. There's a couple of other reasons this is happening. One, the board finally gets it. Zero trust at first was just a marketing hype term. Now the board understands it, and that's why CISOs are able to push through it. And what they finally did was redefine what it means. Zero trust simply means moving away from hardware security, moving towards software-defined security, with authentication as its base. The board finally gets that, and now they understand that this is necessary and it's being moved forward. The other reason it's happening now is hybrid work is here to stay. We weren't really sure at first, large companies were still trying to push people back to the office, and it's going to happen. 
The pendulum will swing back, but hybrid work's not going anywhere. By basically on our own data, we're seeing that 69% of companies expect remote and hybrid to be permanent, with only 30% permanent in office. Zero trust works for a hybrid environment. So all of that is the reason why this is happening right now. And going back to our previous prediction, this is why we're picking Palo, this is why we're picking Zscaler to make these acquisitions. Palo Alto needs to be better on the authentication side, and so does Zscaler. They're both fantastic on zero trust network access, but they need the authentication software defined aspect, and that's why we think this is going to happen. One last thing, in that CISO round table, I also had somebody say, "Listen, Zscaler is incredible. "They're doing incredibly well pervading the enterprise, "but their pricing's getting a little high," and they actually think Palo Alto is well-suited to start taking some of that share, if Palo can make one move. >> Yeah, Palo Alto's consolidation story is very strong. Here's my question and challenge. Do you and me, so I'm always hardcore about, okay, you've got to have evidence. I want to look back at these things a year from now and say, "Did we get it right? Yes or no?" If we got it wrong, we'll tell you we got it wrong. So how are we going to measure this? I'd say a couple things, and you can chime in. One is just the number of vendors talking about it. That's, but the marketing always leads the reality. So the second part of that is we got to get evidence from the buying community. Can you help us with that? >> (laughs) Luckily, that's what I do. I have a data company that asks thousands of IT decision-makers what they're adopting and what they're increasing spend on, as well as what they're decreasing spend on and what they're replacing. So I have snapshots in time over the last 11 years where I can go ahead and compare and contrast whether this adoption is happening or not. So come back to me in 12 months and I'll let you know. >> Now, you know, I will. Okay, let's bring up the next one. Number five, generative AI hits where the Metaverse missed. Of course everybody's talking about ChatGPT, we just wrote last week in a breaking analysis with John Furrier and Sarjeet Joha our take on that. We think 2023 does mark a pivot point as natural language processing really infiltrates enterprise tech just as Amazon turned the data center into an API. We think going forward, you're going to be interacting with technology through natural language, through English commands or other, you know, foreign language commands, and investors are lining up, all the VCs are getting excited about creating something competitive to ChatGPT, according to (indistinct) a hundred million dollars gets you a seat at the table, gets you into the game. (laughing) That's before you have to start doing promotion. But he thinks that's what it takes to actually create a clone or something equivalent. We've seen stuff from, you know, the head of Facebook's, you know, AI saying, "Oh, it's really not that sophisticated, ChatGPT, "it's kind of like IBM Watson, it's great engineering, "but you know, we've got more advanced technology." We know Google's working on some really interesting stuff. But here's the thing. ETR just launched this survey for the February survey. It's in the field now. We circle open AI in this category. They weren't even in the survey, Eric, last quarter. 
So 52% of the ETR survey respondents indicated a positive sentiment toward OpenAI. I added up all the sort of different bars, we could double click on that. And then I got this inbound from Scott Stephenson of Deepgram. He said "AI is recession-proof." I don't know if that's the case, but it's a good quote. So bring this back up and take us through this. Explain this chart for us, if you would. >> First of all, I like Scott's quote better than the Facebook one. I think that's some sour grapes. Meta just spent an insane amount of money on the Metaverse and that's a dud. Microsoft just spent money on OpenAI and it is hot, undoubtedly hot. We've only been in the field with our current ETS survey for a week. So my caveat is it's preliminary data, but I don't care if it's preliminary data. (laughing) We're getting a sneak peek here at what is the number one net sentiment and mindshare leader in the entire machine-learning AI sector within a week. It's beating Data- >> 600. 600 responses in. >> It's beating Databricks. And we all know Databricks is a huge established enterprise company, not only in machine-learning AI, but it's in the top 10 in the entire survey. We have over 400 vendors in this survey. It's number eight overall, already. In a week. This is not hype. This is real. And I could go on the NLP stuff for a while. Not only are we seeing it here in OpenAI and machine learning and AI, but we're seeing NLP in security. It's huge in email security. It's completely transforming that area. It's one of the reasons I thought Palo might take Abnormal out. They're doing such a great job with NLP on this email side, and also in the data prep tools. NLP is going to take out data prep tools. If we have time, I'll discuss that later. But yeah, this is, to me this is a no-brainer, and we're already seeing it in the data. >> Yeah, John Furrier called the ChatGPT introduction, you know, the Netscape moment. He said it reminded him of when we all first saw Netscape Navigator and went, "Wow, it really could be transformative." All right, number six, the cloud expands to supercloud as edge computing accelerates, and Cloudflare is a big winner in 2023. We've reported obviously on cloud, multi-cloud, supercloud and Cloudflare, basically saying what multi-cloud should have been. We pulled this quote from Atif Khan, who is the founder and CTO of Alkira, thanks, one of the inbounds, thank you. "In 2023, highly distributed IT environments will become more the norm as organizations increasingly deploy hybrid cloud, multi-cloud and edge settings..." Eric, from one of your round tables: "If my sources from edge computing are coming from the cloud, that means I have my workloads running in the cloud. There is no one better than Cloudflare." That's a senior director of IT architecture at a huge financial firm. And then your analysis shows Cloudflare really growing in pervasion, that sort of market presence in the dataset, dramatically, to near 20%, and I think you had told me that they're even ahead of Google Cloud in terms of momentum right now.
This is the number one leader in SASE, web application firewall, DDoS, bot protection, by your definition of supercloud, which we just did a couple of weeks ago, and I really enjoyed that by the way, Dave. I think Cloudflare is the one that fits your definition best, because it's bringing all of these aspects together, and most importantly, it's cloud agnostic. It does not need to rely on Azure or AWS to do this. It has its own cloud. So I just think, when we look at your definition of supercloud, Cloudflare is the poster child. >> You know, what's interesting about that too, is a lot of people are poo-pooing Cloudflare, "Ah, it's, you know, really kind of not that sophisticated. You don't have as many tools," but to your point, you can have those tools in the cloud. Cloudflare's doing serverless on steroids, trying to keep things really simple, doing a phenomenal job at, you know, various locations around the world. And they're definitely one to watch. Somebody put them on my radar (laughing) a while ago and said, "Dave, you got to do a breaking analysis on Cloudflare." And so I want to thank that person. I can't really name them, 'cause they work inside of a giant hyperscaler. But- (Eric laughing) (Dave chuckling) >> Real quickly, if I can, from a competitive perspective too, who else is there? They've already taken share from Akamai, and Fastly is really their only other direct comp, and they're not there. And these guys are in pole position and they're the only game in town right now. I just, I don't see it slowing down. >> I thought one of your comments from your roundtable I was reading, one of the folks said, you know, Cloudflare, if my workloads are in the cloud, they are, you know, dominant, they said, not as strong with on-prem. And so Akamai is doing better there. I'm like, "Okay, where would you want to be?" (laughing) >> Yeah, which one of those two would you rather be? >> Right? Anyway, all right, let's move on. Number seven, blockchain continues to look for a home in the enterprise, but devs will slowly begin to adopt in 2023. You know, blockchains have got a lot of buzz, obviously crypto is, you know, the killer app for blockchain. A senior IT architect in financial services, from one of your insight roundtables, said, quote, "For enterprises to adopt a new technology, there have to be proven turnkey solutions. My experience in talking with my peers is, blockchain is still an open-source component where you have to build around it." Now I want to thank Ravi Mayuram, who's the CTO of Couchbase, who sent in, you know, one of the predictions. He said, "DevOps will adopt blockchain, specifically Ethereum." And he referenced actually in his email to me Solidity, which is the programming language for Ethereum: "will be in every DevOps pro's playbook, mirroring the boom in machine-learning. Newer programming languages like Solidity will enter the toolkits of devs." His point there, you know, Solidity, for those of you who don't know, you know, Bitcoin is not programmable. Solidity, you know, came out and that was their whole shtick, and they've been improving that, and so forth. But Eric, it's true, it really hasn't found its home despite, you know, the potential for smart contracts. IBM's pushing it, VMware has had announcements, and others, but it really hasn't found its way into the enterprise yet. >> Yeah, and I got to be honest, I don't think it's going to, either.
So when we did our top trends series, this was basically chosen as an anti-prediction, I would guess, that it just continues to not gain hold. And the reason why was that first comment, right? It's very much a niche solution that requires a ton of custom work around it. You can't just plug and play it. And at the end of the day, let's be very real about what this technology is: it's a database ledger, and we already have database ledgers in the enterprise. So why is this a priority to move to a different database ledger? It's going to be very niche cases. I like the CTO comment from Couchbase about it being adopted by DevOps. I agree with that, but it has to be DevOps in a very specific use case, and a very sophisticated use case in financial services, most likely. And that's not across the entire enterprise. So I just think it's still going to struggle to get its foothold for a little bit longer, if ever. >> Great, thanks. Okay, let's move on. Number eight: AWS, Databricks, Google and Snowflake lead the data charge, with Microsoft keeping it simple. So let's unpack this a little bit. This is the shared accounts peer position. I pulled data platforms in for analytics, machine learning and AI, and database, so I could grab all these accounts or these vendors and see how they compare in those three sectors: analytics, machine learning and database. Snowflake and Databricks, you know, they're on a crash course, as you and I have talked about. They're battling to be the single source of truth in analytics. There's going to be a big focus, and it's already started; it's going to be accelerated in 2023 on open formats. Iceberg, Python, you know, they're all the rage. We heard about Iceberg at Snowflake Summit, last summer or last June. Not a lot of people had heard of it, but of course the Databricks crowd knows it well. A lot of other open source tooling. There's a company called DBT Labs, which you're going to talk about in a minute. George Gilbert put them on our radar. We just had Tristan Handy, the CEO of DBT Labs, on at Supercloud last week. They are a new disruptor in data. They're essentially API-ifying, if you will, KPIs inside the data warehouse and dramatically simplifying that whole data pipeline. So really, you know, the ETL guys should be shaking in their boots with them. Coming back to the slide. Google really remains focused on BigQuery adoption. Customers have complained to me that they would like to use Snowflake with Google's AI tools, but they're being forced to go to BigQuery. I got to ask Google about that. AWS continues to stitch together its bespoke data stores; it's gone down that "right tool for the right job" path. David Floyer two years ago said, "AWS absolutely is going to have to solve that problem." We saw them start to do it at re:Invent, bringing zero ETL to Aurora and Redshift, and really trying to simplify those worlds. There's going to be more of that. And then Microsoft, they're just making it cheap and easy to use their stuff, you know, despite some of the complaints that we hear in the community, you know, about things like Cosmos, but Eric, your take? >> Yeah, my concern here is that Snowflake and Databricks are fighting each other, and it's allowing AWS and Microsoft to kind of catch up against them, and I don't know if that's the right move for either of those two companies individually. Azure and AWS are building out functionality. Are they as good? No, they're not.
The other thing to remember too is that AWS and Azure get paid anyway, because both Databricks and Snowflake run on top of 'em. So (laughing) they're basically collecting their toll while these two fight it out with each other, and they build out functionality. I think they need to stop focusing on each other, a little bit, and think about the overall strategy. Now for Databricks, we know they came out first as a machine-learning AI tool. They were known better for that spot, and now they're really trying to play catch-up on that data storage compute spot, and inversely for Snowflake, they were killing it with the compute separation from storage, and now they're trying to get into the ML/AI spot. I actually wouldn't be surprised to see them make some sort of acquisition. Frank Slootman has been a little bit quiet, in my opinion, there. The other thing to mention is your comment about DBT Labs. If we look at our emerging technology survey, last survey when this came out, DBT Labs was the number one leader in that data integration space, I'm going to just pull it up real quickly. It looks like they had a 33% overall net sentiment to lead data analytics integration. So they are clearly growing; it's the fourth consecutive survey in which they've grown. The other name we're seeing there a little bit is Cribl, but DBT Labs is by far the number one player in this space. >> All right. Okay, cool. Moving on, let's go to number nine: automation makes a resurgence in 2023. We're showing, again, data. The x axis is overlap or presence in the dataset, and the vertical axis is shared net score. Net score is a measure of spending momentum. As always, you've seen UiPath and Microsoft Power Automate up and to the right; that red line, that 40% line, is generally considered elevated. UiPath is really separating, creating some distance from Automation Anywhere; they, you know, in previous quarters they were much closer. Microsoft Power Automate came on the scene in a big way; they loom large with this "good enough" approach. I will say this, somebody sent me the results of a (indistinct) survey, which showed UiPath actually had more mentions than Power Automate, which was surprising, but I think that's not been the case in the ETR data set. We're definitely seeing a shift from back office to front office kind of workloads. Having said that, software testing is emerging as a mainstream use case, we're seeing ML and AI become embedded in end-to-end automations, and low-code is serving the line of business. And so this, we think, is going to increasingly have appeal to organizations in the coming year, who want to automate as much as possible and not necessarily, we've seen a lot of layoffs in tech, and people... You're going to have to fill the gaps with automation. That's a trend that's going to continue. >> Yep, agreed. First, that comment about Microsoft Power Automate having fewer citations than UiPath, that's shocking to me. I'm looking at my chart right here where Microsoft Power Automate was cited by over 60% of our entire survey takers, and UiPath at around 38%. Now don't get me wrong, 38% pervasion's fantastic, but you know you're not going to beat an entrenched Microsoft. So I don't really know where that comment came from. So UiPath, looking at it alone, it's doing incredibly well. It had a huge rebound in its net score this last survey. It had dropped going through the back half of 2022, but we saw a big spike in the last one. So it's got a net score of over 55%.
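For readers unfamiliar with the metrics Eric and Dave keep citing, here is a minimal illustrative sketch of how a "net score" and "pervasion" number like the ones above could be computed. It assumes ETR's commonly described methodology, the share of respondents adding or increasing spend minus the share decreasing or replacing; the response categories and the numbers below are invented for illustration and are not ETR data or ETR's actual code.

```python
from collections import Counter

# Hypothetical survey responses for one vendor; categories follow the buckets
# described in the conversation (adoption, increase, flat, decrease, replacement).
responses = ["adoption", "increase", "increase", "flat", "increase",
             "decrease", "adoption", "flat", "replacement", "increase"]

def net_score(responses):
    """Share of respondents spending more minus share spending less, in percent."""
    counts = Counter(responses)
    n = len(responses)
    positive = counts["adoption"] + counts["increase"]
    negative = counts["decrease"] + counts["replacement"]
    return 100.0 * (positive - negative) / n

def pervasion(citations, total_respondents):
    """Rough 'presence in the data set': how often a vendor is cited, in percent."""
    return 100.0 * citations / total_respondents

print(f"net score: {net_score(responses):.0f}%")   # 40% with the sample above
print(f"pervasion: {pervasion(citations=410, total_respondents=1181):.1f}%")  # ~34.7%
```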
A lot of people citing adoption and increasing. So that's really what you want to see for a name like this. The problem is that just Microsoft is doing its playbook. At the end of the day, I'm going to do a POC, why am I going to pay more for UiPath, or even take on another separate bill, when we know everyone's consolidating vendors, if my license already includes Microsoft Power Automate? It might not be perfect, it might not be as good, but what I'm hearing all the time is it's good enough, and I really don't want another invoice. >> Right. So how does UiPath, you know, and Automation Anywhere, how do they compete with that? Well, the way they compete with it is they got to have a better product. They got a product that's 10 times better. You know, they- >> Right. >> they're not going to compete based on where the lowest cost, Microsoft's got that locked up, or where the easiest to, you know, Microsoft basically give it away for free, and that's their playbook. So that's, you know, up to UiPath. UiPath brought on Rob Ensslin, I've interviewed him. Very, very capable individual, is now Co-CEO. So he's kind of bringing that adult supervision in, and really tightening up the go to market. So, you know, we know this company has been a rocket ship, and so getting some control on that and really getting focused like a laser, you know, could be good things ahead there for that company. Okay. >> One of the problems, if I could real quick Dave, is what the use cases are. When we first came out with RPA, everyone was super excited about like, "No, UiPath is going to be great for super powerful "projects, use cases." That's not what RPA is being used for. As you mentioned, it's being used for mundane tasks, so it's not automating complex things, which I think UiPath was built for. So if you were going to get UiPath, and choose that over Microsoft, it's going to be 'cause you're doing it for more powerful use case, where it is better. But the problem is that's not where the enterprise is using it. The enterprise are using this for base rote tasks, and simply, Microsoft Power Automate can do that. >> Yeah, it's interesting. I've had people on theCube that are both Microsoft Power Automate customers and UiPath customers, and I've asked them, "Well you know, "how do you differentiate between the two?" And they've said to me, "Look, our users and personal productivity users, "they like Power Automate, "they can use it themselves, and you know, "it doesn't take a lot of, you know, support on our end." The flip side is you could do that with UiPath, but like you said, there's more of a focus now on end-to-end enterprise automation and building out those capabilities. So it's increasingly a value play, and that's going to be obviously the challenge going forward. Okay, my last one, and then I think you've got some bonus ones. Number 10, hybrid events are the new category. Look it, if I can get a thousand inbounds that are largely self-serving, I can do my own here, 'cause we're in the events business. (Eric chuckling) Here's the prediction though, and this is a trend we're seeing, the number of physical events is going to dramatically increase. That might surprise people, but most of the big giant events are going to get smaller. The exception is AWS with Reinvent, I think Snowflake's going to continue to grow. 
So there are examples of physical events that are growing, but generally, most of the big ones are getting smaller, and there's going to be many more smaller, intimate regional events and road shows. These micro-events, they're going to be stitched together. Digital is becoming a first-class citizen, so people really got to get their digital acts together, and brands are prioritizing earned media, and they're beginning to build their own news networks, going direct to their customers. And so that's a trend we see, and I, you know, we're right in the middle of it, Eric, so you know we're going to, you mentioned RSA, I think that's perhaps going to be one of those crazy ones that continues to grow. It's shrunk, and then it, you know, 'cause last year- >> Yeah, it did shrink. >> right, it was the last one before the pandemic, and then they sort of made another run at it last year. It was smaller but it was very vibrant, and I think this year's going to be huge. Mobile World Congress is another one, we're going to be there end of Feb. That's obviously a big, big show, but in general, the brands and the technology vendors, even Oracle is going to scale down. I don't know about Salesforce. We'll see. You had a couple of bonus predictions. Quantum and maybe some others? Bring us home. >> Yeah, sure. I got a few more. I think we touched upon one, but I definitely think the data prep tools are facing extinction, unfortunately, you know, Talend and Informatica are some of those names. The problem there is that the BI tools are kind of including data prep in them already. You know, an example of that is Tableau Prep Builder, and then in addition, advanced NLP is being worked in as well. ThoughtSpot, Intelius, both often say that as their selling point. Tableau has Ask Data, Qlik has Insight Bot, so you don't have to really be intelligent on data prep anymore. A regular business user can just self-query, using either the search bar or even just speaking in what it needs, and these tools are kind of doing the data prep for it. I don't think that's a, you know, an out-in-left-field type of prediction, but the time is nigh. The other one I would also state is that I think knowledge graphs are going to break through this year. Neo4j in our survey is growing in pervasion and mindshare. So more and more people are citing it, AWS Neptune's getting its act together, and we're seeing that spending intentions are growing there. TigerGraph is also growing in our survey sample. I just think that the time is now for knowledge graphs to break through, and if I had to do one more, I'd say real-time streaming analytics moves downstream from the very, very rich big enterprises; more people are actually going to be moving towards real-time streaming, again, because the data prep tools and the data pipelines have gotten easier to use, and I think the ROI on real-time streaming is obviously there. So those are three that didn't make the cut, but I thought they deserved an honorable mention.
I love the fact that you're doing, you know, the ETS survey now, I think it's quarterly now, right? Is that right? >> Yep. >> Yep. So that's phenomenal. >> Four times a year. I'll be happy to jump on with you when we get that done. I know you were really impressed with that last time. >> It's unbelievable. There is so much data at ETR. Okay. Hey, that's a wrap. Thanks again. >> Take care, Dave. Good seeing you. >> All right, many thanks to our team here. Alex Myerson is on production; he manages the podcasts for us. Ken Schiffman as well is a critical component of our East Coast studio. Kristen Martin and Cheryl Knight help get the word out on social media and in our newsletters. And Rob Hof is our editor-in-chief. He's at siliconangle.com. He does great editing for us. Thank you all. Remember, all these episodes are available as podcasts, wherever you listen; the podcast is doing great. Just search "Breaking Analysis podcast." Really appreciate you guys listening. I publish each week on wikibon.com and siliconangle.com, or you can email me directly if you want to get in touch, david.vellante@siliconangle.com. That's how I got all these. I really appreciate it. I went through every single one with a yellow highlighter. It took some time, (laughing) but I appreciate it. You can DM me @dvellante, or comment on our LinkedIn post, and please check out etr.ai. Its data is amazing. Best survey data in the enterprise tech business. This is Dave Vellante for theCUBE Insights, powered by ETR. Thanks for watching, and we'll see you next time on "Breaking Analysis." (upbeat music beginning) (upbeat music ending)
Is Data Mesh the Next Killer App for Supercloud?
(upbeat music) >> Welcome back to our Supercloud 2 event live coverage here of stage performance in Palo Alto syndicating around the world. I'm John Furrier with Dave Vellante. We got exclusive news and a scoop here for SiliconANGLE in theCUBE. Zhamak Dehghani, creator of data mesh has formed a new company called Nextdata.com, Nextdata. She's a cube alumni and contributor to our supercloud initiative, as well as our coverage and Breaking Analysis with Dave Vellante on data, the killer app for supercloud. Zhamak, great to see you. Thank you for coming into the studio and congratulations on your newly formed venture and continued success on the data mesh. >> Thank you so much. It's great to be here. Great to see you in person. >> Dave: Yeah, finally. >> Wonderful. Your contributions to the data conversation has been well documented certainly by us and others in the industry. Data mesh taking the world by storm. Some people are debating it, throwing cold water on it. Some are thinking it's the next big thing. Tell us about the data mesh, super data apps that are emerging out of cloud. >> I mean, data mesh, as you said, the pain point that it surface were universal. Everybody said, "Oh, why didn't I think of that?" It was just an obvious next step and people are approaching it, implementing it. I guess the last few years I've been involved in many of those implementations and I guess supercloud is somewhat a prerequisite for it because it's data mesh and building applications using data mesh is about sharing data responsibly across boundaries. And those boundaries include organizational boundaries, cloud technology boundaries, and trust boundaries. >> I want to bring that up because your venture, Nextdata, which is new just formed. Tell us about that. What wave is that riding? What specifically are you targeting? What's the pain point? >> Absolutely. Yes, so Nextdata is the result of, I suppose the pains that I suffered from implementing data mesh for many of the organizations. Basically a lot of organizations that I've worked with they want decentralized data. So they really embrace this idea of decentralized ownership of the data, but yet they want interconnectivity through standard APIs, yet they want discoverability and governance. So they want to have policies implemented, they want to govern that data, they want to be able to discover that data, and yet they want to decentralize it. And we do that with a developer experience that is easy and native to a generalist developer. So we try to find the, I guess the common denominator that solves those problems and enables that developer experience for data sharing. >> Since you just announced the news, what's been the reaction? >> I just announced the news right now, so what's the reaction? >> But people in the industry know you did a lot of work in the area. What have been some of the feedback on the new venture in terms of the approach, the customers, problem? >> Yeah, so we've been in stealth mode so we haven't publicly talked about it, but folks that have been close to us, in fact have reached that we already have implementations of our pilot platform with early customers, which is super exciting. And we going to have multiple of those. Of course, we're a tiny, tiny company. We can have many of those, but we are going to have multiple pilot implementations of our platform in real world where real global large scale organizations that have real world problems. So we're not going to build our platform in vacuum. 
And that's what's happening right now. >> Zhamak, when I think about your role at ThoughtWorks, you had a very wide observation space with a number of clients, helping them implement data mesh and other things as well prior to your data mesh initiative. But when I look at data mesh, at least the ones that I've seen, they're very narrow. I think of JPMC, I think of HelloFresh. They're generally, obviously not surprising, they don't include the big vision of inclusivity across clouds, across different data storage. But it seems like people are having to go through some gymnastics to get to the organizational reality of decentralizing data and at least pushing data ownership to the line of business. How are you approaching, or are you approaching solving that problem? Are you taking a narrow slice? What can you tell us about Nextdata? >> Yeah, absolutely. Gymnastics, the cute word to describe what the organizations have to go through. And one of those problems is that the data as you know resides on different platforms, it's owned by different people, is processed by pipelines that who knows who owns them. So there's this very disparate and disconnected set of technologies that were very useful for when we thought about data and processing as a centralized problem. But when you think about data as a decentralized problem the cost of integration of these technologies in a cohesive developer experience is what's missing. And we want to focus on that cohesive end-to-end developer experience to share data responsibly in these autonomous units. We call them data products, I guess in data mesh. That constitutes computation. That governs that data policies, discoverability. So I guess, I heard this expression in the last talks that you can have your cake and eat it too. So we want people have their cakes, which is data in different places, decentralization, and eat it too, which is interconnected access to it. So we start with standardizing and codifying this idea of a data product container that encapsulates data computation APIs to get to it in a technology agnostic way, in an open way. And then sit on top and use existing tech, Snowflake, Databricks, whatever exists, the millions of dollars of investments that companies have made, sit on top of those but create this cohesive, integrated experience where data product is a first class primitive. And that's really key here. The language and the modeling that we use is really native to data mesh, which is that I'm building a data product I'm sharing a data product, and that encapsulates I'm providing metadata about this. I'm providing computation that's constantly changing the data. I'm providing the API for that. So we we're trying to kind of codify and create a new developer experience based on that. And developer, both from provider side and user side, connected to peer-to-peer data sharing with data product as a primitive first class concept. >> So the idea would be developers would build applications leveraging those data products, which are discoverable and governed. Now today you see some companies, take a Snowflake for example, attempting to do that within their own little walled garden. They even at one point used the term mesh. I don't know if they pull back on that. And then they became aware of some of your work. But a lot of the things that they're doing within their little insulated environment support that governance, they're building out an ecosystem. What's different in your vision? >> Exactly. 
So we realized that, and this is a reality, like you go to organizations, they have a Snowflake and half of the organization happily operates on Snowflake. And on the other half, "oh, we are on Bare infrastructure on AWS or we are on Databricks." This is the reality. This supercloud that's written up here, it's about working across boundaries of technology. So we try to embrace that. And even for our own technology with the way we're building it, we say, "Okay, nobody's going to use Nextdata, data mesh operating system. People will have different platforms." So you have to build with openness in mind and in case of Snowflake, I think, they have very, I'm sure very happy customers as long as customers can be on Snowflake. But once you cross that boundary of platforms then that becomes a problem. And we try to keep that in mind in our solution. >> So it's worth reviewing that basically the concept of data mesh is that whether you're a data lake or a data warehouse, an S3 bucket, an Oracle database as well, they should be inclusive inside of the data. >> We did a session with AWS on the startup showcase, data as code. And remember I wrote a blog post in 2007 called "Data as the New Developer Kit" back then we used to call them developer kits if you remember. And that we said at that time, whoever can code data will have a competitive advantage. >> Aren't the machines going to be doing that? Didn't we just hear that? >> Well, we have. Hey, Siri. Hey, Cube, find me that best video for data mesh. There it is. But this is the point, like what's happening is that now data has to be addressable. for machines and for coding because as you need to call the data. So the question is how do you manage the complexity of big things as promiscuous as possible, making it available, as well as then governing it? Because it's a trade off. The more you make open, the better the machine learning. But yet the governance issue, so this is the, you need an OS to handle this maybe. >> Yes. So yes, well we call, our mental model for our platform is an OS operating system. Operating systems have shown us how you can abstract what's complex and take care of a lot of complexities, but yet provide an open and dynamic enough interface. So we think about it that way. Just, we try to solve the problem of policies live with the data, an enforcement of the policies happens at the most granular level, which is in this concept of the data product. And that would happen whether you read, write or access a data product. But we can never imagine what are these policies could be. So our thinking is we should have a policy, open policy framework that can allow organizations write their own policy drivers and policy definitions and encode it and encapsulated in this data product container. But I'm not going to fool myself to say that, that's going to solve the problem that you just described. I think we are in this, I don't know, if I look into my crystal ball, what I think might happen is that right now the primitives that we work with to train machine learning model are still bits and bytes and data. They're fields, rows, columns and that creates quite a large surface area and attack area for privacy of the data. So perhaps one of the trends that we might see is this evolution of data APIs to become more and more computational aware to bring the compute to the data to reduce that surface area. So you can really leave the control of the data to the sovereign owners of that data. So that data product. 
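As an illustrative aside on the data product container Zhamak describes, data, computation, an API, and the policies that travel with it, here is a minimal sketch of the idea. Nextdata has not published an API, so every name, field and call below is a hypothetical stand-in for the concept, not their implementation or any real product's.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical sketch of a "data product container": the product carries its own
# transformation code, output API, metadata, and policies, rather than relying on
# a central pipeline team. Purely illustrative; names are invented.

@dataclass
class Policy:
    name: str
    check: Callable[[dict, dict], bool]  # (consumer, request) -> allowed?

@dataclass
class DataProduct:
    name: str
    owner_domain: str
    transform: Callable[[], List[dict]]            # computation lives with the data
    policies: List[Policy] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)

    def read(self, consumer: dict, request: dict) -> List[dict]:
        # Enforcement happens at the most granular level: the product itself.
        for policy in self.policies:
            if not policy.check(consumer, request):
                raise PermissionError(f"{policy.name} denied for {consumer['id']}")
        return self.transform()

# Example: a 'rides' data product owned by the trips domain.
rides = DataProduct(
    name="rides",
    owner_domain="trips",
    transform=lambda: [{"ride_id": 1, "eta_min": 7}],
    policies=[Policy("region_restriction",
                     lambda consumer, request: consumer.get("region") == "eu")],
    metadata={"format": "parquet", "refresh": "hourly"},
)

print(rides.read(consumer={"id": "analytics", "region": "eu"}, request={}))
```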
So I think that evolution of our data APIs perhaps will become more and more computational. So you describe what you want and the data owner decides how to manage. >> That's interesting, Dave, 'cause it's almost like we just talked about ChatGPT in the last segment we had with you. It was a machine learning have been around the industry. It's almost as if you're starting to see reason come into, the data reasoning is like starting to see not just metadata. Using the data to reason so that you don't have to expose the raw data. So almost like a, I won't say curation layer, but an intelligence layer. >> Zhamak: Exactly. >> Can you share your vision on that? 'Cause that seems to be where the dots are connecting. >> Yes, perhaps further into the future because just from where we stand, we have to create still that bridge of familiarity between that future and present. So we are still in that bridge making mode. However, by just the basic notion of saying, "I'm going to put an API in front of my data." And that API today might be as primitive as a level of indirection, as in you tell me what you want, tell me who you are, let me go process that, all the policies and lineage and insert all of this intelligence that need to happen. And then today, I will still give you a file. But by just defining that API and standardizing it now we have this amazing extension point that we can say, "Well, the next revision of this API, you not just tell me who you are, but you actually tell me what intelligence you're after. What's a logic that I need to go and now compute on your API?" And you can evolve that. Now you have a point of evolution to this very futuristic, I guess, future where you just described the question that you're asking from the ChatGPT. >> Well, this is the supercloud, go ahead, Dave. >> I have a question from a fan, I got to get it in. It's George Gilbert. And so his question is, you're blowing away the way we synchronize data from operational systems to the data stack to applications. So the concern that he has and he wants your feedback on this, is the data product app devs get exposed to more complexity with respect to moving data between data products or maybe it's attributes between data products? How do you respond to that? How do you see? Is that a problem? Is that something that is overstated or do you have an answer for that? >> Absolutely. So I think there's a sweet spot in getting data developers, data product developers closer to the app, but yet not overburdening them with the complexity of the application and application logic and yet reducing their cognitive load by localizing what they need to know about, which is that domain where they're operating within. Because what's happening right now? What's happening right now is that data engineers with, a ton of empathy for them for their high threshold of pain that they can deal with, they have been centralized, they've put into the data team, and they have been given this unbelievable task of make meaning out of data, put semantic over it, curate it, cleans it, and so on. So what we are saying is that get those folks embedded into the domain closer to the application developers. These are still separately moving units. Your app and your data products are independent, but yet tightly closed with each other, tightly coupled with each other based on the context of the domain. 
So reduce cognitive load by localizing what they need to know about to the domain, get them closer to the application, but yet have them separate from app because app provides a very different service. Transactional data for my e-commerce transaction. Data product provides a very different service. Longitudinal data for the variety of this intelligent analysis that I can do on the data. But yet it's all within the domain of e-commerce or sales or whatnot. >> It's a lot of decoupling and coupling create that cohesiveness architecture. So I have to ask you, this is an interesting question 'cause it came up on theCUBE all last year. Back on the old server data center days and cloud, SRE, Google coined the term, site reliability engineer, for someone to look over the hundreds of thousands of servers. We asked the question to data engineering community who have been suffering, by the way, I agree. Is there an SRE like role for data? Because in a way data engineering, that platform engineer, they are like the SRE for data. In other words managing the large scale to enable automation and cell service. What's your thoughts and reaction to that? >> Yes, exactly. So maybe we go through that history of how SRE came to be. So we had the first DevOps movement, which was remove the wall between dev and ops and bring them together. So you have one unit of one cross-functional units of the organization that's responsible for you build it, you run it. So then there is no, I'm going to just shoot my application over the wall for somebody else to manage it. So we did that and then we said, okay, there is a ton, as we decentralized and had these many microservices running around, we had to create a layer that abstracted a lot of the complexity around running now a lot or monitoring, observing, and running a lot while giving autonomy to this cross-functional team. And that's where the SRE, a new generation of engineers came to exist. So I think if I just look at. >> Hence, Kubernetes. >> Hence, hence, exactly. Hence, chaos engineering. Hence, embracing the complexity and messiness. And putting engineering discipline to embrace that and yet give a cohesive and high integrity experience of those systems. So I think if we look at that evolution, perhaps something like that is happening by bringing data and apps closer and make them these domain-oriented data product teams or domain-oriented cross-functional teams full stop and still have a very advanced maybe at the platform level, infrastructure level operational team that they're not busy doing two jobs, which is taking care of domains and the infrastructure, but they're building infrastructure that is embracing that complexity, interconnectivity of this data process. >> So you see similarities? >> I see, absolutely. But I feel like we're probably in a more early days of that movement. >> So it's a data DevOps kind of thing happening where scales happening. It's good things are happening, yet a little bit fast and loose with some complexities to clean up. >> Yes. This is a different restructure. As you said, the job of this industry as a whole, an architect, is decompose recompose, decompose recompose in new way and now we're like decomposing centralized team, recomposing them as domains. >> So is data mesh the killer app for supercloud? >> You had to do this to me. >> Sorry, I couldn't resist. >> I know. Of course you want me to say this. >> Yes. >> Yes, of course. 
I mean, supercloud, I think it's really, the terminology supercloud, open cloud, but I think in the spirit of it, this embracing of diversity and giving autonomy to people to make decisions about what's right for them and not lock them in. I think just embracing that is baked into how data mesh assumes the world would work. >> Well, thank you so much for coming on Supercloud 2. We really appreciate it. Data has driven this conversation. The success of data mesh has really opened up the conversation and exposed the slow-moving data industry. >> Dave: Been a great catalyst. >> That's now going well. We can move faster. So thanks for coming on. >> Thank you for hosting me. It was wonderful. >> Supercloud 2, live here in Palo Alto, our stage performance. I'm John Furrier with Dave Vellante. We'll be back with more after this short break. Stay with us all day for Supercloud 2. (upbeat music)
Closing Remarks | Supercloud2
>> Welcome back everyone to the closing remarks here before we kick off our ecosystem portion of the program. We're live in Palo Alto for theCUBE special presentation of Supercloud 2. It's the second edition, the first one was in August. I'm John Furrier with Dave Vellante. Here to wrap up with our special guest analyst George Gilbert, investor and industry legend former colleague of ours, analyst at Wikibon. George great to see you. Dave, you know, wrapping up this day what in a phenomenal program. We had a contribution from industry vendors, industry experts, practitioners and customers building and redefining their company's business model. Rolling out technology for Supercloud and multicloud and ultimately changing how they do data. And data was the theme today. So very, very great program. Before we jump into our favorite parts let's give a shout out to the folks who make this possible. Free contents our mission. We'll always stay true to that mission. We want to thank VMware, alkira, ChaosSearch, prosimo for being sponsors of this great program. We will have Supercloud 3 coming up in a month or so, or two months. We'll see. Or sooner, we don't know. But it'll be more about security, but a lot more momentum. Okay, so that's... >> And don't forget too that this program not going to end now. We've got a whole ecosystem speaks track so stay tuned for that. >> John: Yeah, we got another 20 interviews. Feels like it. >> Well, you're going to hear from Saks, Veronika Durgin. You're going to hear from Western Union, Harveer Singh. You're going to hear from Ionis Pharmaceuticals, Nick Taylor. Brian Gracely chimes in on Supecloud. So he's the man behind the cloud cast. >> Yeah, and you know, the practitioners again, pay attention to also to the cloud networking interviews. Lot of change going on there that's going to be disruptive and actually change the landscape as well. Again, as Supercloud progresses to be the next big thing. If you're not on this next wave, you'll drift what, as Pat Gelsinger says. >> Yep. >> To kick off the closing segments, George, Dave, this is a wave that's been identified. Again, people debate the word all you want Supercloud. It is a gateway to multicloud eventually it is the standard for new applications, new ways to do data. There's new computer science being generated and customer requirements being addressed. So it's the confluence of, you know, tectonic plates shifting in the industry, new computer science seeing things like AI and machine learning and data at the center of it and new infrastructure all kind of coming together. So, to me, that's my takeaway so far. That is the big story and it's going to change society and ultimately the business models of these companies. >> Well, we've had 10, you know, you think about it we came out of the financial crisis. We've had 10, 12 years despite the Covid of tech success, right? And just now CIOs are starting to hit the brakes. And so my point is you've had all this innovation building up for a decade and you've got this massive ecosystem that is running on the cloud and the ecosystem is saying, hey, we can have even more value by tapping best of of breed across clouds. And you've got customers saying, hey, we need help. We want to do more and we want to point our business and our intellectual property, our software tooling at our customers and monetize our data. So you have all these forces coming together and it's sort of entering a new era. 
>> George, I want to go to you for a second because you are a big contributor to this event. Your interview with Bob Muglia with Dave was, I thought, a watershed moment for me to hear that the data apps, how databases are being rethought, because we've been seeing a diversity of databases with Amazon Web Services, you know, promoting the idea that no one database rules the world. Now it's not one database kind of architecture that's pulling these new apps. What's your takeaway from this event? >> So if you keep your eye on this North Star where instead of building apps that are based on code you're building apps that are defined by data coming off of things that are linked to the real world like people, places, things and activities. Then the idea is, and the example we use is, you know, Uber, but it could be, you know, amazon.com is defined by stuff coming off data in the Amazon ecosystem or marketplace. And then the question is, and everyone was talking at different angles on this, which was, where does the data live? How much do you hide from the developer? You know, and when can you offer that? You know, and you started with Walmart, which was describing apps, traditional apps that are just code. And frankly that's easier to make cross-cloud and, you know, essentially location independent. As soon as you have data you need data management technology that a customer does not have the sophistication to build. And then the argument was like, so how much can you hide from the developer who's building data apps? Tristan's version was you take the modern data stack and you start adding these APIs that define business concepts like bookings, billings and revenue, you know, or in the Uber example like drivers and riders, you know, and ETAs and prices. But those things still execute on the data warehouse or data lakehouse. Then Bob Muglia was saying you're not really hiding enough from the developer because you still got to say how to do all that. And his vision is not only do you hide where the data is but you hide how to sort of get at all that code by just saying what you want. You define how a car and how a driver and how a rider works. And then those things automatically figure out underneath the cover. >> So huge challenges, right? There's governance, there's security, they could be big blockers to, you know, the Supercloud, but the industry's going to be attacking that problem. >> Well, what's your take? What's your favorite segment? Zhamak Dehghani came on, she's starting a new company, exclusive news. That was a big, notable moment for theCUBE. She launched her company. She pioneered the data mesh concept. And I think what George is saying and what data mesh points to is something that we've been saying for a long time. That data is now going to flip the script on how apps behave. And the Uber example I think is illustrative 'cause people can relate to Uber. But imagine that for every business, whether it's a manufacturing business or retail or oil and gas or FinTech, they can look at their business like a game, almost gamify it with data: riders, cars, you know, moving data around, the value of data. This is something that Adam Selipsky teased out at AWS, Dave. So what's your takeaway from this Supercloud? Where are we in your mind? >> Well, the big thing is data products and decentralizing your data architecture, but putting data in the hands of domain experts who can actually monetize the data. And I think that's, to me that's really exciting.
Because look, data products financial industry has always been doing building data products. Mortgage backed securities is a data product. But why should the financial industry have all the fun? I mean virtually every organization can tap its ecosystem build data products, take its internal IP and processes and software and point it to the world and actually begin to make money out of it. >> Okay, so let's go around the horn. I'll start, I'll get you guys some time to think. Next question, what did you learn today? I learned that I think it's an infrastructure game and talking to Kit Colbert at VMware, I think it's all about infrastructure refactoring and I think the data's going to be an ingredient that's going to be operating system like. I think you're going to see the infrastructure influencing operations that will enable Superclouds to be real. And developers won't even know what a Supercloud is because they'll be using it. It's the operations focus is going to be very critical. Just like DevOps movements started Cloud native I think you're going to see a data native movement and I think infrastructure is critical as people go to the next level. That's my big takeaway today. And I'll say the data conversation is at the center. I think security, data are going to be always active horizontally scalable concepts, but every company's going to reset their infrastructure, how it looks and if it's not set up for data and or things that there need to be agile on, it's going to be a non-starter. So I think that's the cloud NextGen, distributed computing. >> I mean, what came into focus for me was I think the hyperscaler is going to continue to do their thing, you know, and be very, very successful and they're each coming at it from different approaches. We talk about this all the time in theCUBE. Amazon the best infrastructure, you know, Google's got its you know, data and AI thing and it's playing catch up and Microsoft's got this massive estate. Okay, cool. Check. The next wave of innovation which is coming from data, I've always said follow the data. That's where the where the money's going to be is going to come from other places. People want to be able to, organizations want to be able to share data across clouds across their organization, outside of their ecosystem and make money with that data sharing. They don't want to FTP it anymore. I got it. You take it. They want to work with live data in real time and I think the edge, we didn't talk much about the edge today is going to even take that to a new level real time inferencing at the edge, AI and and being able to do new things with data that we haven't even seen. But playing around with ChatGPT, it's blowing our mind. And I think you're right, it's like when we first saw the browser, holy crap, this is going to change the world. >> Yeah. And the ChatGPT by the way is going to create a wave of machine learning and data refactoring for sure. But also Howie Liu had an interesting comment, he was asked by a VC how much to replicate that and he said it's in the hundreds of millions, not billions. Now if you asked that same question how much does it cost to replicate AWS? The CapEx alone is unstoppable, they're already done. So, you know, the hyperscalers are going to continue to boom. I think they're going to drive the infrastructure. I think Amazon's going to be really strong at silicon and physics and squeeze every ounce atom out of every physical thing and then get latency as your bottleneck and the rest is all going to be... 
>> That never blew me away, a hundred million to create kind of an open AI, you know, competitor. Look at companies like Lacework. >> John: Some people have that much cash on the balance sheet. >> These are security companies that have raised a billion dollars, right? To compete. You know, so... >> If you're not shifting left what do you do with data, shift up? >> But, you know. >> What did you learn, George? >> I'm listening to you and I think you're helping me crystallize something which is the software infrastructure to enable the data apps is wide open. The way Zhamak described it is like if you want a data product like a sales and operation plan, that is built on other data products, like a sales plan which has a forecast in it, it has a production plan, it has a procurement plan and then a sales and operation plan is actually a composition of all those and they call each other. Now in her current platform, you need to expose to the developer a certain amount of mechanics on how to move all that data, when to move it. Like what happens if something fails. Now Muglia is saying I can hide that completely. So all you have to say is what you want and the underlying machinery takes care of everything. The problem is Muglia stuff is still a few years off. And Tristan is saying, I can give you much of that today but it's got to run in the data warehouse. So this trade offs all different ways. But again, I agree with you that the Cloud platform vendors or the ecosystem participants who can run across Cloud platforms and private infrastructure will be the next platform. And then the cloud platform is sort of where you run the big honking centralized stuff where someone else manages the operations. >> Sounds like middleware to me, Dave >> And key is, I'll just end with this. The key is being able to get to the data, whether it's in a data warehouse or a data lake or a S3 bucket or an object store, Oracle database, whatever. It's got to be inclusive that is critical to execute on the vision that you just talked about 'cause that data's in different systems and you're not going to put it all into some new system. >> So creating middleware in the cloud that sounds what it sounds like to me. >> It's like, you discovered PaaS >> It's a super PaaS. >> But it's platform services 'cause PaaS connotes like a tightly integrated platform. >> Well this is the real thing that's going on. We're going to see how this evolves. George, great to have you on, Dave. Thanks for the summary. I enjoyed this segment a lot today. This ends our stage performance live here in Palo Alto. As you know, we're live stage performance and syndicate out virtually. Our afternoon program's going to kick in now you're going to hear some great interviews. We got ChaosSearch. Defining the network Supercloud from prosimo. Future of Cloud Network, alkira. We got Saks, a retail company here, Veronika Durgin. We got Dave with Western Union. So a lot of customers, a pharmaceutical company Warner Brothers, Discovery, media company. And then you know, what is really needed for Supercloud, good panels. So stay with us for the afternoon program. That's part two of Supercloud 2. This is a wrap up for our stage live performance. I'm John Furrier with Dave Vellante and George Gilbert here wrapping up. Thanks for watching and enjoy the program. (bright music)
Breaking Analysis: Supercloud2 Explores Cloud Practitioner Realities & the Future of Data Apps
>> Narrator: From theCUBE Studios in Palo Alto and Boston bringing you data-driven insights from theCUBE and ETR. This is breaking analysis with Dave Vellante >> Enterprise tech practitioners, like most of us they want to make their lives easier so they can focus on delivering more value to their businesses. And to do so, they want to tap best of breed services in the public cloud, but at the same time connect their on-prem intellectual property to emerging applications which drive top line revenue and bottom line profits. But creating a consistent experience across clouds and on-prem estates has been an elusive capability for most organizations, forcing trade-offs and injecting friction into the system. The need to create seamless experiences is clear and the technology industry is starting to respond with platforms, architectures, and visions of what we've called the Supercloud. Hello and welcome to this week's Wikibon Cube Insights powered by ETR. In this breaking analysis we give you a preview of Supercloud 2, the second event of its kind that we've had on the topic. Yes, folks that's right Supercloud 2 is here. As of this recording, it's just about four days away 33 guests, 21 sessions, combining live discussions and fireside chats from theCUBE's Palo Alto Studio with prerecorded conversations on the future of cloud and data. You can register for free at supercloud.world. And we are super excited about the Supercloud 2 lineup of guests whereas Supercloud 22 in August, was all about refining the definition of Supercloud testing its technical feasibility and understanding various deployment models. Supercloud 2 features practitioners, technologists and analysts discussing what customers need with real-world examples of Supercloud and will expose thinking around a new breed of cross-cloud apps, data apps, if you will that change the way machines and humans interact with each other. Now the example we'd use if you think about applications today, say a CRM system, sales reps, what are they doing? They're entering data into opportunities they're choosing products they're importing contacts, et cetera. And sure the machine can then take all that data and spit out a forecast by rep, by region, by product, et cetera. But today's applications are largely about filling in forms and or codifying processes. In the future, the Supercloud community sees a new breed of applications emerging where data resides on different clouds, in different data storages, databases, Lakehouse, et cetera. And the machine uses AI to inspect the e-commerce system the inventory data, supply chain information and other systems, and puts together a plan without any human intervention whatsoever. Think about a system that orchestrates people, places and things like an Uber for business. So at Supercloud 2, you'll hear about this vision along with some of today's challenges facing practitioners. Zhamak Dehghani, the founder of Data Mesh is a headliner. Kit Colbert also is headlining. He laid out at the first Supercloud an initial architecture for what that's going to look like. That was last August. And he's going to present his most current thinking on the topic. Veronika Durgin of Sachs will be featured and talk about data sharing across clouds and you know what she needs in the future. One of the main highlights of Supercloud 2 is a dive into Walmart's Supercloud. Other featured practitioners include Western Union Ionis Pharmaceuticals, Warner Media. 
We've got deep, deep technology dives with folks like Bob Muglia, David Flynn Tristan Handy of DBT Labs, Nir Zuk, the founder of Palo Alto Networks focused on security. Thomas Hazel, who's going to talk about a new type of database for Supercloud. It's several analysts including Keith Townsend Maribel Lopez, George Gilbert, Sanjeev Mohan and so many more guests, we don't have time to list them all. They're all up on supercloud.world with a full agenda, so you can check that out. Now let's take a look at some of the things that we're exploring in more detail starting with the Walmart Cloud native platform, they call it WCNP. We definitely see this as a Supercloud and we dig into it with Jack Greenfield. He's the head of architecture at Walmart. Here's a quote from Jack. "WCNP is an implementation of Kubernetes for the Walmart ecosystem. We've taken Kubernetes off the shelf as open source." By the way, they do the same thing with OpenStack. "And we have integrated it with a number of foundational services that provide other aspects of our computational environment. Kubernetes off the shelf doesn't do everything." And so what Walmart chose to do, they took a do-it-yourself approach to build a Supercloud for a variety of reasons that Jack will explain, along with Walmart's so-called triplet architecture connecting on-prem, Azure and GCP. No surprise, there's no Amazon at Walmart for obvious reasons. And what they do is they create a common experience for devs across clouds. Jack is going to talk about how Walmart is evolving its Supercloud in the future. You don't want to miss that. Now, next, let's take a look at how Veronica Durgin of SAKS thinks about data sharing across clouds. Data sharing we think is a potential killer use case for Supercloud. In fact, let's hear it in Veronica's own words. Please play the clip. >> How do we talk to each other? And more importantly, how do we data share? You know, I work with data, you know this is what I do. So if you know I want to get data from a company that's using, say Google, how do we share it in a smooth way where it doesn't have to be this crazy I don't know, SFTP file moving? So that's where I think Supercloud comes to me in my mind, is like practical applications. How do we create that mesh, that network that we can easily share data with each other? >> Now data mesh is a possible architectural approach that will enable more facile data sharing and the monetization of data products. You'll hear Zhamak Dehghani live in studio talking about what standards are missing to make this vision a reality across the Supercloud. Now one of the other things that we're really excited about is digging deeper into the right approach for Supercloud adoption. And we're going to share a preview of a debate that's going on right now in the community. Bob Muglia, former CEO of Snowflake and Microsoft Exec was kind enough to spend some time looking at the community's supercloud definition and he felt that it needed to be simplified. So in near real time he came up with the following definition that we're showing here. I'll read it. "A Supercloud is a platform that provides programmatically consistent services hosted on heterogeneous cloud providers." So not only did Bob simplify the initial definition he's stressed that the Supercloud is a platform versus an architecture implying that the platform provider eg Snowflake, VMware, Databricks, Cohesity, et cetera is responsible for determining the architecture. 
Now interestingly in the shared Google doc that the working group uses to collaborate on the supercloud de definition, Dr. Nelu Mihai who is actually building a Supercloud responded as follows to Bob's assertion "We need to avoid creating many Supercloud platforms with their own architectures. If we do that, then we create other proprietary clouds on top of existing ones. We need to define an architecture of how Supercloud interfaces with all other clouds. What is the information model? What is the execution model and how users will interact with Supercloud?" What does this seemingly nuanced point tell us and why does it matter? Well, history suggests that de facto standards will emerge more quickly to resolve real world practitioner problems and catch on more quickly than consensus-based architectures and standards-based architectures. But in the long run, the ladder may serve customers better. So we'll be exploring this topic in more detail in Supercloud 2, and of course we'd love to hear what you think platform, architecture, both? Now one of the real technical gurus that we'll have in studio at Supercloud two is David Flynn. He's one of the people behind the the movement that enabled enterprise flash adoption, that craze. And he did that with Fusion IO and he is now working on a system to enable read write data access to any user in any application in any data center or on any cloud anywhere. So think of this company as a Supercloud enabler. Allow me to share an excerpt from a conversation David Flore and I had with David Flynn last year. He as well gave a lot of thought to the Supercloud definition and was really helpful with an opinionated point of view. He said something to us that was, we thought relevant. "What is the operating system for a decentralized cloud? The main two functions of an operating system or an operating environment are one the process scheduler and two, the file system. The strongest argument for supercloud is made when you go down to the platform layer and talk about it as an operating environment on which you can run all forms of applications." So a couple of implications here that will be exploring with David Flynn in studio. First we're inferring from his comment that he's in the platform camp where the platform owner is responsible for the architecture and there are obviously trade-offs there and benefits but we'll have to clarify that with him. And second, he's basically saying, you kill the concept the further you move up the stack. So the weak, the further you move the stack the weaker the supercloud argument becomes because it's just becoming SaaS. Now this is something we're going to explore to better understand is thinking on this, but also whether the existing notion of SaaS is changing and whether or not a new breed of Supercloud apps will emerge. Which brings us to this really interesting fellow that George Gilbert and I RIFed with ahead of Supercloud two. Tristan Handy, he's the founder and CEO of DBT Labs and he has a highly opinionated and technical mind. Here's what he said, "One of the things that we still don't know how to API-ify is concepts that live inside of your data warehouse inside of your data lake. These are core concepts that the business should be able to create applications around very easily. In fact, that's not the case because it involves a lot of data engineering pipeline and other work to make these available. 
So if you really want to make it easy to create these data experiences for users you need to have an ability to describe these metrics and then to turn them into APIs to make them accessible to application developers who have literally no idea how they're calculated behind the scenes, and they don't need to." There are a lot of implications to this statement that we'll explore at Supercloud 2. Zhamak Dehghani's data mesh comes into play here with her critique of hyper-specialized data pipeline experts with little or no domain knowledge. Also the need for simplified self-service infrastructure, which Kit Colbert is likely going to touch upon. Veronika Durgin of Saks and her ideal state for data sharing, along with Harveer Singh of Western Union. They've got to deal with 200 locations around the world and data privacy issues, data sovereignty. How do you share data safely? Same with Nick Taylor of Ionis Pharmaceuticals. And not to blow your mind, but Thomas Hazel and Bob Muglia posit that to make data apps a reality across the Supercloud you have to rethink everything. You can't just let in-memory databases and caching architectures take care of everything in a brute force manner. Rather you have to get down to really detailed levels, even things like how data is laid out on disk, i.e. flash, and think about rewriting applications for the Supercloud and the ML and AI era. All of this and more at Supercloud 2, which wouldn't be complete without some data. So we pinged our friends from ETR, Eric Bradley and Daren Brabham, to see if they had any data on Supercloud that we could tap. And so we're going to be analyzing a number of the players as well at Supercloud 2. Now, many of you are familiar with this graphic, where we show some of the players involved in delivering or enabling Supercloud-like capabilities. On the Y axis is spending momentum and on the horizontal axis is market presence, or pervasiveness in the data. So net score versus what they call overlap, or N, in the data. And the table insert shows how the dots are plotted. Now, not to steal ETR's thunder, but the first point is you really can't have Supercloud without the hyperscale cloud platforms, which is shown on this graphic. But the exciting aspect of Supercloud is the opportunity to build value on top of that hyperscale infrastructure. Snowflake here continues to show strong spending velocity, as do Databricks, HashiCorp, Rubrik. VMware Tanzu, which we all put under the magnifying glass after the Broadcom announcements, is also showing momentum. Unfortunately due to a scheduling conflict we weren't able to get Red Hat on the program, but they're clearly a player here. And we've put Cohesity and Veeam on the chart as well because backup is a likely use case across clouds and on-premises. And now one other call out that we drill down on at Supercloud 2 is Cloudflare, which actually uses the term supercloud maybe in a different way. They look at Supercloud really as, you know, serverless on steroids. And so the data brains at ETR will have more to say on this topic at Supercloud 2 along with many others. Okay, so why should you attend Supercloud 2? What's in it for me kind of thing? So first of all, if you're a practitioner and you want to understand what the possibilities are for doing cross-cloud services, for monetizing data, how your peers are doing data sharing, how some of your peers are actually building out a Supercloud, you're going to get real world input from practitioners.
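To make the metrics-as-API idea Tristan describes above a little more concrete, here is a minimal Python sketch. It is not dbt's actual interface; the metric definitions, the run_query helper and the warehouse connection are hypothetical stand-ins for whatever semantic layer is actually used:

```python
# Hypothetical sketch: expose warehouse metrics as a simple API so application
# developers never see the underlying SQL. Not a real dbt or Snowflake API.

METRICS = {
    # metric name -> SQL that computes it (owned by the data team)
    "monthly_bookings": """
        SELECT date_trunc('month', booked_at) AS month, sum(amount) AS value
        FROM bookings GROUP BY 1 ORDER BY 1
    """,
    "active_riders": """
        SELECT date_trunc('day', ride_at) AS day, count(DISTINCT rider_id) AS value
        FROM rides GROUP BY 1 ORDER BY 1
    """,
}

def run_query(sql: str):
    """Placeholder for a warehouse call (Snowflake, Redshift, etc.)."""
    raise NotImplementedError("wire this to your warehouse connector")

def get_metric(name: str):
    """What an application developer calls; the SQL stays hidden."""
    if name not in METRICS:
        raise KeyError(f"unknown metric: {name}")
    return run_query(METRICS[name])

# Application code just asks for the business concept:
# bookings = get_metric("monthly_bookings")
```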
If you're a technologist, you're trying to figure out various ways to solve problems around data, data sharing, cross-cloud service deployment there's going to be a number of deep technology experts that are going to share how they're doing it. We're also going to drill down with Walmart into a practical example of Supercloud with some other examples of how practitioners are dealing with cross-cloud complexity. Some of them, by the way, are kind of thrown up their hands and saying, Hey, we're going mono cloud. And we'll talk about the potential implications and dangers and risks of doing that. And also some of the benefits. You know, there's a question, right? Is Supercloud the same wine new bottle or is it truly something different that can drive substantive business value? So look, go to Supercloud.world it's January 17th at 9:00 AM Pacific. You can register for free and participate directly in the program. Okay, that's a wrap. I want to give a shout out to the Supercloud supporters. VMware has been a great partner as our anchor sponsor Chaos Search Proximo, and Alura as well. For contributing to the effort I want to thank Alex Myerson who's on production and manages the podcast. Ken Schiffman is his supporting cast as well. Kristen Martin and Cheryl Knight to help get the word out on social media and at our newsletters. And Rob Ho is our editor-in-chief over at Silicon Angle. Thank you all. Remember, these episodes are all available as podcast. Wherever you listen we really appreciate the support that you've given. We just saw some stats from from Buzz Sprout, we hit the top 25% we're almost at 400,000 downloads last year. So really appreciate your participation. All you got to do is search Breaking Analysis podcast and you'll find those I publish each week on wikibon.com and siliconangle.com. Or if you want to get ahold of me you can email me directly at David.Vellante@siliconangle.com or dm me DVellante or comment on our LinkedIn post. I want you to check out etr.ai. They've got the best survey data in the enterprise tech business. This is Dave Vellante for theCUBE Insights, powered by ETR. Thanks for watching. We'll see you next week at Supercloud two or next time on breaking analysis. (light music)
ML & AI Keynote Analysis | AWS re:Invent 2022
>> Hey, welcome back everyone. Day three of AWS re:Invent 2022. I'm John Furrier with Dave Vellante, co-host of theCUBE. Dave, 10 years for us, the leader in high tech coverage is our slogan. Now 10 years of re:Invent. We've been to every single one except the original, which we would've come to if Amazon actually marketed the event, but they didn't. It's more of a customer event. This is day three. It's the machine learning and AI keynote, Swami's up there. A lot of announcements. We're gonna break this down. We got, we got Andy Thurai here, Vice President and Principal Analyst at Constellation Research. Andy, great to see you. You've been on theCUBE before, one of our analysts bringing the, bringing the analysis and commentary to the keynote. This is your wheelhouse, AI. What do you think about Swami up there? I mean, he's awesome. We love him. Big fan. Oh yeah, of, of theCUBE, and we're fans of him, but he got 13 announcements. >> A lot. A lot, >> A lot. >> So, well some of them are, first of all, thanks for having me here and I'm glad to have both of you on the same show attacking me. I'm just kidding. But some of the announcements really are sort of like game changer announcements, and some of them are like, meh, you know, just to plug in the holes in what they have, and a lot of golf claps. Yeah. Meeting today. And you could have also noticed that by, when he was making the announcements, you know, the, the, the clapping volume difference, you could say which is better, right? But some of the announcements are, are really, really good. You know, particularly we talked about, one of that was Microsoft took that out of, you know, having the OpenAI in there, doing the large language models. And then they were going after that, you know, having the transformer available to them. And Amazon was a little bit weak in the area, so they couldn't, they don't have a large language model. So, you know, they, they are taking a different route saying that, you know what, I'll help you train the large language model by yourself, customized models. So I can provide the necessary instance. I can provide the instance, volume, memory, the whole thing. Yeah. So you can train the model by yourself without depending on them, kind >> Of thing. So Dave and Andy, I wanna get your thoughts 'cause first of all, we've been following Amazon's deep bench on the, on the infrastructure, PaaS. They've been doing a lot of machine learning and AI, a lot of data. It just seems that the sentiment is that there's other competitors doing a good job too. Like Google, Dave. And I've heard folks in the hallway, even here, ex-Amazonians saying, hey, they train their models on Google, then they bring up the SageMaker 'cause it's a better interface. So you got Google's making a play for being that data cloud. Microsoft's obviously putting in a, a great kind of package to kind of make it turnkey. How do they really stand versus the competition, guys? >> Good question. So they, you know, each have their own uniqueness and the variation that they take to the field, right? So for example, if you were to look at it, Microsoft is known for the industry-oriented things that they have been going after, you know, industry verticals and whatnot. So that's one of the things I looked at here, you know, they, they had this Omics announcement, particularly towards that healthcare genomics space. That's a huge space for HPC-related AI/ML applications.
And they have put a lot of things together in here, in SageMaker and in their models, saying, you know, how do you, how do you use this transformer to do things like that? Like for example, drug discovery, genomics analysis, cancer treatment, the whole thing, right? Those are huge volumes of data too. So they're going in that healthcare area. Google has taken a different route. I mean they want to make everything simple. All I have to do is call an API, give what I need and then get it done. But Amazon wants to go at a much deeper level saying, you know what? I wanna provide everything you need. You can customize the whole thing for what you need. >> So to me, the big picture here is, and Swami references this, hey, we are a data company. We started, he talked about books and how that informed them as to, you know, what books to place front and center. Here's the, here's the big picture. In my view, companies need to put data at the core of their business and they haven't. They've generally put humans at the core of their business, and data and now machine learning are at the, at the outside, in the periphery. Amazon, Google, Microsoft, Facebook have put data at their core. So the question is how do incumbent companies, and you mentioned some, Toyota, Capital One, Bristol Myers Squibb, I don't know, are those data companies, you know, we'll see, but the challenge is most companies don't have the resources, as you well know, Andy, to actually implement what Google and Facebook and others have. >> So how are they gonna do that? Well, they're gonna buy it, right? So are they gonna build it with tools, that's kind of like you said the Amazon approach, or are they gonna buy it from Microsoft and Google? I pulled some ETR data to say, okay, who are the top companies that are showing up in terms of spending? Who's spending with whom? AWS number one, Microsoft number two, Google number three, Databricks number four, just in terms of, you know, presence. And then it falls down to DataRobot, Anaconda, Dataiku, Oracle popped up actually cuz they're embedding a lot of AI into their products and, and of course IBM, and then a lot of smaller companies. But do companies, do customers generally have the resources to do what it takes to implement AI into applications and into workflows? >> So a couple of things on that. One is when it comes to, I mean it's, it's no surprise that the, the top three are the hyperscalers, because they all want you to bring your business to them to run the specific workloads. And the next biggest workloads, as he was saying in his keynote, are two things. One is the AI/ML workloads and the other one is the, the heavy unstructured workloads that he was talking about. 80%, 90% of the data that's coming off is unstructured. So how do you analyze that? Such as the geospatial data he was talking about, the volumes of data you need to analyze, the, the deep neural nets you've got to use, only hyperscalers can do it, right? So that's no wonder all of them are on top. For the data, one of the things they announced, which not many people paid attention to, there was a zero-ETL capability that they talked about. What that does is a little bit of a game changing moment, in a sense that you don't have to, for example, if you were to train on the data, if the data is distributed everywhere and you have to bring it all together to integrate it, to do that, it's a lot of work to do the ETL.
So by taking Amazon Aurora and then Redshift, combining them with zero or no ETL, and then having Apache Spark applications run on top for analytical applications, ML workloads, that's huge. So you don't have to move around the data, use the data where it is. >> I, I think you said it, they're basically filling holes, right? Yeah. They created this, you know, suite of tools, let's call it. You might say it's a mess. It's not a mess because they're really powerful, but they're not well integrated, and now they're starting to take on the seams, as I say. >> Well yeah, it's a great point. And I would double down and say, look it, I think that boring is good. You know, we had that phase in the Kubernetes hype cycle where it got boring and that was kind of like, boring is good. Boring means we're getting better, we're invisible. That's infrastructure that's in the weeds, that's in-between-the-toes details. It's the stuff that, you know, people, we have to get done. So, you know, you look at their 40 new data sources with Data Wrangler, 50 new AppFlow connectors, Redshift auto-copy, this is boring. Good important shit, Dave. The governance, you gotta get it, and the governance is gonna be key. So, so to me, this may not jump off the page. Adam's keynote also felt a little bit of, we gotta get these gaps done in a good way. So I think that's a very positive sign. >> Now going back to the bigger picture, I think the real question is can there be another independent data cloud? And that's, to me, what I tried to get at in my story, and your Breaking Analysis kind of hit a home run on this, is there's an interesting opportunity for an independent data cloud. Meaning something that isn't AWS, that isn't Google, isn't one of the big three, that could sit in between. And so let me give you an example. I had a conversation last night with a bunch of ex-Amazonian engineering teams that left. The conversation was interesting, Dave. They were like talking, well Databricks and Snowflake are basically batch, okay, not transactional. And you look at Aerospike, I can see their booth here. Transactional databases are hot right now. Streaming data is different. Confluent is different than Databricks. Is Databricks good at hosting? >> No, Amazon's better. So you start to see these kinds of questions come up where, you know, Databricks is great, but maybe not good for this, that and the other thing. So you start to see the formation of swim lanes, or visibility into where people might sit in the ecosystem, but what came out was transactional, yep, and batch, the relationship there, and streaming, real time, versus, you know, the transactional data. So you're starting to see these new things emerge. Andy, what do you, what's your take on this? You're following this closely. This seems to be the alpha nerd conversation and it all points to who's gonna have the best data cloud, say data superclouds, I call it. What's your take? >> Yes, the data cloud is important as well. But also the computation that goes on top of it too, right? Because when, when the data is like unstructured data, and it's that much of a huge data set, it's going to be hard to do that with low, you know, compute power. But going back to your data point, the training of the AI/ML models requires the batch data, right? That's when you need all the, the historical data to train your models. And then after that, when you do inference on it, that's where you need the streaming real time data that's available to you too, so you can make an inference.
One of the things, what, what they also announced, which is somewhat interesting, is you saw that they have like 700 different instances geared towards every single workload. And there are some of them that very specifically run on Amazon's new chips, the, the Inferentia 2 and Trainium Trn1 chips, that basically not only have specific instances but also run on a high powered chip. And then if you have the data to support that, both the training as well as the inference, the efficiency, again, those numbers have to be proven. They claim that it could be anywhere between 40 to 60% faster. >> Well, so a couple things. You're definitely right. I mean Snowflake started out as a data warehouse that was simpler, and it's not architected, you know, in its first wave to do real time inference, which is not, now how, how could they? The other, second point is Snowflake's two or three years ahead when it comes to governance, data sharing. I mean, Amazon's doing what it always does. It's copying, you know, it's customer driven. Cuz they probably walk into an account and they say, hey look, what's Snowflake doing for us? This stuff's kicking ass. And they go, oh, that's a good idea, let's do that too. You saw that with separating compute from storage, which is their tiering. You saw it today with extending data sharing, Redshift data sharing. So how does Snowflake and Databricks approach this? They deal with the ecosystem. They bring in ecosystem partners, they bring in open source tooling and that's how they compete. I think there's unquestionably an opportunity for a data cloud. >> Yeah, I think, I think the supercloud conversation, and then, you know, Sky Cloud with the Berkeley paper and other folks talking about this kind of pre-multi-cloud era. I mean that's what I would call us right now. We are, we're kind of in the pre era of multi-cloud, which by the way is not even yet defined. I think people use that term, Dave, to say, you know, some sort of magical thing that's happening. Yeah. People have multiple clouds. They got, they, they end up there by default, not by design, as Dell likes to say. Right? And they gotta deal with it. So it's more of, they're inheriting multiple cloud environments. It's not necessarily what they want in the situation. So to me that is a big, big issue. >> Yeah, I mean, again, going back to the Snowflake and Databricks announcements, they're a data company. So they, that's how they made their mark in the market, saying that, you know, I do all those things, therefore you have, I have to have your data, because it's seamless data. And, and Amazon is catching up with that with a lot of the announcements they made. How far it's gonna get traction, you know, remains to be seen, I'd say. >> Yeah, I mean to me, to me there's no doubt about it, Dave. I think, I think what Swami is doing, if Amazon can corner the market on out-of-the-box ML and AI capabilities so that people can make it easier, that's gonna be, at the end of the day, the telltale sign. Can they fill in the gaps? Again, boring is good. The competition, I don't know, I mean, I'm not following the competition. Andy, this is a real question mark for me. I don't know where they stand. Are they more comprehensive? Are they deeper? Do they have deeper services? I mean, obviously the show points to all the, the different, you know, capabilities. Where, where, where does Amazon stand? What's the process?
>> So what, particularly when it comes to the models, they're going at it at a different angle, that, you know, I will help you create the models, we talked about the zero-ETL and the whole data piece. We'll get the data sources in, we'll create the model, we'll move the, the whole model. We are talking about the MLOps teams here, right? And they have the whole functionality that, that they built in over the years. So essentially they want to become the platform where, when you come in, I'm the only platform you would use, from the model training to deployment, to inference, to model versioning, to management, the MLOps, and that's the angle they're trying to take. So it's, it's a one-source platform. >> What about this idea of technical debt? Adrian Cockcroft was on yesterday. John, I know you talked to him as well. He said, look, Amazon's Legos. You wanna buy a toy for Christmas, you can go out and buy a toy, or do you wanna build one? If you buy a toy, in a couple years it could break, and what are you gonna do? You're gonna throw it out. But if you, if you, if part of your Lego needs to be extended, you extend it. So, you know, George Gilbert was saying, well, there's a lot of technical debt. Adrian was countering that. Does Amazon have technical debt or is that Lego blocks analogy the right one? >> Well, I talked to him about the debt and one of the things we talked about was what do you optimize for, EC2 APIs or Kubernetes APIs? It depends on what team you're on. If you're on the runtime team, you're gonna optimize for Kubernetes, but EC2 is the resources you want to use. So I think the idea of the 15 years of technical debt, I, I don't believe that. I think the APIs are still hardened. The issue that he brings up that I think is relevant is it's an 'and' situation, not an 'or.' You can have the bag of Legos, which is the primitives, and build a durable application platform, monitor it, customize it, work with it, build it. It's harder, but the outcome is durability and sustainability. Building a toy, having a toy with those Legos glued together for you, you can get to play with it, but it'll break over time. Then you gotta replace it. So there's gonna be a toy business and there's gonna be a Legos business. Make your own. >> So who, who are the toys in AI? >> Well, out of >> The box, and who's outta Legos? >> The, so you're asking about what, what toys Amazon's building? >> Or, yeah, I mean Amazon clearly is Lego blocks. >> If people are gonna have out-of-the-box, >> What about Google? What about Microsoft? Are they basically more, more building toys, more solutions? >> So Google is more of, you know, a building solutions angle, like, you know, I give you an API kind of thing. But, but if it comes to vertical industry solutions, Microsoft is, is, is ahead, right? Because they have, they have had years of industry experience. I mean there are other smaller clouds trying to do that too, IBM being an example, but you know, the, now they are starting to go after the specific industry use cases. They think that through, for example, you know, the medical one we talked about, right? So they want to build the, the health lake, the security lake that they're trying to build, which will be HIPAA compliant, and it'll provide all the, the European regulations, the whole nine yards, and it'll help you, you know, personalize things as you need as well. For example, you know, if you go for a certain treatment, it could analyze you based on your genome profile, saying that, you know, the treatment for this particular person has to be individualized this way. But doing that requires an enormous amount of compute power, right?
So if you do applications like that, you could bring in a lot of that, whether healthcare, finance or what have you, and then make it easy for them to use. >> What's the biggest mistake customers make when it comes to machine intelligence, AI, machine learning? >> So many things, right? I could start out with even the, the model. Basically when you build a model, you, you should be able to figure out how long that model is effective. Because as good as creating a model and, and going to the business and doing things the right way is, there are people that leave the model in place much longer than it's needed. It's hurting your business more than it's helping, you know, it could be things like that. Or you are, you are not building it responsibly, the responsible AI-related things. You, you have a bias in your model. There are so many issues. I, I don't know if I can pinpoint one, but there are many, many issues. Responsible AI, ethical AI. >> All right, well, we'll leave it there. You're watching theCUBE, the leader in high tech coverage, here at day three at re:Invent. I'm John Furrier with Dave Vellante, and Andy joining us here for the critical analysis and breaking down the commentary. We'll be right back with more coverage after this short break.
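As a rough illustration of the "use the data where it is" pattern Andy describes in this segment, analytics running directly against an analytical store fed by a zero-ETL integration rather than a hand-built pipeline, here is a minimal PySpark sketch; the cluster endpoint, credentials and table names are hypothetical:

```python
# Sketch only: query replicated operational data in place instead of building
# a separate ETL pipeline. Endpoint, credentials and tables are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-in-place").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/analytics")  # hypothetical
    .option("dbtable", "public.orders")                               # hypothetical
    .option("user", "analyst")
    .option("password", "***")
    .load()
)

# Analytics or ML feature prep runs directly on the replicated data.
daily_revenue = orders.groupBy("order_date").sum("amount")
daily_revenue.show()
```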
Breaking Analysis: Five Questions About Snowflake’s Pending IPO
>> From theCUBE Studios in Palo Alto in Boston, bringing you data driven insights from theCUBE and ETR. This is breaking analysis with Dave Vellante. >> In June of this year, Snowflake filed a confidential document suggesting that it would do an IPO. Now of course, everybody knows about it, found out about it and it had a $20 billion valuation. So, many in the community and the investment community and so forth are excited about this IPO. It could be the hottest one of the year, and we're getting a number of questions from investors and practitioners and the entire Wiki bond, ETR and CUBE community. So, welcome everybody. This is Dave Vellante. This is "CUBE Insights" powered by ETR. In this breaking analysis, we're going to unpack five critical questions around Snowflake's IPO or pending IPO. And with me to discuss that is Erik Bradley. He's the Chief Engagement Strategists at ETR and he's also the Managing Director of VENN. Erik, thanks for coming on and great to see you as always. >> Great to see you too. Always enjoy being on the show. Thank you. >> Now for those of you don't know Erik, VENN is a roundtable that he hosts and he brings in CIOs, IT practitioners, CSOs, data experts and they have an open and frank conversation, but it's private to ETR clients. But they know who the individual is, what their role is, what their title is, et cetera and it's a kind of an ask me anything. And I participated in one of them this past week. Outstanding. And we're going to share with you some of that. But let's bring up the agenda slide if we can here. And these are really some of the questions that we're getting from investors and others in the community. There's really five areas that we want to address. The first is what's happening in this enterprise data warehouse marketplace? The second thing is kind of a one area. What about the legacy EDW players like Oracle and Teradata and Netezza? The third question we get a lot is can Snowflake compete with the big cloud players? Amazon, Google, Microsoft. I mean they're right there in the heart, in the thick of things there. And then what about that multi-cloud strategy? Is that viable? How much of a differentiator is that? And then we get a lot of questions on the TAM. Meaning the total available market. How big is that market? Does it justify the valuation for Snowflake? Now, Erik, you've been doing this now. You've run a couple VENNs, you've been following this, you've done some other work that you've done with Eagle Alpha. What's your, just your initial sort of takeaway from all this work that you've been doing. >> Yeah, sure. So my first take on Snowflake was about two and a half years ago. I actually hosted them for one of my VENN interviews and my initial thought was impressed. So impressed. They were talking at the time about their ability to kind of make ease of use of a multi-cloud strategy. At the time although I was impressed, I did not expect the growth and the hyper growth that we have seen now. But, looking at the company in its current iteration, I understand where the hype is coming from. I mean, it's 12 and a half billion private valuation in the last round. The least confidential IPO (laughs) anyone's ever seen (Dave laughs) with a 15 to $20 billion valuation coming out, which is more than Teradata, Margo and Cloudera combined. It's a great question. So obviously the success to this point is warranted, but we need to see what they're going to be able to do next. 
So I think the agenda you laid out is a great one and I'm looking forward to getting into some of those details. >> So let's start with what's happening in the marketplace and let's pull up a slide that I very much love to use. It's the classic X-Y. On the vertical axis here we show net score. And remember folks, net score is an indicator of spending momentum. ETR every quarter does like a clockwork survey where they're asking people, "Essentially are you spending more or less?" They subtract the less from the more and comes up with a net score. It's more complicated than, but like NPS, it's a very simple and reliable methodology. That's the vertical axis. And the horizontal axis is what's called market share. Market share is the pervasiveness within the data set. So it's calculated by the number of mentions of the vendor divided by the number of mentions within that sector. And what we're showing here is the EDW sector. And we've pulled out a few companies that I want to talk about. So the big three, obviously Microsoft, AWS and Google. And you can see Microsoft has a huge presence far to the right. AWS, very, very strong. A lot of Redshift in there. And then they're pretty high on the vertical axis. And then Google, not as much share, but very solid in that. Close to 60% net score. And then you can see above all of them from a vertical standpoint is Snowflake with a 77.5% net score. You can see them in the upper right there in the green. One of the highest Erik in the entire data set. So, let's start with some sort of initial comments on the big guys and Snowflakes. Your thoughts? >> Sure. Just first of all to comment on the data, what we're showing there is just the data warehousing sector, but Snowflake's actual net score is that high amongst the entire universe that we follow. Their data strength is unprecedented and we have forward-looking spending intention. So this bodes very well for them. Now, what you did say very accurately is there's a difference between their spending intentions on a net revenue level compared to AWS, Microsoft. There no one's saying that this is an apples-to-apples comparison when it comes to actual revenue. So we have to be very cognizant of that. There is domination (laughs) quite frankly from AWS and from Azure. And Snowflake is a necessary component for them not only to help facilitate a multi-cloud, but look what's happening right now in the US Congress, right? We have these tech leaders being grilled on their actual dominance. And one of the main concerns they have is the amount of data that they're collecting. So I think the environment is right to have another player like this. I think Snowflake really has a lot of longevity and our data is supporting that. And the commentary that we hear from our end users, the people that take the survey are supporting that as well. >> Okay, and then let's stay on this X-Y slide for a moment. I want to just pull out a couple of other comments here, because one of the questions we're asking is Whither, the legacy EDW players. So we've got in here, IBM, Oracle, you can see Teradata and then Hortonworks and MapR. We're going to talk a little bit about Hortonworks 'cause it's now Cloudera. We're going to talk a little bit about Hadoop and some of the data lakes. So you can see there they don't have nearly the net score momentum. Oracle obviously has a huge install base and is investing quite frankly in R&D and do an Exadata and it has its own cloud. 
So, it's got a lock on it's customers and if it keeps investing and adding value, it's not going away. IBM with Netezza, there's really been some questions around their commitment to that base. And I know that a lot of the folks in the VENNs that we've talked to Erik have said, "Well, we're replacing Netezza." Frank Slootman has been very vocal about going after Teradata. And then we're going to talk a little bit about the Hadoop space. But, can you summarize for us your thoughts in your research and the commentary from your community, what's going on with the legacy guys? Are these guys cooked? Can they hang on? What's your take? >> Sure. We focus on this quite a bit actually. So, I'm going to talk about it from the data perspective first, and then we'll go into some of the commentary and the panel. You even joined one yesterday. You know that it was touched upon. But, first on the data side, what we're noticing and capturing is a widening bifurcation between these cloud native and the legacy on-prem. It is undeniable. There is nothing that you can really refute. The data is concrete and it is getting worse. That gap is getting wider and wider and wider. Now, the one thing I will say is, nobody's going to rip out their legacy applications tomorrow. It takes years and years. So when you look at Teradata, right? Their market cap's only 2 billion, 2.3 billion. How much revenue growth do they need to stay where they are? Not much, right? No one's expecting them to grow 20%, which is what you're seeing on the left side of that screen. So when you look at the legacy versus the cloud native, there is very clear direction of what's happening. The one thing I would note from the data perspective is if you switched from net score or adoptions and you went to flat spending, you suddenly see Oracle and Teradata move over to that left a little bit, because again what I'm trying to say is I don't think they're going to catch up. No, but also don't think they're going away tomorrow. That these have large install bases, they have relationships. Now to kind of get into what you were saying about each particular one, IBM, they shut down Netezza. They shut it down and then they brought it back to life. How does that make you feel if you're the head of data architecture or you're DevOps and you're trying to build an application for a large company? I'm not going back to that. There's absolutely no way. Teradata on the other hand is known to be incredibly stable. They are known to just not fail. If you need to kind of re-architect or you do a migration, they work. Teradata also has a lot of compliance built in. So if you're a financials, if you have a regulated business or industry, there's still some data sets that you're not going to move up to the cloud. Whether it's a PII compliance or financial reasons, some of that stuff is still going to live on-prem. So Teradata is still has a very good niche. And from what we're hearing from our panels, then this is a direct quote if you don't mind me looking off screen for one second. But this is a great one. Basically said, "Teradata is the only one from the legacy camp who is putting up a fight and not giving up." Basically from a CIO perspective, the rest of them aren't an option anymore. But Teradata is still fighting and that's great to hear. They have their own data as a service offering and listen, they're a small market cap compared to these other companies we're talking about. But, to summarize, the data is very clear. 
There is a widening bifurcation between the two camps. I do not think legacy will catch up. I think all net new workloads are moving to data as a service, moving to cloud native, moving to hosted, but there are still going to be some existing legacy on-prem applications that will be supported with these older databases. And of those, Oracle and Teradata are still viable options. >> I totally agree with you and my colleague David Floyd is actually quite high on Teradata Vantage because he really does believe that a key component, we're going to talk about the TAM in a minute, but a key component of the TAM he believes must include the on-premises workloads. And Frank Slootman has been very clear, "We're not doing on-prem, we're not doing this halfway house." And so that's an opportunity for companies like Teradata, certainly Oracle, I would put it in that camp, is putting up a fight. Vertica is another one. They're very small, but another one that's sort of battling it out from the old MPP world. But that's great. Let's go into some of the specifics. Let's bring up here some of the specific commentary that we've curated here from the roundtables. I'm going to go through these and then ask you to comment. The first one is just, I mean, people are obviously very excited about Snowflake. It's easy to use, the whole thing zero to Snowflake in 90 minutes, but Snowflake is synonymous with cloud-native data warehousing. There are no equals. We heard that a lot from your VENN panelists. >> We certainly did. There was even more euphoria around Snowflake than I expected when we started hosting this series of data warehousing panels. And this particular gentleman that said that happens to be the global head of data architecture for a fortune 100 financials company. And you mentioned earlier that we did a report alongside Eagle Alpha. And we noticed that among fortune 100 companies that are also using the big three public cloud companies, Snowflake is growing market share faster than anyone else. They are positioned in a way where even if you're aligned with Azure, even if you're aligned with AWS, if you're a large company, they are gaining share right now. So that particular gentleman's comments were very interesting. He also made a comment that said, "Snowflake is the one who championed the idea that data warehousing is not dead yet. Use that old Monty Python line, 'Not dead yet.'" And back in the day when Hadoop came along and the data lakes turned into a data swamp and everyone said, "We don't need warehousing anymore." Well, that turned out to be a head fake, right? Hadoop was an interesting technology, but it's a complex technology. And it ended up not really working the way people wanted it to. I think Snowflake came in at that point at an opportune time and said, "No, data warehousing isn't dead. We just have to separate the compute from the storage layer and look at what I can do. That increases flexibility, security. It gives you that ability to run across multi-cloud." So honestly the commentary has been nothing but positive. We can get into some of the commentary about people thinking that there's competition catching up to what they do, but there is no doubt that right now Snowflake is the name when it comes to data as a service. >> The other thing we heard a lot was ETL is going to get completely disrupted, you sort of embedded ETL. You heard one panelist say, "Well, it's interesting to see that guys like Informatica are talking about how fast they can run inside a Snowflake."
But Snowflake is making that easy. That data prep is sort of part of the package. And so that does not bode well for ETL vendors. >> It does not, right? So ETL is a legacy of on-prem databases and even when Hadoop came along, it still needed that extra layer to kind of work with the data. But this is really, really disrupting them. Now to Snowflake's credit, they partner well. All the ETL players are partnered with Snowflake, they're trying to play nice with them, but the writing's on the wall: as more and more of these applications and workloads move to the cloud, you don't need the ETL layer. Now, obviously that's going to affect Talend and Informatica the most. We had a recent comment that said, this was a CIO who basically said, "The most telling thing about the ETL players right now is every time you speak to them, all they talk about is how they work in a Snowflake architecture." That's their only metric that they talk about right now. And he said, "That's very telling." He basically said it's their existential identity to be part of Snowflake. If they're not, they don't exist anymore. So it was interesting to have sort of a philosophical comment brought up in one of my roundtables. But that's how important playing nice and finding a niche within this new data as a service is for ETL, but to be quite honest, they might be going the same way of, "Okay, let's figure out our niche on the on-prem workloads that are still there." I think over time we might see them maybe as an M&A possibility, whether it's Snowflake or one of these new up-and-comers, kind of bring them in and sort of take some of the technology that's useful and layer it in. But as a large market cap, solo existing niche, I just don't know how long ETL is for this world. >> Now, yeah. I mean, you're right that if it wasn't for the marketing, they're not fighting fashion. But >> No. >> really there are some challenges there. Now, there were some contrarians in the panel and they signaled some potential icebergs ahead. And I guarantee you're going to see this in Snowflake's red herring when we actually get it. Like we're going to see all the risks. One of the comments, I'll mention the two and then we can talk about it. "Their engineering advantage will fade over time." Essentially they're saying that people are going to copycat and we've seen that. And the other point is, "Hey, we might see some similar things that happened to Hadoop." The public cloud players giving away these offerings at zero cost. Essentially the marginal cost of adding another service is near zero. So the cloud players will use their heft to compete. Your thoughts? >> Yeah, first of all one of the reasons I love doing panels, right? Because we had three gentlemen on this panel that all had nothing but wonderful things to say. But you always get one. And this particular person is a CTO of a well-known online public travel agency. We'll put it that way. And he said, "I'm going to be the contrarian here. I have seven different technologies from private companies that do the same thing that I'm evaluating." So that's the pressure from behind, right? The technology, they're going to catch up. Right now Snowflake has the best engineering, which interestingly enough they took a lot of that engineering from IBM and Teradata if you actually go back and look at it, which was brought up in our panel as well. He said, "However, the engineering will catch up. They always do."
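The "run inside a Snowflake architecture" pattern described above, often called ELT rather than ETL, is easy to picture in code: land the raw data first, then push the transformation down into the warehouse as SQL. A rough sketch, assuming the Snowflake Python connector and hypothetical credentials, stage, and table names:

```python
# Minimal ELT sketch: load raw data first, then transform inside the warehouse.
# Assumes the Snowflake Python connector (snowflake-connector-python), a stage
# named my_stage, and a raw table with a single VARIANT column V. All names and
# credentials are hypothetical; this is not a production pipeline.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
    database="RAW",
    schema="PUBLIC",
)
cur = conn.cursor()

# 1. Land the raw events as-is (the extract and load steps).
cur.execute("""
    COPY INTO RAW.PUBLIC.ORDERS_RAW
    FROM @my_stage/orders/
    FILE_FORMAT = (TYPE = 'JSON')
""")

# 2. Transform in place with the warehouse's own compute (the "T" in ELT).
cur.execute("""
    CREATE OR REPLACE TABLE ANALYTICS.PUBLIC.DAILY_REVENUE AS
    SELECT V:order_date::date AS order_date,
           SUM(V:amount::number) AS revenue
    FROM RAW.PUBLIC.ORDERS_RAW
    GROUP BY 1
""")

cur.close()
conn.close()
```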
Now from the other side they're getting squeezed because the big cloud players just say, "Hey, we can do this too. I can bundle it with all the other services I'm giving you and I can squeeze you on price. Pretty much give it away at cost." So I do think that there is a very valid concern. When you come out with a $20 billion IPO valuation, you need to warrant that. And when you see competitive pressures from both sides, from private emerging technologies and from the more dominant public cloud players, you're going to get squeezed there a little bit. And if pricing gets squeezed, it's going to be very, very important for Snowflake to continue to innovate. That comment you brought up about possibly being the next Cloudera was certainly the best sound bite that I got. And I'm going to use it as clickbait in future articles, because I think everyone who starts looking to buy Snowflake stock and they see that, they're going to need to take a look. But I would take that with a grain of salt. I don't think that's happening anytime soon, but what that particular CTO was referring to was if you don't innovate, the technology itself will become commoditized. And he believes that this technology will become commoditized. So therefore Snowflake has to continue to innovate. They have to find other layers to bring in. Whether that's through their massive war chest of cash they're about to have for M&A, whether that's them buying an analytics company, whether that's them buying an ETL layer, finding a way to provide more value as they move forward is going to be very important for them to justify this valuation going forward. >> And I want to comment on that. The Cloudera, Hortonworks, MapRs, Hadoop, et cetera. I mean, there are dramatic differences obviously. I mean, that whole space was so hard, very difficult to stand up. You needed science project guys in lab coats to do it. It was very services intensive. As well, companies like Cloudera had to fund all these open source projects and it really squeezed their R&D. I think Snowflake is much more focused and you mentioned some of the background of their engineers, of course Oracle guys as well. However, you will see Amazon's going to trot out a ton of customers using their RA3 managed storage and their flash. I think it's the DC2 piece. They have a ton of action in the marketplace because it's just so easy. It's interesting, one of the comments, you asked this yesterday, was with regard to separating compute from storage, which of course Snowflake basically invented, it was one of their claims to fame. The comment was what AWS has done to separate compute from storage for Redshift is largely a bolt on. I thought that was an interesting comment. I've had some other comments. My friend George Gilbert said, "Hey, despite claims to the contrary, AWS still hasn't separated storage from compute. What they have is really primitive." We've got to dig into that some more, but you're seeing some data points that suggest there's copycatting going on. May not be as functional, but at the same time, Erik, like I was saying, good enough is maybe good enough in this space. >> Yeah, and especially with the enterprise, right? You see what Microsoft has done. Their technology is not as good as all the niche players, but it's good enough and I already have a Microsoft license. So, (laughs) you know, why am I going to move off of it?
But I want to get back to the comment you mentioned too about that particular gentleman who made that comment about RedShift, their separation is really more of a bolt on than a true offering. It's interesting because I know who these people are behind the scenes and he has a very strong relationship with AWS. So it was interesting to me that in the panel yesterday he said he switched from Redshift to Snowflake because of that and some other functionality issues. So there is no doubt from the end users that are buying this. And he's again a fortune 100 financial organization. Not the same one we mentioned. That's a different one. But again, a fortune 100 well known financials organization. He switched from AWS to Snowflake. So there is no doubt that right now they have the technological lead. And when you look at our ETR data platform, we have that adoption reasoning slide that you show. When you look at the number one reason that people are adopting Snowflake is their feature set of technological lead. They have that lead now. They have to maintain it. Now, another thing to bring up on this to think about is when you have large data sets like this, and as we're moving forward, you need to have machine learning capabilities layered into it, right? So they need to make sure that they're playing nicely with that. And now you could go open source with the Apache suite, but Google is doing so well with BigQuery and so well with their machine learning aspects. And although they don't speak enterprise well, they don't sell to the enterprise well, that's changing. I think they're somebody to really keep an eye on because their machine learning capabilities that are layered into the BigQuery are impressive. Now, of course, Microsoft Azure has Databricks. They're layering that in, but this is an area where I think you're going to see maybe what's next. You have to have machine learning capabilities out of the box if you're going to do data as a service. Right now Snowflake doesn't really have that. Some of the other ones do. So I had one of my guest panelist basically say to me, because of that, they ended up going with Google BigQuery because he was able to run a machine learning algorithm within hours of getting set up. Within hours. And he said that that kind of capability out of the box is what people are going to have to use going forward. So that's another thing we should dive into a little bit more. >> Let's get into that right now. Let's bring up the next slide which shows net score. Remember this is spending momentum across the major cloud players and plus Snowflake. So you've got Snowflake on the left, Google, AWS and Microsoft. And it's showing three survey timeframes last October, April 20, which is right in the middle of the pandemic. And then the most recent survey which has just taken place this month in July. And you can see Snowflake very, very high scores. Actually improving from the last October survey. Google, lower net scores, but still very strong. Want to come back to that and pick up on your comments. AWS dipping a little bit. I think what's happening here, we saw this yesterday with AWS's results. 30% growth. Awesome. Slight miss on the revenue side for AWS, but look, I mean massive. And they're so exposed to so many industries. So some of their industries have been pretty hard hit. Microsoft pretty interesting. A little softness there. 
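The "machine learning algorithm within hours" anecdote above maps to the BigQuery ML pattern, where a model is trained and scored with SQL statements rather than a separate ML stack. A rough sketch using the BigQuery Python client, with hypothetical project, dataset, and column names:

```python
# Rough sketch of the "ML out of the box" pattern described above, using
# BigQuery ML: the model is trained with a SQL statement, no separate ML stack.
# Project, dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumes default credentials

# Train a churn classifier directly over warehouse tables.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, support_tickets, churned
    FROM `my_dataset.customers`
""").result()

# Score new rows with the same SQL surface.
rows = client.query("""
    SELECT customer_id, predicted_churned
    FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                    (SELECT * FROM `my_dataset.new_customers`))
""").result()

for row in rows:
    print(row.customer_id, row.predicted_churned)
```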
But one of the things I wanted to pick up on Erik, when you're talking about Google and BigQuery and its ML out of the box was what we heard from a lot of the VENN participants. There's no question about it that Google technically I would say is one of Snowflake's biggest competitors because it's cloud native. Remember >> Yep. >> AWS did a one-time license deal with ParAccel and had to sort of refactor the thing to be cloud native. And of course we know what's happening with Microsoft. They basically were on-prem and then they put stuff in the cloud and then all the updates happen in the cloud. And then they pushed to on-prem. But they have that what Frank Slootman calls that halfway house, but BigQuery no question technically is very, very solid. But again, you see Snowflake right now anyway outpacing these guys in terms of momentum. >> Snowflake is outpacing everyone (laughs) across our entire survey universe. It really is impressive to see. And one of the things that they have going for them is they can connect all three. It's that multi-cloud ability, right? That portability that they bring to you is such an important piece for today's modern CIOs and data architects. They don't want vendor lock-in. They are afraid of vendor lock-in. And this ability to make their data portable and to do that with ease and the flexibility that they offer is a huge advantage right now. However, I think you're a hundred percent right. Google has been so focused on the engineering side and never really focused on the enterprise sales side. That is why they're playing catch up. I think they can catch up. They're bringing in some really important enterprise salespeople with experience. They're starting to learn how to talk to the enterprise, how to sell, how to support. And nobody can really doubt their engineering. How many open source projects have they given us, right? They invented Kubernetes and the entire container space. No one's really going to compete with them on that side if they learn how to sell it and support it. Yeah, right now they're behind. They're a distant third. Don't get me wrong. From a pure hosted ability, AWS is number one. Microsoft is number two. Sometimes it looks like it's number one, but you have to recognize that a lot of that is simply because of their hosted 365. It's a SaaS app. It's not a true cloud type of infrastructure as a service. But Google is a distant third, but their technology is really, really great. And their ability to catch up is there. And like you said, in the panels we were hearing a lot about their machine learning capability right out of the box. And that's where this is going. What's the point of having this huge data if you're not going to be supporting it on new application architectures? And all of those applications require machine learning. >> Awesome. And I totally agree with what you're saying about Google. They just haven't figured out how to sell to the enterprise yet. And a hundred percent AWS has the best cloud. I mean, hands down. But a very, very competitive market as we heard yesterday in front of Congress. Now we're on the point about, can Snowflake compete with the big cloud players? I want to show one more data point. So let's bring up, this is the same chart as we showed before, but it's new adoptions. And this is really telling. >> Yeah. >> You can see Snowflake with 34% in the yellow, new adoptions, down, yes, from previous surveys, but still significantly higher than the other players.
Interesting to see Google showing momentum on new adoptions, AWS down on new adoptions. And again, exposed to a lot of industries that have been hard hit. And Microsoft actually quite low on new adoption. So this is very impressive for Snowflake. And I want to talk about the multi-cloud strategy now Erik. This came up a lot. The VENN participants who are sort of fans of Snowflake said three things: It was really the flexibility, the security which is really interesting to me. And a lot of that had to do with the flexibility. The ability to easily set up roles and not have to waste a lot of time wrangling. And then the third was multi-cloud. And that was really something that came through heavily in the VENN. Didn't it? >> It really did. And again, I think it just comes down to, I don't think you can ever overstate how afraid these guys are of vendor lock-in. They can't have it. They don't want it. And it's best practice to make sure your sensitive information is being kind of spread out a little bit. We all know that people don't trust Bezos. So if you're in certain industries, you're not going to use AWS at all, right? So yeah, this ability to have your data portability through multi-cloud is the number one reason I think people start looking at Snowflake. And to go to your point about the adoptions, it's very telling and it bodes well for them going forward. Most of the things that we're seeing right now are net new workloads. So let's go again back to the legacy side that we were talking about, the Teradatas, IBMs, Oracles. They still have the monolithic applications and the data that needs to support that, right? Like an old ERP type of thing. But anyone who's now building a new application, bringing something new to market, it's all net new workloads. There is no net new workload that is going to go to SAP or IBM. It's not going to happen. The net new workloads are going to the cloud. And that's why when you switch from net score to adoption, you see Snowflake really stand out because this is about new adoption for net new workloads. And that's really where they're driving everything. So I would just say that as this continues, as data as a service continues, I think Snowflake's only going to gain more and more share for all the reasons you stated. Now get back to your comment about security. I was shocked by that. I really was. I did not expect these guys to say, "Oh, no. Snowflake enterprise security not a concern." So two panels ago, a gentleman from a fortune 100 financials said, "Listen, it's very difficult to get us to sign off on something for security. Snowflake is past it, it is enterprise ready, and we are going full steam ahead." Once they got that go ahead, there was no turning back. We gave it to our DevOps guys, we gave it to everyone and said, "Run with it." So, when a company that's big, I believe their fortune rank is 28. (laughs) So when a company that big says, "Yeah, you've got the green light. That we were okay with the internal compliance aspect, we're okay with the security aspect, this gives us multi-cloud portability, this gives us flexibility, ease of use." Honestly there's a really long runway ahead for Snowflake. >> Yeah, so the big question I have around the multi-cloud piece and I totally and I've been on record saying, "Look, if you're going looking for an agnostic multi-cloud, you're probably not going to go with the cloud vendor." 
(laughs) But I've also said that I think multi-cloud to date anyway has largely been a symptom as opposed to a strategy, but that's changing. But to your point about lock-in and also I think people are maybe looking at doing things across clouds, but I think that certainly it expands Snowflake's TAM and we're going to talk about that because they support multiple clouds and they're going to be the best at that. That's a mandate for them. The question I have is how much complex joining are you going to be doing across clouds? And is that something that is just going to be too latency intensive? Is that really Snowflake's expertise? You're really trying to build that data layer. You're probably going to maybe use some kind of Postgres database for that. >> Right. >> I don't know. I need to dig into that, but that would be an opportunity from a TAM standpoint. I just don't know how real that is. >> Yeah, unfortunately I'm going to just be honest with this one. I don't think I have great expertise there and I wouldn't want to lead anyone in a wrong direction. But from what I've heard from some of my VENN interview subjects, this is happening. So the data portability needs to be agnostic to the cloud. I do think that when you're saying, are there going to be real complex kind of workloads and applications? Yes, the answer is yes. And I think a lot of that has to do with some of the container architecture as well, right? If I can just pull data from one spot, spin it up for as long as I need and then just get rid of that container, that ethereal layer of compute. It doesn't matter where the cloud lies. It really doesn't. I do think that multi-cloud is the way of the future. I know that the container workloads right now in the enterprise are still very small. I've heard people say like, "Yeah, I'm kicking the tires. We got 5%." That's going to grow. And if Snowflake can make themselves an integral part of that, then yes. I think that's one of those things where, I remember the guy said, "Snowflake has to continue to innovate. They have to find a way to grow this TAM." This is an area where they can do so. I think you're right about that, but as far as my expertise, on this one I'm going to be honest with you and say, I don't want to answer incorrectly. So you and I need to dig in a little bit on this one. >> Yeah, as it relates to question four, what's the viability of Snowflake's multi-cloud strategy? I'll say unquestionably supporting multiple clouds, very viable. Whether or not portability across clouds, multi-cloud joins, et cetera, TBD. So we'll keep digging into that. The last thing I want to focus on here is the last question, does Snowflake's TAM justify its $20 billion valuation? And you think about the data pipeline. You go from data acquisition to data prep. I mean, that really is where Snowflake shines. And then of course there's analysis. You've got to bring in BI or AI and ML tools. That's not Snowflake's strength. And then you're obviously preparing that, serving that up to the business, visualization. So there's potential adjacencies that they could get into that they may or may not decide to. But so we put together this next chart which is kind of the TAM expansion opportunity. And I just want to briefly go through it. We published this stuff so you can go and look at all the fine print, but it kind of starts with the data lake disruption. You called it the data swamp before. The Hadoop no-schema-on-write, right?
Basically the ROI of Hadoop became reduction of investment as my friend Abby Meadow would say. But so they're kind of disrupting that data lake which really was a failure. And then really going after that enterprise data warehouse which is kind of I have it here as a 10 billion. It's actually bigger than that. It's probably more like a $20 billion market. I'll update this slide. And then really what Snowflake is trying to do is be data as a service. A data layer across data stores, across clouds, really make it easy to ingest and prepare data and then serve the business with insights. And then ultimately this huge TAM around automated decision making, real-time analytics, automated business processes. I mean, that is potentially an enormous market. We got a couple of hundred billion. I mean, just huge. Your thoughts on their TAM? >> I agree. I'm not worried about their TAM and one of the reasons why as I mentioned before, they are coming out with a whole lot of cash. (laughs) This is going to be a red hot IPO. They are going to have a lot of money to spend. And look at their management team. Who is leading the way? A very successful, wise, intelligent, acquisitive type of CEO. I think there is going to be M&A activity, and I believe that M&A activity is going to be 100% for the mindset of growing their TAM. The entire world is moving to data as a service. So let's take as a backdrop. I'm going to go back to the panel we did yesterday. The first question we asked was, there was an understanding or a theory that when the virus pandemic hit, people wouldn't be taking on any sort of net new architecture. They're like, "Okay, I have Teradata, I have IBM. Let's just make sure the lights are on. Let's stick with it." Every single person I've asked, they're just now eight different experts, said to us, "Oh, no. Oh, no, no." There is the virus pandemic, the shift from work from home. Everything we're seeing right now has only accelerated and advanced our data as a service strategy in the cloud. We are building for scale, adopting cloud for data initiatives. So, across the board they have a great backdrop. So that's going to only continue, right? This is very new. We're in the early innings of this. So for their TAM, that's great because that's the core of what they do. Now on top of it you mentioned the type of things about, yeah, right now they don't have great machine learning. That could easily be acquired and built in. Right now they don't have an analytics layer. I for one would love to see these guys talk to Alteryx. Alteryx is red hot. We're seeing great data and great feedback on them. If they could do that business intelligence, that analytics layer on top of it, the entire suite as a service, I mean, come on. (laughs) Their TAM is expanding in my opinion. >> Yeah, your point about their leadership is right on. And I interviewed Frank Slootman right in the heart of the pandemic >> So impressed. >> and he said, "I'm investing in engineering almost sight unseen. More circumspect around sales." But I will caution people. That a lot of people I think see what Slootman did with ServiceNow. And he came into ServiceNow. I have to tell you. It was they didn't have their unit economics right, they didn't have their sales model and marketing model. He cleaned that up. Took it from 120 million to 1.2 billion and really did an amazing job. People are looking for a repeat here. This is a totally different situation. 
ServiceNow drove a truck through BMC's install base with IT help desk and then created this brilliant TAM expansion, that land-and-expand model. This is much different here. And Slootman also told me that he's a situational CEO. He doesn't have a playbook. And so that's what is most impressive and interesting about this. He's now up against the biggest competitors in the world: AWS, Google and Microsoft and dozens of other smaller startups that have raised a lot of money. Look at a company like Yellowbrick. They've raised I don't know $180 million. They've got a great team. Google, IBM, et cetera. So it's going to be really, really fun to watch. I'm super excited, Erik, but I'll tell you the data right now suggests they've got a great tailwind and if they can continue to execute, this is going to be really fun to watch. >> Yeah, certainly. I mean, when you come out and you are as impressive as Snowflake is, you get a target on your back. There's no doubt about it, right? So we said that they basically created data as a service. That's going to invite competition. There's no doubt about it. And Yellowbrick is one that came up in the panel yesterday, one of our CIOs was doing a proof of concept with them. We had about seven others mentioned as well that are startups that are in this space. However, none of them, despite their great valuations and their great funding, are going to have the kind of money and the market lead that Slootman is going to have, which Snowflake has, as this comes out. And what we're seeing in Congress right now with some antitrust scrutiny around the large data that's being collected by AWS, Azure, Google, I'm not going to bet against this guy either. Right now I think he's got a lot of opportunity, there's a lot of additional layers and because he can basically develop this as a suite of services, I think there's a lot of great opportunity ahead for this company. >> Yeah, and I guarantee that he understands well that customer acquisition cost and the lifetime value of the customer, the retention rates. Those are all things that he and Mike Scarpelli, his CFO, learned at ServiceNow. Not learned, perfected. (Erik laughs) Well Erik, really great conversation, awesome data. It's always a pleasure having you on. Thank you so much, my friend. I really appreciate it. >> I appreciate talking to you too. We'll do it again soon. And stay safe everyone out there. >> All right, and thank you for watching everybody, this episode of "CUBE Insights" powered by ETR. This is Dave Vellante, and we'll see you next time. (soft music)
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
IBM | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Frank Slootman | PERSON | 0.99+ |
George Gilbert | PERSON | 0.99+ |
Erik Bradley | PERSON | 0.99+ |
Erik | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Mike Scarpelli | PERSON | 0.99+ |
ORGANIZATION | 0.99+ | |
Oracle | ORGANIZATION | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
David Floyd | PERSON | 0.99+ |
Slootman | PERSON | 0.99+ |
Teradata | ORGANIZATION | 0.99+ |
Abby Meadow | PERSON | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
100% | QUANTITY | 0.99+ |
$180 million | QUANTITY | 0.99+ |
$20 billion | QUANTITY | 0.99+ |
Netezza | ORGANIZATION | 0.99+ |
Palo Alto | LOCATION | 0.99+ |
77.5% | QUANTITY | 0.99+ |
Snowflake | ORGANIZATION | 0.99+ |
20% | QUANTITY | 0.99+ |
10 billion | QUANTITY | 0.99+ |
12 and a half billion | QUANTITY | 0.99+ |
120 million | QUANTITY | 0.99+ |
Oracles | ORGANIZATION | 0.99+ |
one | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
Cloudera | ORGANIZATION | 0.99+ |
Yellowbrick | ORGANIZATION | 0.99+ |
Wikibon Action Item | The Roadmap to Automation | April 27, 2018
>> Hi, I'm Peter Burris and welcome to another Wikibon Action Item. (upbeat digital music) >> Cameraman: Three, two, one. >> Hi. Once again, we're broadcasting from our beautiful Palo Alto studios, theCUBE studios, and this week we've got another great group. David Floyer in the studio with me along with George Gilbert. And on the phone we've got Jim Kobielus and Ralph Finos. Hey, guys. >> Hi there. >> So we're going to talk about something that's going to become a big issue. It's only now starting to emerge. And that is, what will be the roadmap to automation? Automation is going to be absolutely crucial for the success of IT in the future and the success of any digital business. At its core, many people have presumed that automation was about reducing labor. So introducing software and other technologies, we would effectively be able to substitute for administrative, operator, and related labor. And while that is absolutely a feature of what we're talking about, the bigger issue is ultimately is that we cannot conceive of more complex workloads that are capable of providing better customer experience, superior operations, all the other things a digital business ultimately wants to achieve. If we don't have a capability for simplifying how those underlying resources get put together, configured, or organized, orchestrated, and ultimately sustained delivery of. So the other part of automation is to allow for much more work that can be performed on the same resources much faster. It's a basis for how we think about plasticity and the ability to reconfigure resources very quickly. Now, the challenge is this industry, the IT industry has always used standards as a weapon. We use standards as a basis of creating eco systems or scale, or mass for even something as, like mainframes. Where there weren't hundreds of millions of potential users. But IBM was successful at using that as a basis for driving their costs down and approving a superior product. That's clearly what Microsoft and Intel did many years ago, was achieve that kind of scale through the driving more, and more, and more, ultimately, volume of the technology, and they won. But along the way though, each time, each generation has featured a significant amount of competition at how those interfaces came together and how they worked. And this is going to be the mother of all standard-oriented competition. How does one automation framework and another automation framework fit together? One being able to create value in a way that serves another automation framework, but ultimately as a, for many companies, a way of creating more scale onto their platform. More volume onto that platform. So this notion of how automation is going to evolve is going to be crucially important. David Floyer, are APIs going to be enough to solve this problem? >> No. That's a short answer to that. This is a very complex problem, and I think it's worthwhile spending a minute just on what are the component parts that need to be brought together. We're going to have a multi-cloud environment. Multiple private clouds, multiple public clouds, and they've got to work together in some way. And the automation is about, and you've got the Edge as well. So you've got a huge amount of data all across all of these different areas. And automation and orchestration across that, are as you said, not just about efficiency, they're about making it work. Making it able to be, to work and to be available. 
So all of the issues of availability, of security, of compliance, all of these difficult issues are subject to getting this whole environment to be able to work together through a set of APIs, yes, but a lot, lot more than that. And in particular, when you think about it, to me, volume of data is critical, as is who has access to that data. >> Peter: Now, why is that? >> Because if you're dealing with AI and you're dealing with any form of automation like this, the more data you have, the better your models are. And if you can increase that amount of data, as Google shows every day, you will maintain that handle on, that control over, that area. >> So you said something really important, because the implied assumption, and obviously, it's a major feature of what's going on, is that we've been talking about doing more automation for a long time. But what's different this time is the availability of AI and machine learning, for example, >> Right. as a basis for recognizing patterns, taking remedial action or taking predictive action to avoid the need for remedial action. And it's the availability of that data that's going to improve the quality of those models. >> Yes. Now, George, you've done a lot of work around this whole notion of ML for ITOM. What are the kind of different approaches? If there's two ways that we're looking at it right now, what are the two ways? >> So there are two ends of the extreme. One is I want to see end to end what's going on across my private cloud or clouds. As well as if I have different applications in different public clouds. But that's very difficult. You get end-to-end visibility but you have to relax a lot of assumptions about what's where. >> And that's called the-- >> Breadth first. So the pro is end-to-end visibility. The con is you don't know how all the pieces fit together quite as well, so you get less fidelity in terms of diagnosing root causes. >> So you're trying to optimize at a macro level while recognizing that you can't optimize at a micro level. >> Right. Now the other approach, the other end of the spectrum, is depth first. Where you constrain the set of workloads and services that you're building and that you know about, and how they fit together. And then the models, based on the data you collect there, can become so rich that you have very, very high fidelity root cause determination which allows you to do very precise recommendations or even automated remediation. What we haven't figured out how to do yet is marry the depth first with the breadth first. So that you have multiple focused depth first. That's very tricky. >> Now, if you think about how the industry has evolved, we wrote some stuff about what we call, what I call the iron triangle. Which is basically a very tight relationship between specialists in technology. So the people who were responsible for a particular asset, be it storage, or the system, or the network. The vendors, who provided a lot of the knowledge about how that worked, and therefore made that specialist more or less successful and competent. And then the automation technology that that vendor ultimately provided. Now, that was not automation technology that was associated with AI or anything along those lines. It was kind of out of the box, buy our tool, and this is how you're going to automate various workflows or scripts, or whatever else it might be. And every effort to try to break that has been met with screaming because, well, you're now breaking my automation routines.
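One way to picture the depth-first approach described above is a model scoped to a single, well-understood service whose telemetry you fully control. A toy sketch that flags anomalies in one metric stream with a rolling z-score; the window, threshold, and values are illustrative only:

```python
# Toy depth-first ITOM sketch: watch one well-understood metric stream and
# flag anomalies with a rolling z-score. Thresholds and values are illustrative.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window=20, threshold=3.0):
    """Yield (index, value, z) for points far outside the recent window."""
    recent = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(recent) >= 5:
            mu = mean(recent)
            sigma = stdev(recent) or 1e-9
            z = (value - mu) / sigma
            if abs(z) > threshold:
                yield i, value, z
        recent.append(value)

# Illustrative latency series (ms) for a single constrained service.
latencies = [12, 13, 11, 12, 14, 13, 12, 95, 13, 12, 11, 14, 13]
for i, value, z in detect_anomalies(latencies, window=10, threshold=3.0):
    print(f"sample {i}: {value} ms looks anomalous (z={z:.1f})")
```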
So the depth-first approach, even without ML, has been the way that we've done it historically. But, David, you're talking about something different. It's the availability of the data that starts to change that. >> Yeah. >> So are we going to start seeing new compacts put in place between users and vendors and OEMs and a lot of these other folks? And it sounds like it's going to be about access to the data. >> Absolutely. So you're going to start, let's start at the bottom. You've got people who have a particular component, whatever that component is. It might be storage. It might be networking. Whatever that component is. They have products in that area which will be collecting data. And they will need for their particular area to provide a degree of automation. A degree of capability. And they need to do two things. They need to do that optimization and also provide data to other people. So they have to have an OEM agreement not just for the equipment that they provide, but for the data that they're going to give and the data they're going to give back. The automation of the data, for example, going up and the availability of data to help themselves. >> So contracts effectively mean that you're going to have to negotiate value capture on the data side as well as the revenue side. >> Absolutely. >> The ability to do contracting historically has been around individual products. And so we're pretty good at that. So we can say, you will buy this product. I'm delivering you the value. And then the utility of that product is up to you. When we start going to service contracts, we get a little bit different kind of an arrangement. Now, it's an ongoing continuous delivery. But for the most part, a lot of those service contracts have been predicated on known-in-advance classes of functions, like Salesforce, for example. Or the SaaS business where you're able to write a contract that says over time you will have access to this service. When we start talking about some of this automation though, now we're talking about ongoing, but highly bespoke, and potentially highly divergent, over a relatively short period of time, that you have a hard time writing contracts that will prescribe the range of behaviors and the promise about how those behaviors are actually going to perform. I don't think we're there yet. What do you guys think? >> Well, >> No, no way. I mean, >> Especially when you think about realtime. (laughing) >> Yeah. It has to be realtime to get to the end point of automating the actual reply, the actual action that you take. That's where you have to get to. It won't be sufficient if it's not realtime. I think it's a very interesting area, this contracts area. If you think about solutions for it, I would be going straight towards blockchain type architectures and dynamic blockchain contracts that would have to be put in place. >> Peter: But they're not realtime. >> The contracts aren't realtime. The contracts will never be realtime, but the >> Access? >> access to the data and the understanding of what data is required. Those will be realtime. >> Well, we'll see. I mean, Ethereum's what? Every 12 seconds? >> Well. That's-- >> Everything gets updated? >> To me, that's good enough. >> Okay. >> That's realtime enough. It's not going to solve the problem of somebody >> Peter: It's not going to solve the problem at the edge. >> At the very edge, but it's certainly sufficient to solve the problem of contracts.
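The blockchain-style contracts David raises above can be reduced to a toy illustration: an append-only, hash-chained log of who may use which data, under which terms. This is only a sketch of the idea in plain Python, not a real smart-contract platform:

```python
# Toy sketch of the data-sharing contract idea: an append-only, hash-chained
# log of terms. Illustrative only; a real deployment would sit on an actual
# blockchain / smart-contract platform rather than a Python list.
import hashlib, json, time

class DataSharingLedger:
    def __init__(self):
        self.entries = []

    def add_contract(self, provider, consumer, dataset, terms):
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {
            "provider": provider,
            "consumer": consumer,
            "dataset": dataset,
            "terms": terms,              # e.g. "analytics only, no resale"
            "timestamp": time.time(),
            "prev_hash": prev_hash,
        }
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        self.entries.append(body)
        return body["hash"]

    def verify(self):
        """Recompute the chain; any tampered entry breaks verification."""
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if e["prev_hash"] != prev or e["hash"] != recomputed:
                return False
            prev = e["hash"]
        return True

ledger = DataSharingLedger()
ledger.add_contract("StorageVendor", "OEM", "device_telemetry",
                    "model training only, 12 months")
print(ledger.verify())  # True until someone edits an entry
```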
>> But, and I would add to that and say, in addition to having all this data available. Let's go back like 10, 20 years and look at Cisco. A lot of their differentiation and what entrenched them was sort of universal familiarity with their admin interfaces and they might not expose APIs in a way that would make it common across their competitors. But if you had data from them and a constrained number of other providers for around which you would build let's say, these modern big data applications. It's if you constrain the problem, you can get to the depth first. >> Yeah, but Cisco is a great example of it's an archetype for what I said earlier, that notion of an iron triangle. You had Cisco admins >> Yeah. that were certified to run Cisco gear and therefore had a strong incentive to ensure that more Cisco gear was purchased utilizing a Cisco command line interface that did incorporate a fair amount of automation for that Cisco gear and it was almost impossible for a lot of companies to penetrate that tight arrangement between the Cisco admin that was certified, the Cisco gear, and the COI. >> And the exact same thing happened with Oracle. The Oracle admin skillset was pervasive within large >> Peter: Happened with everybody. >> Yes, absolutely >> But, >> Peter: The only reason it didn't happen in the IBM mainframe, David, was because of a >> It did happen, yeah, >> Well, but it did happen, but governments stepped in and said, this violates antitrust. And IBM was forced by law, by court decree, to open up those interfaces. >> Yes. That's true. >> But are we going to see the same type of thing >> I think it's very interesting to see the shape of this market. When we look a little bit ahead. People like Amazon are going to have IAS, they're going to be running applications. They are going to go for the depth way of doing things across, or what which way around is it? >> Peter: The breadth. They're going to be end to end. >> But they will go depth in individual-- >> Components. Or show of, but they will put together their own type of things for their services. >> Right. >> Equally, other players like Dell, for example, have a lot of different products. A lot of different components in a lot of different areas. They have to go piece by piece and put together a consortium of suppliers to them. Storage suppliers, chip suppliers, and put together that outside and it's going to have to be a different type of solution that they put together. HP will have the same issue there. And as of people like CA, for example, who we'll see an opportunity for them to be come in again with great products and overlooking the whole of all of this data coming in. >> Peter: Oh, sure. Absolutely. >> So there's a lot of players who could be in this area. Microsoft, I missed out, of course they will have the two ends that they can combine together. >> Well, they may have an advantage that nobody else has-- >> Exactly. Yeah. because they're strong in both places. But I have Jim Kobielus. Let me check, are you there now? Do we got Jim back? >> Can you hear me? >> Peter: I can barely hear you, Jim. Could we bring Jim's volume up a little bit? So, Jim, I asked the question earlier, about we have the tooling for AI. We know how to get data. How to build models and how to apply the models in a broad brush way. And we're certainly starting to see that happen within the IT operations management world. 
The ITOM world, but we don't yet know how we're going to write these contracts that are capable of better anticipating, putting in place a regime that really describes how the, what are the limits of data sharing? What are the limits of derivative use? Et cetera. I argued, and here in the studio we generally agreed, that's we still haven't figured that out and that this is going to be one of the places where the tension between, at least in the B2B world, data availability and derivative use and where you capture value and where those profitables go, is going to be significant. But I want to get your take. Has the AI community >> Yeah. started figuring out how we're going to contractually handle obligations around data, data use, data sharing, data derivative use. >> The short answer is, no they have not. The longer answer is, that can you hear me, first of all? >> Peter: Barely. >> Okay. Should I keep talking? >> Yeah. Go ahead. >> Okay. The short answer is, no that the AI community has not addressed those, those IP protection issues. But there is a growing push in the AI community to leverage blockchain for such requirements in terms of block chains to store smart contracts where related to downstream utilization of data and derivative models. But that's extraordinarily early on in its development in terms of insight in the AI community and in the blockchain community as well. In other words, in fact, in one of the posts that I'm working on right now, is looking at a company called 8base that's actually using blockchain to store all of those assets, those artifacts for the development and lifecycle along with the smart contracts to drive those downstream uses. So what I'm saying is that there's lots of smart people like yourselves are thinking about these problems, but there's no consensus, definitely, in the AI community for how to manage all those rights downstream. >> All right. So very quickly, Ralph Finos, if you're there. I want to get your perspective >> Yeah. on what this means from markets, market leadership. What do you think? How's this going to impact who are the leaders, who's likely to continue to grow and gain even more strength? What're your thoughts on this? >> Yeah. I think, my perspective on this thing in the near term is to focus on simplification. And to focus on depth, because you can get return, you can get payback for that kind of work and it simplifies the overall picture so when you're going broad, you've got less of a problem to deal with. To link all these things together. So I'm going to go with the Shaker kind of perspective on the world is to make things simple. And to focus there. And I think the complexity of what we're talking about for breadth is too difficult to handle at this point in time. I don't see it happening any time in the near future. >> Although there are some companies, like Splunk, for example, that are doing a decent job of presenting a more of a breadth approach, but they're not going deep into the various elements. So, George, really quick. Let's talk to you. >> I beg to disagree on that one. >> Peter: Oh! >> They're actually, they built a platform, originally that was breadth first. They built all these, essentially, forwarders which could understand the formats of the output of all sorts of different devices and services. But then they started building what they called curated experiences which is the equivalent of what we call depth first. They're doing it for IT service management. They're doing it for what's called user behavior. 
Analytics, which is it's a way of tracking bad actors or bad devices on a network. And they're going to be pumping out more of those. What's not clear yet, is how they're going to integrate those so that IT service management understands security and vice versa. >> And I think that's one of the key things, George, is that ultimately, the real question will be or not the real question, but when we think about the roadmap, it's probably that security is going to be early on one of the things that gets addressed here. And again, it's not just security from a perimeter standpoint. Some people are calling it a software-based perimeter. Our perspective is the data's going to go everywhere and ultimately how do you sustain a zero trust world where you know your data is going to be out in the clear so what are you going to do about it? All right. So look. Let's wrap this one up. Jim Kobielus, let's give you the first Action Item. Jim, Action Item. >> Action Item. Wow. Action Item Automation is just to follow the stack of assets that drive automation and figure out your overall sharing architecture for sharing out these assets. I think the core asset will remain orchestration models. I don't think predictive models in AI are a huge piece of the overall automation pie in terms of the logic. So just focus on building out and protecting and sharing and reusing your orchestration models. Those are critically important. In any domain. End to end or in specific automation domains. >> Peter: David Floyer, Action Item. >> So my Action Item is to acknowledge that the world of building your own automation yourself around a whole lot of piece parts that you put together are over. You won't have access to a sufficient data. So enterprises must take a broad view of getting data, of getting components that have data be giving them data. Make contracts with people to give them data, masking or whatever it is and become part of a broader scheme that will allow them to meet the automation requirements of the 21st century. >> Ralph Finos, Action Item. >> Yeah. Again, I would reiterate the importance of keeping it simple. Taking care of the depth questions and moving forward from there. The complexity is enormous, and-- >> Peter: George Gilbert, Action Item. >> I say, start with what customers always start with with a new technology, which is a constrained environment like a pilot and there's two areas that are potentially high return. One is big data, where it's been a multi vendor or multi-vendor component mix, and a mess. And so you take that and you constrain that and make that a depth-first approach in the cloud where there is data to manage that. And the second one is security, where we have now a more and more trained applications just for that. I say, don't start with a platform. Start with those solutions and then start adding more solutions around that. >> All right. Great. So here's our overall Action Item. The question of automation or roadmap to automation is crucial for multiple reasons. But one of the most important ones is it's inconceivable to us to envision how a business can institute even more complex applications if we don't have a way of improving the degree of automation on the underlying infrastructure. How this is going to play out, we're not exactly sure. But we do think that there are a few principals that are going to be important that users have to focus on. Number one is data. 
Be very clear that there is value in your data, both to you as well as to your suppliers and as you think about writing contracts, don't write contracts that are focused on a product now. Focus on even that product as a service over time where you are sharing data back and forth in addition to getting some return out of whatever assets you've put in place. And make sure that the negotiations specifically acknowledge the value of that data to your suppliers as well. Number two, that there is certainly going to be a scale here. There's certainly going to be a volume question here. And as we think about where a lot of the new approaches to doing these or this notion of automation, is going to come out of the cloud vendors. Once again, the cloud vendors are articulating what the overall model is going to look like. What that cloud experience is going to look like. And it's going to be a challenge to other suppliers who are providing an on-premises true private cloud and Edge orientation where the data must live sometimes it is not something that they just want to do because they want to do it. Because that data requires it to be able to reflect that cloud operating model. And expect, ultimately, that your suppliers also are going to have to have very clear contractual relationships with the cloud players and each other for how that data gets shared. Ultimately, however, we think it's crucially important that any CIO recognized that the existing environment that they have right now is not converged. The existing environment today remains operators, suppliers of technology, and suppliers of automation capabilities and breaking that up is going to be crucial. Not only to achieving automation objectives, but to achieve a converged infrastructure, hyper converged infrastructure, multi-cloud arrangements, including private cloud, true private cloud, and the cloud itself. And this is going to be a management challenge, goes way beyond just products and technology, to actually incorporating how you think about your shopping, organized, how you institutionalize the work that the business requires, and therefore what you identify as a tasks that will be first to be automated. Our expectation, security's going to be early on. Why? Because your CEO and your board of directors are going to demand it. So think about how automation can be improved and enhanced through a security lens, but do so in a way that ensures that over time you can bring new capabilities on with a depth-first approach at least, to the breadth that you need within your shop and within your business, your digital business, to achieve the success and the results that you want. Okay. Once again, I want to thank David Floyer and George Gilbert here in the studio with us. On the phone, Ralph Finos and Jim Kobielus. Couldn't get Neil Raiden in today, sorry Neil. And I am Peter Burris, and this has been an Action Item. Talk to you again soon. (upbeat digital music)
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jim | PERSON | 0.99+ |
David Floyer | PERSON | 0.99+ |
Jim Kobielus | PERSON | 0.99+ |
David | PERSON | 0.99+ |
George Gilbert | PERSON | 0.99+ |
Peter Burris | PERSON | 0.99+ |
George | PERSON | 0.99+ |
Neil | PERSON | 0.99+ |
Peter | PERSON | 0.99+ |
April 27, 2018 | DATE | 0.99+ |
Ralph Finos | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
Neil Raiden | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Cisco | ORGANIZATION | 0.99+ |
Dell | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
21st century | DATE | 0.99+ |
two ways | QUANTITY | 0.99+ |
8base | ORGANIZATION | 0.99+ |
10 | QUANTITY | 0.99+ |
hundreds | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
Splunk | ORGANIZATION | 0.99+ |
two areas | QUANTITY | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
One | QUANTITY | 0.99+ |
HP | ORGANIZATION | 0.99+ |
each generation | QUANTITY | 0.99+ |
theCUBE | ORGANIZATION | 0.99+ |
Intel | ORGANIZATION | 0.99+ |
both places | QUANTITY | 0.99+ |
Palo Alto | LOCATION | 0.99+ |
both | QUANTITY | 0.98+ |
two things | QUANTITY | 0.98+ |
Three | QUANTITY | 0.98+ |
two | QUANTITY | 0.98+ |
SASS | ORGANIZATION | 0.98+ |
this week | DATE | 0.97+ |
each time | QUANTITY | 0.97+ |
two ends | QUANTITY | 0.97+ |
today | DATE | 0.97+ |
ORGANIZATION | 0.96+ | |
first | QUANTITY | 0.96+ |
second one | QUANTITY | 0.94+ |
CA | LOCATION | 0.92+ |
Action Item, Graph DataBases | April 13, 2018
>> Hi, I'm Peter Burris. Welcome to Wikibon's Action Item. (electronic music) Once again, we're broadcasting from our beautiful theCUBE Studios in Palo Alto, California. Here in the studio with me, George Gilbert, and remote, we have Neil Raden, Jim Kobielus, and David Floyer. Welcome, guys! >> Hey. >> Hi, there. >> We've got a really interesting topic today. We're going to be talking about graph databases, which probably just immediately turned off everybody. But we're actually not going to talk so much about it from a technology standpoint. We're really going to spend most of our time talking about it from the standpoint of the business problems that IT and technology are being asked to address, the degree to which graph databases, in fact, can help us address those problems, and what we need to do to actually address them. Human beings tend to think in terms of relationships of things to each other. So what the graph community talks about is graph-shaped problems. And by graph-shaped problem we might mean that someone owns something and someone owns something else, or someone shares an asset, or it could be any number of different things. But we tend to think in terms of things and the relationships that those things have to other things. Now, the relational model has been an extremely successful way of representing data for a lot of different applications over the course of the last 30 years, and it's not likely to go away. But the question is, do these graph-shaped problems actually lend themselves to a new technology that can work with relational technology to accelerate the rate at which we can address new problems, accelerate the performance of those new problems, and ensure the flexibility and plasticity that we need within the application set, so that we can consistently use this as a basis for going out and extending the quality of our applications as we take on even more complex problems in the future? So let's start here. Jim Kobielus, when we think about graph databases, give us a little hint on the technology and where we are today. >> Yeah, well, graph databases have been around for quite a while in various forms, addressing various core use cases such as social network analysis, recommendation engines, fraud detection, semantic search, and so on. Graph database technology is essentially very closely related to relational, but it's specialized to, when you think about it, Peter, the very heart of a graph-shaped business problem, the entity relationship diagram. And anybody who's studied databases has mastered, at least at a high level, entity relationship diagrams. The more complex these relationships grow among a growing range of entities, the more complex the network structure becomes, in terms of linking them together at a logical level. So graph database technology was developed a while back to be able to support very complex graphs of entities and relationships. A lot of what it's used for is analytic: very focused on fast query, what they call graph traversal, among very large graphs, to find quick answers to questions that might involve who owns which products that they bought at which stores in which cities and are serviced by which support contractors and have which connections or interrelationships with other products they may have bought from us and our partners, so forth and so on. When you have very complex questions of this sort, they lend themselves to graph modeling.
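To make Jim's example concrete, here is a deliberately tiny, hypothetical sketch in plain Java, not any particular graph database's API, of the structure he is describing: entities as nodes, named relationships as edges, and a question like "which contractors service the products Alice owns?" answered by walking relationships directly rather than assembling them through joins. All names and relationship types are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal labeled-graph sketch: each node keeps a list of (relationship, target) edges.
public class TinyGraph {

    private final Map<String, List<String[]>> edges = new HashMap<>();

    // Record a directed, labeled relationship between two entities.
    public void relate(String from, String relationship, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(new String[]{relationship, to});
    }

    // Follow one relationship type outward from a starting node.
    public List<String> follow(String from, String relationship) {
        List<String> targets = new ArrayList<>();
        for (String[] edge : edges.getOrDefault(from, Collections.emptyList())) {
            if (edge[0].equals(relationship)) {
                targets.add(edge[1]);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        TinyGraph g = new TinyGraph();
        g.relate("alice", "OWNS", "thermostat-42");
        g.relate("thermostat-42", "BOUGHT_AT", "store-7");
        g.relate("thermostat-42", "SERVICED_BY", "contractor-3");

        // "Which contractors service the products Alice owns?" is a short walk,
        // not a join: products she owns, then who services each one.
        for (String product : g.follow("alice", "OWNS")) {
            System.out.println(product + " is serviced by " + g.follow(product, "SERVICED_BY"));
        }
    }
}
```

A real graph database adds storage, indexing, rich properties on nodes and edges, and a query language on top of exactly this kind of traversal; the point of the sketch is only that the connectedness lives in the data structure itself rather than in a query predicate.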
And to some degree, to the extent that you need to perform very complex queries of this sort very rapidly, graph databases, and there's a wide range of those on the market, have been optimized for that. But we also have graph abstraction layers over RDBMSes and multi-model databases. You'll find them running in IBM's databases, or Microsoft Cosmos DB, and so forth. You don't need graph-specialized databases in order to do graph queries, in order to manipulate graphs. That's the issue here. When does a specialized graph database serve your needs better than a non-graph-optimized but nonetheless graph-enabling database? That's the core question. >> So, Neil Raden, let's talk a little bit about the classes of business problems that could in fact be served by representing data utilizing a graph model. So these graph-shaped problems, independent of the underlying technology. Let's start there. What kinds of problems can business people start thinking about solving by thinking in terms of graphs of things and relationships amongst things? >> It all comes down to connectedness. That's the basis of a graph database, is how things are connected, either weakly or strongly. And these connected relationships can be very complicated. They can be based on very complex properties. A relational database is not based on, not only is it not based on connectedness, it's not based on connectedness at all. I'd like to say it's based on un-connectedness. And the whole idea in a relational database is that the intelligence about connectedness is buried in the predicate of a query. It's not in the database itself. So I don't know how overlaying graph abstractions on top of a relational database are a good idea. On the other hand, I don't know how stitching a relational database into your existing operation is going to work, either. We're going to have to see. But I can tell you that a major part of data science, machine learning, and AI is going to need to address the issue of causality, not just what's related to each other. And there's a lot of science behind using graphs to get at the causality problem. >> And we've seen, well, let's come back to that. I want to come back to that. But George Gilbert, we've kind of experienced a similar type of thing back in the '90s with the whole concept of object-orientated databases. They were represented as a way of re-conceiving data. The problem was that they had to go from the concept all the way down to the physical thing, and they didn't seem to work. What happened? >> Well it turns out, the big argument was, with object-oriented databases, we can model anything that's so much richer, especially since we're programming with objects. And it turns out, though, that theoretically, especially at that time, you could model anything down at the physical level or even the logical level in a relational database, and so those code bases were able to handle sort of similar, both ends of the use cases, both ends of the spectrum. But now that we have such extreme demands on our data management, rather than look at a whole application or multiple applications even sharing a single relational database, like some of the big enterprise apps, we have workloads within apps like recommendation engines, or a knowledge graph, which explains the relationship between people, places, and things. Or digital twins, or mapping your IT infrastructure and applications, and how they all hold together. 
You could do that in a relational database, but in a graph database, you can organize it so that you can have really fast analysis of these structures. But, the trade-off is, you're going to be much more restricted in how you can update the stuff. >> Alright, so we think about what happened, then, with some of the object-orientated technology, the original world database, the database was bound to the application, and the developer used the database to tell the application where to go find the data. >> George: Right. >> Relational data allowed us not to tell the applications where to find things, but rather how to find things, and that was persisted, and was very successful for a long time. Object-orientated technologies, in many respects, went back to the idea that the developer had to be very concrete about telling the application where the data was, but we didn't want to do things that way. Now, something's happened, David Floyer. One of the reasons why we had this challenge of representing data in a more abstract way across a lot of different forms without having it also being represented physically, and therefore a lot of different copies and a lot of different representations of the data which broke systems of record and everything else, was that the underlying technology was focused on just persisting data and not necessarily delivering it into these new types of datas, databases, data models, et cetera. But Flash changes that, doesn't it? Can't we imagine a world in which we can have our data in Flash and then, which is a technology that's more focused on delivering data, and then having that data be delivered to a lot of different representations, including things like graph databases, graph models. Is that accurate? >> Absolutely. In a moment I'll take it even further. I think the first point is that when we were designing real-time applications, transactional applications, we were very constrained, indeed, by the amount of data that we could get to. So, as a database administrator, I used to have a rule which you could, database developers could not issue more than 100 database calls. And the reason was that, they could always do more than that, but the applications became very unstable and they became very difficult to maintain. The cost of maintenance went up a lot. The whole area of Flash allows us to do a number of things, and the area of UniGrid enables us to do a number of things very differently. So that we can, for example, share data and have many different views of it. We can use UniGrid to be able to bring far greater amounts of power, compute power, GPUs, et cetera, to bear on specific workloads. I think the most useful thing to think about this is this type of architecture can be used to create systems of intelligence, where you have the traditional relational databases dealing with systems of record, and then you can have the AI systems, graph systems, all the other components there looking at the best way of providing data and making decisions in real time that can be fed back into the systems of record. >> Alright, alright. So let's-- >> George: I want to add to something on this. >> So, Neil, let me come back to you very quickly, sorry, George. Let me come back to Neil. I want to spend, go back to this question of what does a graph-shaped problem look like? Let's kind of run down it. We talked about AI, what about IoT, guys? Is IoT going to help us, is IoT going to drive this notion of looking at the world in terms of graphs more or less? 
What do you think, Neil? >> I don't know. I hadn't really thought about it, Peter, to tell you the truth. I think that one thing we leave out when we talk about graphs is we talk about, you know, nodes and edges and relationships and so forth, but you can also build a graph with very rich properties. And one thing you can get from a graph query that you can't get from a relational query, unless you write careful predicate, is it can actually do some thinking for you. It can tell you something you don't know. And I think that's important. So, without being too specific about IoT, I have to say that, you know, streaming data and trying to relate it to other data, getting down to, very quickly, what's going on, root-cause analysis, I think graph would be very helpful. >> Great, and, Jim Kobielus, how about you? >> I think, yeah I think that IoT is tailor-made for, or I should say, graph modeling and graph databases are tailor-made for the IoT. Let me explain. I think the IoT, the graph is very much a metadata technology, it's expressing context in a connected universe. Where the IoT is concerned it's all about connectivity, and so graphs, increasingly complex graphs of, say, individuals and the devices and the apps they use and locations and various contexts and so forth, these are increasingly graph-based. They're hierarchical and shifting and changing, and so in order to contextualize and personalize experience in a graph, in an IoT world, I think graph databases will be embedded in the very fabric of these environments. Microsoft has a strategy they announced about a year ago to build more of an intelligent edge around, a distributed graph across all their offerings. So I think graphs will become more important in this era, undoubtedly. >> George, what do you think? Business problems? >> Business problems on IoT. The knowledge graph that holds together digital twin, both of these lend themselves to graph modeling, but to use the object-oriented databases as an example, where object modeling took off was in the applications server, where you had the ability to program, in object-oriented language, and that mapped to a relational database. And that is an option, not the only one, but it's an option for handling graph-model data like a digital twin or IT operations. >> Well that suggests that what we're thinking about here, if we talk about graph as a metadata, and I think, Neil, this partly answers the question that you had about why would anybody want to do this, that we're representing the output of a relational data as a node in a network of data types or data forms so that the data itself may still be relationally structured, but from an application standpoint, the output of that query is, itself, a thing that is then used within the application. >> But to expand on that, if you store it underneath, as fully normalized, in relational language, laid out so that there's no duplicates and things like that, it gives you much faster update performance, but the really complex queries, typical of graph data models, would be very, very slow. So, once we have, say, more in memory technology, or we can manage under the covers the sort of multiple representations of the data-- >> Well that's what Flash is going to allow us to do. >> Okay. >> What David Floyer just talked about. >> George: Okay. >> So we can have a single, persistent, physical storage >> Yeah. >> but it can be represented in a lot of different ways so that we avoid some of the problems that you're starting to raise. 
If we had to copy the data and have physical, physical copies of the data on disc in a lot of different places then we would run into all kinds of consistency and update. It would probably break the model. We'd probably come back to the notion of a single data store. >> George: (mumbles) >> I want to move on here, guys. One really quick thing, David Floyer, I want to ask you. If there's, you mentioned when you were database administrator and you put restrictions on how many database actions an application or transaction was allowed to generate. When we think about what a business is going to have to do to take advantage of this, are there any particular, like one thing that we need to think about? What's going to change within an IT organization to take advantage of graph database? And we'll do the action items. >> Right. So the key here is the number of database calls can grow by a factor of probably a thousand times what it is now with what we can see is coming as technologies over the next couple of years. >> So let me put that in context, David. That's a single transaction now generating a hundred thousand, >> Correct. >> a hundred thousand database calls. >> Well, access calls to data. >> Right. >> Whatever type of database. And the important thing here is that a lot of that is going to move out, with the discussion of IoT, to where the data is coming in. Because the quicker you can do that, the earlier you can analyze that data, and you talked about IoT with possible different sources coming in, a simple one like traffic lights, for example. The traffic lights are being affected by the traffic lights around them within the city. Those sort of problems are ideal for this sort of graph database. And having all of that data locally and being processed locally in memory very, very close to where those sensors are, is going to be the key to developing solutions in this area. >> So, Neil, I've got one question from you, or one question for you. I'm going to put you on the spot. I just had a thought. And here's the thought. We talk a lot about, in some of the new technologies that could in fact be employed here, whether it be blockchain or even going back to SOA, but when we talk about what a system is going to have the authority to do about the idea of writing contracts that describe very, very discretely, what a system is or is not going to do. I have a feeling those contracts are not going to be written in relational terms. I have a feeling that, like most legal documents, they will be written in what looks more like graph terms. I'm extending that a little bit, but this has rights to do this at this point in time. Is that also, this notion of incorporating more contracts directly to how systems work, to assure that we have the appropriate authorities laid out. What do you think? Is that going to be easier or harder as a consequence of thinking in terms of these graph-shaped models? >> Boy, I don't know. Again, another thing I hadn't really thought about. But I do see some real gaps in thinking. Let me give you an analogy. OLAP databases came on the scene back in the '90s whatever. People in finance departments and whatever they loved OLAP. What they hated was the lack of scalability. And now what we see now is scalability isn't a problem and OLAP solutions are suddenly bursting out all over the place. So I think there's a role for a mental model of how you model your data and how you use it that's different from the relational model. 
I think the relational model has prominence and has that advantage of, what's it called? Occupancy or something. But I think that the graph is going to show some real capabilities that people are lacking right now. I think some of them are at the very high end, things, like I said, getting to causality. But I think that graph theory itself is so much richer than the simple concept of graphs that's implemented in graph databases today. >> Yeah, I agree with that totally. Okay, let's do the action item round. Jim Kobielus, I want to start with you. Jim, action item. >> Yeah, for data professionals and analytic professionals, focus on what graphs can't do, cannot do, because you hear a lot of hyperbolic, they're not useful for unstructured data or for machine learning in database. They're not as useful for schema on read. What they are useful for is the same core thing that relational is useful for which is schema on write applied to structured data. Number one. Number two, and I'll be quick on this, focus on the core use cases that are already proven out for graph databases. We've already ticked them off here, social network analysis, recommendation engines, influencer analysis, semantic web. There's a rich range of mature use cases for which semantic techniques are suited. And then finally, and I'll be very quick here, bear in mind that relational databases have been supporting graph modeling, graph traversal and so forth, for quite some time, including pretty much all the core mature enterprise databases. If you're using those databases already, and they can perform graph traversals and so forth reasonably well for your intended application, stick with that. No need to investigate the pure play, graph-optimized databases on the market. However, that said, there's plenty of good ones, including AWS is coming out with Neptune. Please explore the other alternatives, but don't feel like you have to go to a graph database first and foremost. >> Alright. David Floyer, action item. >> Action item. You are going to need to move your data center and your applications from the traditional way of thinking about it, of handling things, which is sequential copies going around, usually taking it two or three weeks. You're going to have to move towards a shared data model where the same set of data can have multiple views of it and multiple uses for multiple different types of databases. >> George Gilbert, action item. >> Okay, so when you're looking at, you have a graph-oriented problem, in other words the data is shaped like a graph, question is what type of database do you use? If you have really complex query and analysis use cases, probably best to use a graph database. If you have really complex update requirements, best to use a combination, perhaps of relational and graph or something like multi-model. We can learn from Facebook where, for years, they've built their source of truth for the social graph on a bunch of sharded MySQL databases with some layers on top. That's for analyzing the graph and doing graph searches. I'm sorry, for updating the graph and maintaining it and its integrity. But for reading the graph, they have an entirely different layer for comprehensive queries and manipulating and traversing all those relationships. So, you don't get a free lunch either way. You have to choose your sweet spots and the trade-offs associated with them. >> Alright, Neil Raden, action item. >> Well, first of all, I don't think the graph databases are subject to a lot of hype. 
I think it's just the opposite. I think they haven't gotten much hype at all. And maybe we're going to see that. But another thing is, a fundamental difference when you're looking at a graph and a graph query, it uses something called open world reasoning. A relational database uses closed world reasoning. I'll give you an example. Country has capital city. Now you have in your graph that China has capital city Beijing, China has capital city Beijing. That doesn't violate the graph. The graph simply understands and intuits that they're different names for the same thing. Now, if you love to write correlated sub-queries for many, many different relationships, I'd say stick to your relational database. I see unique capabilities in a graph that would be difficult to implement in a relational database. >> Alright. Thank you very much, guys. Let's talk about what the action item is for all of us. This week we talked about graph databases. We do believe that they have enormous potential, but we first off have to draw a distinction between graph theory, which is a way of looking at the world and envisioning and conceptualizing solutions to problems, and graph database technology, which has the advantages of being able, for certain classes of data models, to be able to very quickly both write and read data that is based on relationships and hierarchies and network structures that are difficult to represent in a normalized relational database manager. Ultimately, our expectation is that over the next few years, we're going to see an explosion in the class of business problems that lend themselves to a graph-modeling orientation. IoT is an example, very complex analytics systems will be an example, but it is not the only approach or the only way of doing things. But what is interesting, what is especially interesting, is over the last few years, a change in the underlying hardware technology is allowing us to utilize and expand the range of tools that we might use to support these new classes of applications. Specifically, the move to Flash allows us to sustain a single physical copy of data and then have that be represented in a lot of different ways to support a lot of different model forms and a lot of different application types, without undermining the fundamental consistency and integrity of the data itself. So that is going to allow us to utilize new types of technologies in ways that we haven't utilized before, because before, whether it was object-oriented technology or OLAP technology, there was always this problem of having to create new physical copies of data which led to enormous data administrative nightmares. So looking forward, the ability to use Flash as a basis for physically storing the data and delivering it out to a lot of different model and tool forms creates an opportunity for us to use technologies that, in fact, may more naturally map to the way that human beings think about things. Now, where is this likely to really play? We talked about IoT, we talked about other types of technologies. Where it's really likely to play is when the domain expertise of a business person is really pushing the envelope on the nature of the business problem. Historically, applications like accounting or whatnot, were very focused on highly stylized data models, things that didn't necessarily exist in the real world. You don't have double-entry bookkeeping running in the wild. 
You do have it in the legal code, but for some of the things that we want to build in the future, people, the devices they own, where they are, how they're doing things, that lends itself to a real-world experience and human beings tend to look at those using a graph orientation. And the expectations over the next few years, because of the changes in the physical technology, how we can store data, we will be able to utilize a new set of tools that are going to allow us to more quickly bring up applications, more naturally manage data associated with those applications, and, very important, utilize targeted technology in a broader set of complex application portfolios that are appropriate to solve that particular part of the problem, whether it's a recommendation engine or something else. Alright, so, once again, I want to thank the remote guys, Jim Kobielus, Neil Raden, and David Floyer. Thank you very much for being here. George Gilbert, you're in the studio with me. And, once again, I'm Peter Burris and you've been listening to Action Item. Thank you for joining us and we'll talk to you again soon. (electronic music)
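Since recommendation engines came up several times above as a canonical graph-shaped workload, here is one more hedged, illustrative sketch, again in plain Java rather than any specific product, of the "people who bought what you bought also bought..." pattern. In graph terms it is just a short traversal out from a user, back through shared products, and out again; the data and method names are made up.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical co-purchase recommender: traverse user -> product -> other user -> product.
public class CoPurchaseRecommender {

    private final Map<String, Set<String>> productsOf = new HashMap<>(); // user -> products bought
    private final Map<String, Set<String>> buyersOf = new HashMap<>();   // product -> users who bought it

    public void recordPurchase(String user, String product) {
        productsOf.computeIfAbsent(user, k -> new HashSet<>()).add(product);
        buyersOf.computeIfAbsent(product, k -> new HashSet<>()).add(user);
    }

    public Set<String> recommend(String user) {
        Set<String> mine = productsOf.getOrDefault(user, Collections.emptySet());
        Set<String> recommendations = new TreeSet<>();
        for (String product : mine) {                                                      // hop 1: my products
            for (String other : buyersOf.getOrDefault(product, Collections.emptySet())) {  // hop 2: who else bought them
                for (String candidate : productsOf.getOrDefault(other, Collections.emptySet())) { // hop 3: their other products
                    if (!mine.contains(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
        }
        return recommendations;
    }

    public static void main(String[] args) {
        CoPurchaseRecommender r = new CoPurchaseRecommender();
        r.recordPurchase("alice", "helmet");
        r.recordPurchase("bob", "helmet");
        r.recordPurchase("bob", "bike-lock");
        System.out.println(r.recommend("alice")); // prints [bike-lock]
    }
}
```

Expressing the same three-hop walk relationally typically means self-joining a purchases table twice, which is exactly the kind of query that graph-optimized engines, or graph layers over relational engines, are built to make fast.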
Greg Fee, Lyft | Flink Forward 2018
>> Narrator: Live from San Francisco, it's theCUBE covering Flink Forward brought to you by Data Artisans. >> This is George Gilbert. We are at Data Artisan's conference Flink Forward. It is for the Apache Flink commmunity, sponsored by Data Artisans, and all the work they're doing to move Flink Forward, and to surround it with additional value that makes building stream-processing applications accessible to mainstream companies. Right now though, we are not talking to a mainstream company, we're talking to Greg Fee from Lyft. Not Uber. (laughs) And Greg tell us a little bit about what you're doing with Flink. What's the first-use case, that comes to mind that really exercises its capabilities? >> Sure, yeah, so the process of adopting Flink at Lyft has really started with a use case, which was, we're trying to make machine learning more accessible across all of Lyft. So we already use machine learning in quite a few applications, but we want to make sure that we use machine learning as much as possible, we really think that's the path forward. And one of the fundamental difficulties with that is having consistent feature generation between these offline batch-y training scenarios and the online real-time streaming scenarios. And the unified processing engine of Flink really helps us bridge that gap, so. >> When you say unified processing engine, are you saying that the fact that you can manage code and data, as sort of an application version, and some of the, either code or data, is part of the model, and so your versioning? >> That's even a step beyond what I'm talking about. >> Okay. >> Just the basic fundamental ability to have one piece of business logic that you can apply at the batch bulk layer, and in the real-time layer. >> George: Yeah. >> So that's sort of like the core of what Flink gives you. >> Are you running both batch and streaming on Flink? >> Yes, that's right. >> And using the, so, you're using the windows? Or just periodic execution on a stream to simulate batch? >> That's right. So we have, so feature generation crosses a broad spectrum of possible use cases in Flink. >> George: Yeah. >> And this is where we sort of transition more into what dA platform could give for us. So, we're looking to have thousands of different features across all of our machine learning models. So having a platform that can help us host many of these little programs running, help with the application life-cycle of each of these features, as we version them over time. So, we're very excited about what dA platform can do for us. >> Can you tell us a little more about how the stream processing helps you with the feature selection engineering, and is it that you're using streaming, or simulated batch, or batch using the same programming model to train these models, and you're using, you're picking up different derived data, is that how it's working? >> So, typical life-cycle is, it's going to be a feature engineering stage, so the data scientist is looking at their data, they're trying figure out patterns in the data, and they're going to, how you apply Flink there, is as you come up with potential algorithms for how you generate your feature, can run that through Flink, generate some data, apply machine learning model on top of it, and sort of play around with that data, prototype things. >> So, what you're doing is offline, or out of the platform, you're doing the feature selection and the engineering. >> Man: Right. 
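A hedged sketch of the "one piece of business logic, two layers" idea Greg describes, not Lyft's actual code: a single feature function is reused by a batch (DataSet) job that generates training data and by a streaming (DataStream) job that serves the same feature online. The event format, file path, and feature are assumptions made up for illustration, and the sketch targets the Flink APIs of roughly that era.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SharedFeature {

    // The single piece of business logic: turn a raw ride event into a feature value.
    // Hypothetical event format: "rideId,userId,distanceKm".
    public static class RideDistanceFeature implements MapFunction<String, Double> {
        @Override
        public Double map(String rawEvent) {
            return Double.parseDouble(rawEvent.split(",")[2]);
        }
    }

    public static void main(String[] args) throws Exception {
        // Offline / training path: bounded input, same function.
        // (In practice the batch and streaming paths would be two separate jobs
        // sharing the RideDistanceFeature class; they are shown together for brevity.)
        ExecutionEnvironment batch = ExecutionEnvironment.getExecutionEnvironment();
        batch.readTextFile("hdfs:///rides/2018-04-01")   // hypothetical path
             .map(new RideDistanceFeature())
             .print();

        // Online / serving path: unbounded input, same function.
        StreamExecutionEnvironment streaming = StreamExecutionEnvironment.getExecutionEnvironment();
        streaming.socketTextStream("localhost", 9999)     // stand-in for a Kafka-style event hub source
                 .map(new RideDistanceFeature())
                 .print();
        streaming.execute("ride-distance-feature");
    }
}
```

The point is the RideDistanceFeature class itself: version that one artifact, and the offline training path and the online serving path cannot drift apart.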
>> Then you attach a stream to it that has just the relevant, perhaps, the relevant features. >> Man: Right. >> And then that model gets sort of, well maybe not yet, but eventually versioned as part of the application, which includes the application, the rest of the application logic and the data. >> Right. So, like some of the stuff that was touched on this morning at the keynotes, the versioning and maintaining machine learning applications, is a much, is a very complex ecosystem there. So being able to say, okay, going from the prototype stage, doing stuff in batch, to doing stuff in production, and real-time, then being able to version those over time, to move to better and better versions of the future generation, is very important to us. >> I don't know if this is the most politically correct thing, but you just explained it better than everyone else we have talked to. >> Great. (laughs) >> About how it all fits together with the machine learning. So, once you've got that in place, it sounds like you're using the dA platform, as well as, you know, perhaps some extensions for machine learning, to sort of add that as a separate life-cycle, besides the application code. Then, is that going to be the enterprise-wide platform for deploying, developing and deploying, machine learning applications? >> Yes, certainly we think there's probably a broad ecosystem to do machine learning. It's a very, sort of, wide open area. Certainly my agenda is to push it across the company and get as many things running in this system as possible. I think the real-time aspects of it, a unifying aspect, of what Flink can give us, and the platform can give us, in terms of the life-cycles. >> So, are you set up essentially like where you're the, a shared resource, a shared service, which is the platform group? >> Man: Right. >> And then, all the business units, adopt that platform and build their apps on it. >> Right. So my initiative is part of a greater data science platform at Lyft, so, my goal is to have, we have hundreds of data scientists who are going to be looking at this data, giving me little features that they want to do, and we're probably going to end up numbering in the thousands of features, being able to generate all those, maintain all those little programs. >> And when you say generate all those little programs, that's the application logic, and the models specific to that application? >> That's right, well. >> Or is it this? >> There's features that are typically shared across many models. >> Okay. >> So there's like two layers of things happening. >> So you're managing features separately from the models. >> That's right. >> Interesting. Okay, haven't heard that. And is the application manager tooling going to help address that, or is that custom stuff that you have to do? >> So, I think there's, I think there's a potential that that's the way we're going to manage the model stuff as well, but it's still little new over there. >> That you put it on the application platform? >> Right. >> Then that's sort of at the boundary of what you're doing right now, or what you will be doing shortly. >> Right. It's all, it's a matter of use-case, whether it's online or offline, and how it fits best in with the rest of the Lyft engineering system. >> When you're talking about your application landscape, do you have lots of streaming applications that feed other streaming applications, going through a hub. 
Or, are they sort of more discrete, you know, artifacts, discrete programs, and then when do you keep, stay within the streaming processors, and when do you have it in a shared database? >> That's a, that's a lot of questions, kind of a deep question. So, the goal is to have a central hub, where sort of all of our event data passes through it, and that allows us to decouple. >> So that's to be careful, that's not a database central hub, that's a, like a? >> An event hub. >> Event hub. >> Right. >> Yeah, okay. >> So, an event hub in the middle allows us to decompose the different, sort of smaller programs, which again are probably going to number in the thousands, so that being able to have different parts of the company maintain their own part of the overall system is very important to us. I think we'll probably see Flink as a major player, in terms of how those programs run, but we'll probably be shooting things off to other systems like Druid, like Hive, like Presto, like Elasticsearch. >> As derived data? >> As all derived data, from these Flink jobs. And then also, pushing data directly out into some of our production systems to feed into machine learning decisions. >> Okay, this is quite, sounds like the most ambitious infrastructure that we've heard, in that it sounds like pretty ubiquitous. >> We want to be a machine-learning first company. So, it's everywhere. >> So, now help me clarify for me, when? Because this is, you know, for mainstream companies who've programmed with, you know, DBMS, as a shared state manager for decades, help explain to them when you would still use a DBMS for shared state, and when you would start using the distributed state that's embedded in Flink, and the derived data, you know, at the endpoints, at the syncs. >> So I mean, I guess this kind of gets into your exact, your use cases and, you know, your opinions and thoughts about how to use these things best, but. >> George: Your opinion is what we're interested in. >> Right. From where I'm coming, I see basically databases as potential one sync for this data. They do things very well, right? They do structured queries very well. You can have indices built off that, aggregates, really feed into a lot of visualization stuff. >> George: Yeah. >> But, from where I am sitting, like we're really moving away from databases as something that feeds production data. We've got other stores to do that, that are sort of more tailored towards those scenarios. >> When you say to feed production data, this is transaction capture, or data capture. >> Right. So we don't have a lot of atomic transactions, outside the payments at Lyft, most of the stuff is eventually consistent. So we have stores, more like Dynamo or Cassandra HBase that feed a lot of our production data. >> And those databases, are they for like ambient information like influencing an interaction, it doesn't sound like automating a transaction. It would be, it sounds like, context that helps with analytics, but very separate from the OLTP apps. >> That's right. So we have, you can kind of bifurcate the company into the data that's used in production to make decisions that are like facing the user, and then our analytics back end, that really helps business analysts and like the executives make decisions about how we proceed. >> And so that second part, that backend, is more like operational efficiency. >> Man: Right. 
>> And coding new business processes to support new ways of doing business, but the customer-facing stuff specifically like with payments, that still needs a traditional OLTP. >> Man: Right. >> But there not, those use cases aren't growing that much. >> That's right. So, basically we have very specific use-cases for like a traditional database, but in terms of capturing the types of scale, and the type of growth, we're looking for at Lyft, we think some of the other storage engines suit those better. >> So in that use-case, would the OLTP DBMS be at the front end, would it be a source, or a sync? It sounds like it's a source. >> So we actually do it both ways. Right, so, it's great to get our transactional data flowing through our streaming system, it's a lot of value in that, but also then pushing it out, back to some of the aggregate results to DBMS, helps with our analytics pipeline. >> Okay, okay. Well this is actually really interesting. So, where do you see the dA platform helping, you know, going forward; is it something you don't really need because you've built all that scaffolding to help with sort of application life-cycle management, or or do you see it as something that'll help sort of push Flink sort of enterprise-wide? >> I think the dA platform really helps people sort of adopt Flink at an enterprise level. Maintaining the applications is a core part of what it means to run it as a business. And so we're looking at dA platform as a way of managing our applications, and I think, like I'm just talking about one, I'm mostly talking about one application we have for Flink at Lyft. >> Yeah. >> We have many other Flink programs actually running, that are sort of unrelated to my project. >> What about managing non-Flink applications? Do you need an application manager? Is it okay that it's associated with one service, or platform like Flink, or is there a desire you know among bleeding edge customers to have an overall, sort of infrastructure management, application management kind of suite. >> Yes, for sure. You're touching on something I have started to push inside of Lyft, which is the need for an overall application life-cycle management product that's not technology specific. >> Would these sort of plug into the dA platform and whatever the confluent, you know, equivalent is, or is it going to to directly tie to the, you know, operational capabilities, or the functional capabilities, not the management capabilities. In other words would it plug into like core Flink, core Kafka, core Spark, that sort of stuff? >> I think that's sort of largely to be determined. If you go back to sort of how distributed design system works, typically. We have a user plane, which is going to be our data users. Then you end up with the thing we're probably most familiar with, which is our data plane, technologies like Flink and Kafka and Hive, all those guys. What's missing in the middle right now is a control plane. It's a map from the user desire, from the user intention, to what we do with all of that data plane stuff. So launch a new program, maybe you need a new Kafka topic, maybe you need to provision in Kafka. Higher, you need to get some Flink programs running, and whether that talks directly talks to Flink, and goes against Kubernetes, or something like that, or whether it talks to a higher level, like more application-specific platform. >> Man: Yeah. >> I think, you know it's certainly a lot easier, if we have some of these platforms in the way. 
>> Because they give you better abstractions. >> That's right. >> To talk to the platforms. >> That's right. >> That's interesting. Okay, geesh, we learn something really, really interesting with each interview. I'm curious though, if you look out a couple years, how much of your application landscape will be continuous processing, and is that something you can see mainstream enterprises adopting, or has decades of work with, you know, batch and interactive sort of made people too difficult to learn something so radically new? >> I think it's all going to be driven by the business needs, and whether the value is there for people to make that transition 'cause it is quite expensive to invest in new infrastructure. For companies like Lyft, where we're trying to make decisions very quickly, you know, users get down to two seconds makes a difference for the customer, so we're trying to be as, you know, real-time as possible. I used to work at Salesforce. Salespeople are a little less sensitive to these things, and you know it's very, very traditional world. >> That's interesting. (background applauding) >> But even Salesforce is moving towards that style. >> Even Salesforce is moving? >> Is moving toward streaming processing. >> Really? >> George: So like, I think we're going to see it slowly be adopted across the big enterprises. >> George: I imagine that's probably for their analytics. >> That's where they're starting, of course, yeah. >> Okay. So, this was, a little more affirmation on to how we're going to see the control plane evolve, and the interesting use-cases that you're up to. I hope we can see you back next year. And you can tell us how far you've proceeded. >> I certainly hope so, yeah. >> This was really interesting. So, Greg Fee from Lyft. We will hopefully see you again. And this is George Gilbert. We're at the Data Artisans Flink Forward conference in San Francisco. We'll be back after this break. (techno music)
Stephan Ewen, data Artisans | Flink Forward 2018
>> Narrator: Live from San Francisco. It's the CUBE covering Flink Forward brought to you by data Artisans. >> Hi, this is George Gilbert. We are at Flink Forward. The conference put on by data Artisans for the Apache Flink community. This is the second Flink Forward in San Francisco and we are honored to have with us Stephan Ewen, co-founder of data Artisans, co-creator of Apache Flink, and CTO of data Artisans. Stephan, welcome. >> Thank you, George. >> Okay, so with others we were talking about the use cases they were trying to solve but you put together the sort of all the pieces in your head first and are building out, you know, something that's ultimately gets broader and broader in its applicability. Help us, now maybe from the bottom up, help us think through the problems you were trying to solve and and let's start, you know, with the ones that you saw first and then how the platform grows so that you can solve more and more a broader scale of problems. >> Yes, yeah, happy to do that. So, I think we have to take a bunch of step backs and kind of look at what is the let's say the breadth or use cases that we're looking at. How did that, you know, influence some of the inherent decisions and how we've built Flink? How does that relate to what we presented earlier today, the in Austrian processing platform and so on? So, starting to work on Flink and stream processing. Stream processing is an extremely general and broad paradigm, right? We've actually started to say what Flink is underneath the hood. It's an engine to do stateful computations over data streams. It's a system that can process data streams as a batch processor processes, you know, bounded data. It can process data streams as a real-time stream processor produces real-time streams of events. It can handle, you know, data streams as in sophisticated event by event, stateful, timely, logic as you know many applications that are, you know, implemented as data-driven micro services or so and implement their logic. And the basic idea behind how Flink takes its approach to that is just start with the basic ingredients that you need that and try not to impose any form of like various constraints and so on around the use of that. So, when I give the presentations, I very often say the basic building blocks for Flink is just like flowing streams of data, streams being, you know, like received from that systems like Kafka, file systems, databases. So, you route them, you may want to repartition them, organize them by key, broadcast them, depending on what you need to do. You implement computation on these streams, a computation that can keep state almost as if it was, you know, like a standalone java application. You don't think necessarily in terms of writing state or database. Think more in terms of maintaining your own variables or so. Sophisticated access to tracking time and progress or progress of data, completeness of data. That's in some sense what is behind the event time streaming notion. You're tracking completeness of data as for a certain point of time. And then to to round this all up, give this a really nice operational tool by introducing this concept of distributed consistent snapshots. And just sticking with these basic primitives, you have streams that just flow, no barrier, no transactional barriers necessarily there between operations, no microbatches, just streams that flow, state variables that get updated and then full tolerance happening as an asynchronous background process. 
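Stephan's building blocks, a keyed stream, state used almost like a local variable, and timers tied to event-time progress, map directly onto Flink's keyed process functions. The sketch below is illustrative rather than anything data Artisans ships: it counts events per key and emits the count ten minutes (in event time) after each event, assuming timestamps and watermarks are assigned upstream.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Per-key event counter: state like a plain variable, plus an event-time callback.
public class CountPerKey extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out) throws Exception {
        Long current = count.value();
        count.update(current == null ? 1L : current + 1);   // state, used like a local variable

        // Ask to be called back once event time has advanced 10 minutes past this event
        // (assumes the stream has timestamps/watermarks assigned upstream).
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 10 * 60 * 1000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        out.collect(ctx.getCurrentKey() + " -> " + count.value());
    }
}
```

It would be attached to a stream with something like events.keyBy(...).process(new CountPerKey()), and the checkpointing Stephan mentions snapshots exactly this kind of state asynchronously in the background.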
Now that is what is in some sense the I would say kind of the core idea and what helps Flink generalize from batch processing to, you know, real-time stream processing to event-driven applications. And what we saw today is, in the presentation that I gave earlier, how we use that to build a platform for stream processing and event-driven applications. That's taking some of these things and in that case I'm most prominently the fourth aspect the ability to draw like some application snapshots at any point in time and and use this as an extremely powerful operational tool. You can think of it as being a tool to archive applications, migrate applications, fork applications, modify them independently. >> And these snapshots are essentially your individual snapshots at the node level and then you're sort of organizing them into one big logical snapshot. >> Yeah, each node is its own snapshot but they're consistently organized into a globally consistent snapshot, yes. That has a few very interesting and important implications for example. So just to give you one example where this makes really things much easier. If you have an application that you want to upgrade and you don't have a mechanism like that right, what is the default way that many folks do these updates today? Try to do a rolling upgrade of all your individual nodes. You replace one then the next, then the next, then the next but that has this interesting situation where at some point in time there's actually two versions of the application running at the same time. >> And operating on the same sort of data stream. >> Potentially, yeah, or on some partitions of the data stream, we have one version and some partitions you have another version. You may be at the point we have to maintain two wire formats like all pieces of your logic have to be written in understanding both versions or you try to you know use the data format that makes this a little easier but it's just inherently a thing that you don't even have to worry about it if you have this consistent distributed snapshots. It's just a way to switch from one application to the other as if nothing was like shared or in-flight at any point in time. It just gets many of these problems just out of the way. >> Okay and that snapshot applies to code and data? >> So in Flink's architecture itself, the snapshot applies first of all only to data. And that is very important. >> George: Yeah. >> Because what it actually allows you is to decouple the snapshot from the code if you want to. >> George: Okay. >> That allows you to do things like we showed earlier this morning. If you actually have an earlier snapshot where the data is correct then you change the code but you introduce the back. You can just say, "Okay, let me actually change the code "and apply different code to a different snapshot." So, you can actually, roll back or roll forward different versions of code and different versions of state independently or you can go and say when I'm forking this application I'm actually modifying it. That is a level of flexibility that's incredible to, yeah, once you've actually start to make use of it and practice it, it's incredibly useful. 
It's been actually, it's been one of the maybe least obvious things when you start to look into stream processing, but once you actually take stream processing into production, this operational flexibility that you get there is, I would say, very high up for a lot of users when they said, "Okay, this is why we took Flink to streaming production and not others." The ability to do, for example, that. >> But this sounds then like, with some stream processors, the idea of unbundling the database: you have derived data, you know, at different sync points, and that derived data is, you know, for analysis, views, whatever. But it sounds like what you're doing is taking derived data of sort of what the application is working on in progress and creating essentially a logically consistent view that's not really derived data for some other application's use but for operational use. >> Yeah, so. >> Is that a fair way to explain? >> Yeah, let me try to rephrase it a bit. >> Okay. >> When you start to take this streaming style approach to things, which, you know, has been called turning the database inside out, unbundling the database, your input sequence of events is arguably the ground truth, and what the stream processor computes is a view of the state of the world. So, while this sounds at first super easy, and you know, views, you can always recompute a view, right? Now in practice this view of the world is not just something that's just a lightweight thing that's only derived from the sequence of events. It's actually the state of the world that you want to use. It might not be fully reproducible, just because either the sequence of events has been truncated or because the sequence of events is just plain too long to feasibly recompute in a reasonable time. So, having a way to work with this in a way that just complements this whole idea of, you know, event-driven, log-driven architecture very cleanly is kind of what this snapshot tool also gives you. >> Okay, so then help us think, so that sounds like that was part of core Flink. >> That is part of core Flink's inherent design, yes. >> Okay, so then take us to the next level of abstraction. The scaffolding that you're building around it with the dA platform, and how that, the sort of thing that makes stream processing more accessible, how it, you know, empowers a whole other generation. >> Yeah, so there are different angles to what the dA platform does. So, one angle is just very pragmatically easing rollout of applications by having one way to integrate the platform with your metrics, alerting, logging, the CI/CD pipeline, and then every application that you deploy over there just inherits all of that, so the application developer doesn't have to worry about anything. They just say, like, this is my piece of code, I'm putting it there, and it's just going to be hooked in with everything else. That's not rocket science but it's extremely valuable, because there are a lot of tedious bits here and there that, you know, otherwise eat up a significant amount of the development time. The technologically maybe more challenging part that this solves is the part where we're really integrating the application snapshots, the compute resources, the configuration management and everything into this model where you don't think about, I'm running a Flink job here, that Flink job has created a snapshot that is lying around here.
There's also a snapshot here, which may come from that Flink application. Also, that Flink application was running, and that's actually just a new version of that Flink application, which is, let's say, the testing or acceptance run for the version that we're about to deploy here, and so on, like tying all of these things together. >> So, it's not just the artifacts from one program, it's how they all interrelate? >> It gives you exactly the idea of how they all interrelate, because an application over its lifetime will correspond to different configurations, different code versions, different deployments in production, a/b testing and so on, and like how all of these things kind of work together, how they interplay. Right, Flink, like I said before, deliberately couples checkpoints and code and so on in a rather loose way, to allow you to evolve the code differently and still be able to match a previous snapshot onto a newer code version and so on. We make heavy use of that, and we can now give you a good way of, first of all, tracking all of these things together: how do they relate, when was which version running, what code version was that; having snapshots, we can always go back and reinstate earlier versions; having the ability to always move a deployment from here to there, like fork it, drop it, and so on. That is one part of it, and the other part of it is the tight integration with Kubernetes. Initially the container sweet spot was stateless compute, but the way stream processing works, how the architecture works, is that the nodes are inherently not stateless, they have a view of the state of the world. This is always recoverable. You can also change the number of containers, and with Flink and other frameworks you have the ability to kind of adjust this and so on, >> Including repartitioning the-- >> Including repartitioning the state, but it's a thing where you often have to be quite careful how to do that so that it all integrates consistently: the right containers are running at the right point in time with the exact right version, and there's not a split-brain situation where something happens to still be running some other partitions at the same time, or a container goes down and it's a situation where you're supposed to recover or rescale. Figuring all of these things out together, this is what the idea of integrating these things in a very tight way gives you. So think of it the following way, right? Initially you just start with Docker. Docker is a way to say, I'm packaging up everything that a process needs, all of its environment, to make sure that I can deploy it here and here and here and it just always works. It's not like, "Oh, I'm missing the correct version of the library here," or "I'm interfering with that other process on a port."
On top of Docker, people added things like Kubernetes to orchestrate many containers together, forming an application, and then on top of Kubernetes there are things like Helm, or for certain frameworks there are Kubernetes Operators and so on, which try to raise the abstraction to say, "Okay, we're taking care of these aspects that this needs in addition to container orchestration." We're doing exactly that kind of thing: we're raising the abstraction one level up to say, okay, we're not just thinking about the containers, the compute, and maybe their local persistent storage, but we're looking at the entire stateful application, with its compute, with its state, with its archival storage, with all of it together. >> Okay, let me sort of peel off with a question about more conventionally trained developers and admins. They're used to databases for batch and request-response type jobs or applications. Do you see them becoming potential developers of continuous stream processing apps, or do you see it mainly for a new generation of developers? >> No, I would actually say that for a lot of the classic, call it request/response, or call it create, read, update, delete, or so, applications working against a database, there's this huge potential for stream processing or that kind of event-driven architecture to help change this view. There's actually a fascinating talk here by the folks from (mumbles) who implemented an entire social network in this stream processing architecture, so not against a database but against a log and a stream processor instead. It comes with some really cool properties, like a very unique way of having operational flexibility too, to at the same time test, evolve, run and do very rapid iterations over your-- >> Because of the decoupling? >> Exactly, because of the decoupling, because you don't have to always worry about, okay, I'm experimenting here with something, let me first of all create a copy of the database, and then once I actually think that this is working out well, then, okay, how do I either migrate those changes back, or make sure that the copy of the database I made gets brought up to speed with the production database again before I switch over to the new version. So many of these things, the pieces just fall together easily in the streaming world. >> I think I asked this of Kostas, but if a business analyst wants to query the current state of what's in the cluster, do they go through some sort of head node that knows where the partitions lie, and then some sort of query optimizer figures out how to execute that with a cost model or something? In other words, if you want it to do some sort of batch or interactive type... >> So there are different answers to that, I think. First of all, there's the ability to look into the state of Flink, as in you have the individual nodes that maintain state as they're doing the computation, and you can look into this, but it's more like a lookup thing. It's, you're not running a query, as in a SQL query, against that particular state. If you would like to do something like that, what Flink gives you is the ability, as always...
There's a wide variety of connectors, so you can, for example, say, I'm describing my streaming computation here, you can describe it in SQL, and you can say the result of this thing, I'm writing it to a nicely queryable data store, an in-memory database or so, and then you would actually run the dashboard-style exploratory queries against that particular database. So Flink's sweet spot at this point is not to run many small, fast, short-lived SQL queries against something that is running in Flink at the moment. That's not what it is yet built and optimized for. >> A more batch oriented one would be the derived data that's in the form of a materialized view. >> Exactly, so these two sides play together very well, right? You have the more exploratory batch-style queries that go against the view, and then you have the stream processor and streaming SQL used to continuously compute that view that you then explore. >> Do you see scenarios where you have traditional OLTP databases that are capturing business transactions, but now you want to inform those transactions or potentially automate them with machine learning? And so you capture a transaction, and then there's sort of ambient data, whether it's about the user interaction or it's about the machine data flowing in, and maybe you don't capture the transaction right away but you're capturing data for the transaction and the ambient data. From the ambient data you calculate some sort of analytic result, which could be a model score, and that informs the transaction that's running at the front end of this pipeline. Is that a model that you see in the future? >> So that sounds like a use case that has actually been run. It's not uncommon, yeah. In some sense, a model like that is behind many of the fraud detection applications. You have the transaction that you capture. You have a lot of contextual data that you receive, from which you either build a model in the stream processor, or you build a model offline and push it into the stream processor as, let's say, a stream of model updates, and then you're using that stream of model updates. You derive your classifiers or your rule engines, or your predictor state, from that set of updates and from the history of the previous transactions, and then you use that to attach a classification to the transaction, and then once this is actually returned, this stream is fed back to the part of the computation that actually processes the transaction itself, to trigger the decision whether to, for example, hold it back or let it go forward. >> So this is an application where people who have built traditional architectures would add this capability on for low latency analytics? >> Yeah, that's one way to look at it, yeah. >> As opposed to a rip and replace, like we're going to take out our request/response and our batch and put in stream processing. >> Yeah, so that is definitely a way that stream processing is used, that you basically capture a change log or so of whatever is happening in a database, or you just immediately capture the events, the interaction from users and devices, and then you let the stream processor run side by side with the old infrastructure and just compute exactly the additional information that even a mainframe database might in the end use to decide what to do with a certain transaction. So it's a way to complement legacy infrastructure with new infrastructure without having to break off or break away the legacy infrastructure.
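As a rough illustration of that fraud-detection shape in Flink's Java DataStream API, one common pattern (predating Flink's dedicated broadcast state) is to connect the transaction stream with a broadcast stream of model updates and keep the latest model inside the scoring operator. The record types, the stand-in sources, and the scoring logic below are hypothetical, not details Stephan gave.

```java
import java.io.Serializable;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class FraudScoringJob {

  // Hypothetical record types.
  public static class Transaction implements Serializable {
    public String accountId;
    public double amount;
  }

  public static class Model implements Serializable {
    public double threshold; // stand-in for real model parameters
  }

  public static class ScoredTransaction implements Serializable {
    public Transaction tx;
    public boolean hold;
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Stand-in sources; in practice these would be Kafka topics or a change log.
    DataStream<Transaction> transactions = env.fromElements(new Transaction());
    DataStream<Model> modelUpdates = env.fromElements(new Model());

    // Broadcast model updates so every parallel scoring task sees the latest model.
    DataStream<ScoredTransaction> scored =
        transactions
            .connect(modelUpdates.broadcast())
            .flatMap(new CoFlatMapFunction<Transaction, Model, ScoredTransaction>() {
              private Model current; // latest model pushed from offline training

              @Override
              public void flatMap1(Transaction tx, Collector<ScoredTransaction> out) {
                ScoredTransaction result = new ScoredTransaction();
                result.tx = tx;
                // Score with whatever model has arrived so far; hold the transaction if it looks risky.
                result.hold = current != null && tx.amount > current.threshold;
                out.collect(result);
              }

              @Override
              public void flatMap2(Model update, Collector<ScoredTransaction> out) {
                current = update; // just refresh the model, emit nothing
              }
            });

    // The scored stream would then be fed back (e.g. via Kafka) to the system processing the transaction.
    scored.print();
    env.execute("fraud scoring sketch");
  }
}
```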
>> So let me ask in a different direction, more on the complexity that forms a tax for developers and administrators. Many of the open source community products slash projects solve narrow sort of functions within a broader landscape, and there's a tax on developers and admins in trying to make those work together because of the different security models, data models, all that. >> There is a zoo of systems and technologies out there, and also of different paradigms to do things. Once systems kind of have a similar paradigm or idea in mind, they usually work together well, but there are different philosophical takes-- >> Give me some examples of the different paradigms that don't fit together well. >> For example... Maybe one good example was initially, when streaming was a rather new thing. At this point in time stream processors were very much thought of as a bit of an addition to, let's say, the batch stack or whatever other stack you currently have; just look at it as an auxiliary piece to do some approximate computation. And a big reason why that was the case is because the way that these stream processors thought of state was with a different consistency model, and the way they thought of time was actually different than, you know, the batch processors or the databases, which use timestamp fields, and the early stream processors-- >> They can't handle event time. >> Exactly, they just use processing time. That's why, with these things, you know, you could maybe complement the stack with that, but it didn't really go well together; you couldn't just say, like, okay, I can actually take this batch job and kind of interpret it also as a streaming job. Once the stream processors got a better interpretation. >> The OEM architecture. >> Exactly. So once the stream processors adopted a stronger consistency model, a time model that is more compatible with reprocessing and so on, all of these things all of a sudden fit together much better. >> Okay so, do you see that vendors who are oriented around a single paradigm or unified paradigm, do you see them continuing to broaden their footprint so that they can essentially take some of the complexity off the developer and the admin by providing something that's one throat to choke, with the pieces that were designed to work together out of the box, unlike some of the zoos with the former Hadoop community? In other words, a lot of vendors seem to be trying to do a broader footprint so that it's something that's just simpler to develop to and to operate? >> There are a few good efforts happening in that space right now, so one that I really like is the idea of standardizing on some APIs. APIs are hard to standardize on, but you can at least standardize on semantics, which is something that, for example, Flink and Beam have been very keen on: trying to have an open discussion and a road map that is very compatible in thinking about streaming semantics. This has been taken to the next level, I would say, with the whole streaming SQL design. Beam is adding streaming SQL and Flink is adding streaming SQL, both in collaboration with the Apache Calcite project, so very similar standardized semantics and so on, and the SQL compliance, so you start to get common interfaces, which is a very important first step, I would say. Standardizing on things like-- >> So SQL semantics are across products that would be within a stream processing architecture?
>> Yes, and I think this will become really powerful once other vendors start to adopt the same interpretation of streaming SQL and think of it as, yes, it's a way to take a changing data table here and project a view of this changing data table, a changing materialized view, into another system, and then use this as a starting point to maybe compute another derived view. You can actually start to think more high-level about things, think really relational queries, dynamic tables, across different pieces of infrastructure. Once you can do something like that, interplay in architectures becomes easier to handle, because even if on the runtime level things behave a bit differently, at least you start to establish a standardized model in thinking about how to compose your architecture, and even if you decide to change along the way, you frequently avoid the problem of having to rip everything out and redesign everything because the next system that you bring in just follows a completely different paradigm. >> Okay, this is helpful. To be continued offline or back online on theCUBE. This is George Gilbert. We were having a very interesting and extended conversation with Stephan Ewen, CTO and co-founder of data Artisans and one of the creators of Apache Flink. And we are at Flink Forward in San Francisco. We will be back after this short break.
Enrico Canzonieri, Yelp, Inc. | Flink Forward 2018
>> Narrator: Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans. (upbeat music) >> Hi, this is George Gilbert, we are at Flink Forward, the conference for the Apache Flink community, sponsored by Data Artisans, which is the company commercializing Flink, and we have with us now Enrico Canzonieri, wait a minute I didn't get that right, Canzonieri, sorry, Enrico from Yelp. And he's going to tell us how, sort of, Flink has taken Yelp by storm over the past year. Why don't we start off with where you were last year in terms of your data pipeline and what challenges you were facing. >> Yeah sure, so we are a Python company in the sense that we develop most of our software in Python, so until last year most of our stream processing was depending on Python. We had developed an in-house framework that was doing Python processing, and that was really what we had; there was no Flink running, and most of the applications were built around a very simple interface that was a process-message function. That was what we expected developers to use, so no real abstraction there. >> Okay so, in other words, it sounds like you had a discrete task, request response or a batch, and then hand it off to the next function, and is that what the pipeline looked like? >> The pipeline was more of a streaming pipeline where we had a Kafka topic as input, and we had these developers who would write this process-message function where each message would be individually processed; that was kind of the semantic of the pipeline there. And then we would get the result of that processing task into another Kafka topic and then get another processing function on top of that. So we could very easily have two or three processing tasks all connected by Kafka topics. Obviously there were big limitations to that kind of architecture, especially when you want to do something more advanced that can include windowing, aggregation, or especially state management. >> Well Kafka has several layers of abstraction and I guess you'd have to go pretty low level to get the windowing, all the windowing and state management capabilities. >> Yeah, it becomes really hard, you basically have to implement it by yourself unless you're using the (mumbles) Confluent platform or you're using what they call Kafka Streams, but we are not using that. >> Oh, not? Okay. >> Obviously we were trying to implement that on top of our simple Python framework from zero. >> So tell us how the choice of Flink, sort of where did it hit your awareness and where did you start mapping it into this pipeline that was Python based? >> Yeah, so we had really, I think, two main use cases, this was last year, that we were struggling to get right and to really get working. The first one was a connector, and the challenge there was to aggregate data locally, scale it to hundreds of streams, and then once we aggregated that locally, upload it to S3. So that was one application; we were really struggling to get that to work because in the framework we had, we had no real abstraction for windowing, so we had this process-message function where it was trying to implement all of that. And also because we were using very low level Kafka primitives, getting scalability was not as straightforward. So that was one application that was pretty challenging; the other one was really a pure stateful application, where we needed to retain the state forever.
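For context, the kind of windowed, keyed aggregation that is painful to hand-roll over per-message functions is a few lines in Flink's Java DataStream API. This is a generic sketch, not Yelp's actual connector: the record shape, key, window size, and the S3 upload step are assumptions.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedAggregation {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Stand-in for a Kafka source of (key, count) records.
    DataStream<Tuple2<String, Long>> records =
        env.fromElements(Tuple2.of("log-a", 1L), Tuple2.of("log-b", 1L));

    records
        .keyBy(0)                        // aggregate locally per key
        .timeWindow(Time.minutes(5))     // the windowing the hand-rolled framework lacked
        .sum(1)                          // framework-managed window state, no custom bookkeeping
        // A file/bucketing sink would then upload the per-window results to S3.
        .print();

    env.execute("windowed aggregation sketch");
  }
}
```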
It was doing a windowed join across streams, so obviously the challenges in that case are even harder, because you have to implement state management from the ground up. >> And all the time semantics. >> Yeah, we had basically no event time semantics. We were not supporting that. >> Okay. >> So we looked at Flink because of event time support, so then we could actually do event time processing. State management support, already implemented, which is way different than implementing it from the ground up. And then obviously the abstractions, so the streaming primitives: you have windows, you have a nice interface that you can use, so for developers who are writing code it becomes easier. >> So let's start with the state management. Help us walk through, like, what capabilities in state management does Flink have relative to the lowest level abstraction you were using in Kafka, or perhaps what Spark Structured Streaming might provide. >> Yeah, so I think the nice features there are really around the fact that the state management is implemented and fully supports the clusterized approach of Flink. So for example, if you're using Kafka, in the Kafka connector Flink already provides a way to represent the state of a Kafka consumer. Also, for operators, if you have a flat map or you have a window, state for windows is already fully supported, so if you are accumulating events in your window, you don't really need to do anything special; that state will be automatically maintained by the Flink framework. That means that if Flink is taking a snapshot, so a checkpoint or a savepoint, all the state that was there will get stored in the checkpoint and then will be able to recover. >> For the full window. >> Yeah. >> It's 'cause it understands the concept of the window when it does a checkpoint. >> Yeah, because there's native support in Flink for that. >> And what's the advantage of having state be integrated with the compute as opposed to compute and then some sort of API to a separate state manager? >> It's definitely code clarity; it's a big simplification of how you implement your code, your streaming application. Because in the end, if for every simple streaming application you need to go ahead and define and implement basically the way your state gets stored, that really makes for a very complex application, especially for maintenance. So in Flink you kind of focus on the business logic. We actually did some tuning on the state manager that was necessary, but the tuning that we did applies in the same way across all the applications we built. So when users want to build an application, they focus on the business logic that they want, and the state is, I would say, more kind of declarative: you say you want this map, you need this list and this state as part of the state, and Flink will take care of actually making sure that that gets into the checkpoint. >> So the sort of semantics of state management are built in at the compute layer, as opposed to going down to an API for a separate service in other implementations. >> Yeah, yeah. >> Okay. All right, we have just a minute left. Tell us about some of the things you're looking forward to doing with Flink, and are they similar to what the dA platform that's coming out from Data Artisans offers, or do you still have a whole bunch of things on the data pipeline that you want to accomplish with just the core functionality?
>> Yeah, definitely. I will say one of the features that we are really excited about is the streaming SQL. I see a lot of potential there for new applications. We actually use streaming SQL at Yelp and we deploy that as a service, so it makes it easier for users to deploy and to develop stream processing applications. We definitely are planning to expand our Flink deployment into just new apps. Especially, one of the things that we try to do, especially building reusable components, and trying to deploy those reusable components, is very coupled with the way we think about our data pipeline. >> Okay, so would it be fair to say that, can you look at the dA platform and say, for companies that are not quite as sophisticated as you, that this is going to make it easier for, you know, mainstream companies to build and deploy, operate? >> I see good potential there. I was looking at the presentation in the morning; I like the Kubernetes integration for sure, since, like, that's where kind of the current trend for application deployment is going. So yeah, I definitely see potential. I think for Yelp we clearly have a complex enough deployment and service integration that it probably won't be a good fit for us. But companies that are approaching the route of Flink now, and will probably have an already existing deployment, they may probably give it a try. >> Okay, all right Enrico, we've got to end it there, but that was very helpful and thanks for stopping by. >> Thanks for having me here. >> Okay. And this is George Gilbert, we are at Flink Forward, the Data Artisans conference for the Apache Flink community, and we will be right back after this short break. (upbeat music)
Holden Karau, Google | Flink Forward 2018
>> Narrator: Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans. (tech music) >> Hi, this is George Gilbert, we're at Flink Forward, the user conference for the Apache Flink community, sponsored by Data Artisans. We are in San Francisco. This is the second Flink Forward conference here in San Francisco. And we have a very eminent guest, with a long pedigree, Holden Karau, formerly of IBM, and of Apache Spark fame, putting Apache Spark and Python together. >> Yes. >> And now, Holden is at Google, focused on the Beam API, which is an API that makes it possible to write portable stream processing applications across Google's Dataflow, as well as Flink and other stream processors. >> Yeah. >> And Holden has been working on integrating it with the Google TensorFlow framework, also open-sourced. >> Yes. >> So, Holden, tell us about the objective of putting these together. What type of use cases.... >> So, I think it's really exciting. And it's still very early days, I want to be clear. If you go out there and run this code, you are going to get a lot of really weird errors, but please tell us about the errors you get. The goal is really, and we see this in Spark with the pipeline APIs, that most of our time in machine learning is spent doing data preparation. We have to get our data in a format where we can do our machine learning on top of it. And the tricky thing about the data preparation is that we also often have to have a lot of the same preparation code available to use when we're making our predictions. And what this means is that a lot of people essentially end up having to write, like, a stream-processing job to do their data preparation, and they have to write a corresponding online serving job, to do similar data preparation for when they want to make real predictions. And by integrating tf.Transform and things like this into the Beam ecosystem, the idea is that people can write their data preparation in a simple, uniform way, that can be taken from the training time into the online serving time, without them having to rewrite their code, removing the potential for mistakes where we, like, change one variable slightly in one place and forget to update it in another. And just really simplifying the deployment process for these models. >> Okay, so help us tie that back to, in this case, Flink. >> Yes. >> And also to clarify, that data prep.... My impression was data prep was a different activity. It was like design time and serving was run time. But you're saying that they can be better integrated? >> So, there's different types of data prep. Some types of data prep would be things like removing invalid records. And if I'm doing that, I don't have to do that at serving time. But one of the classic examples for data prep would be tokenizing my inputs, or performing some kind of hashing transformation. And if I do that, when I get new records to predict, they won't be in a pre-tokenized form, or they won't be hashed correctly. And my model won't be able to serve on these sort of raw inputs. So I have to re-create the data prep logic that I created for training at serving time. >> So, by having a common Beam API and the common provider underneath it, like Flink and TensorFlow, it's the repeatable activities for transforming data to make it ready to feed to a machine-learning model that you want those.... It would be ideal to have those transformation activities be common in your prep work, and then in the production serving. >> Yes, very true.
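To give a rough sense of what that looks like with Beam's Java SDK: the same preprocessing transform can be packaged once and reused, and the pipeline it sits in can run on Flink, Dataflow, or another runner chosen through the pipeline options rather than by rewriting code. The transform logic, file paths, and runner wiring here are a simplified sketch, not Google's tf.Transform integration itself.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class SharedPrepPipeline {

  /** One preprocessing step, written once and reusable for training-data prep and serving-side prep. */
  public static class Tokenize extends PTransform<PCollection<String>, PCollection<String>> {
    @Override
    public PCollection<String> expand(PCollection<String> input) {
      return input.apply("Tokenize", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          // Toy tokenization: lowercase and split on whitespace.
          for (String token : c.element().toLowerCase().split("\\s+")) {
            c.output(token);
          }
        }
      }));
    }
  }

  public static void main(String[] args) {
    // The runner (e.g. --runner=FlinkRunner or --runner=DataflowRunner) comes from the options,
    // so the pipeline definition itself stays portable.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply(TextIO.read().from("gs://my-bucket/raw-events*"))   // hypothetical input path
     .apply(new Tokenize())
     .apply(TextIO.write().to("gs://my-bucket/prepared/part")); // hypothetical output path

    p.run().waitUntilFinish();
  }
}
```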
>> So, tell us what type of customers want to write to the Beam API and have that portability? >> Yeah, so that's a really good question. So, there's a lot of people who really want portability outside of Google Cloud, and that's one group of people, essentially people who want to adopt different Google Cloud technologies but don't want to be locked into Google Cloud forever. Which is completely understandable. There are other people who are more interested in being able to switch streaming engines, like, they want to be able to switch between Spark and Flink. And those are people who want to try out different streaming engines without having to rewrite their entire jobs. >> Does Spark Structured Streaming support the Beam API? >> So, right now, the Spark support for Beam is limited. It's in the old DStream API, it's not on top of the Structured Streaming API. It's a thing we're actively discussing on the mailing list, how to go about doing. Because there's a lot of intricacies involved in bringing new APIs in line. And since it already works there, there's less of a pressure. But it's something that we should look at more of. Where was I going with this? So the other one that I see is, like, Flink is a wonderful API, but it's very Java-focused. And so, Java's great, everyone loves it, but a lot of the cool things that are being done nowadays are being built in Python, like TensorFlow. There's a lot of really interesting machine learning and deep learning stuff happening in Python. Beam gives a way for people to work with Python across these different engines. Flink supports Python, but it's maybe not a first class citizen. And the Beam Python support is still a work in progress. We're working to get it to be better, but it's.... You can see the demos this afternoon, although if you're not here, you can't see the demo, but you can see the work happening in GitHub. And there's also work being done to support Go. >> To support Go. >> Which is a little out of left field. >> So, would it be fair to say that the value of Beam, for potential Flink customers, is they can start on Google Cloud Platform, they can start on one of several stream processors, they can move to another one later, and they also inherit the better language support, or bindings, from the Beam API? >> I think that's very true. The better language support, it's better for some languages, it's probably not as good for others. It's somewhat subjective, like what better language support is. But I think definitely for Go, it's pretty clear. This stuff is all stuff that's in the master branch, it's not released today. But if people are looking to play with it, I think it's really exciting. They can go and check it out from GitHub, and build it locally. >> So, what type of customers do you see who have moved into production with machine learning? >> So the.... >> And the streaming pipelines? >> The biggest customer that's in production, obviously, or not obviously, is Spotify. One of them is Spotify. They give a lot of talks about it. Because I didn't know we were going to be talking today, I didn't have a chance to go through my customer list and see who's okay with us mentioning them publicly. I'll just stick with Spotify. >> Without the names, the sort of use cases and the general industry.... >> I don't want to get in trouble. >> Okay. >> I'm just going to ... sorry. >> Okay. So then, let's talk about, does Google view Dataflow as their sort of strategic successor to MapReduce? >> Yes, so....
>> And is that a competitor then to Flink? >> I think Flink and Dataflow can be used in some of the same cases. But I think they're more complementary. Flink is something you can run on-prem. You can run it with different vendors. And Dataflow is very much like, "I can run this on Google Cloud." And part of the idea with Beam is to make it so that people who want to write Dataflow jobs, but maybe want the flexibility to go back to something else later, can still have that. Yeah, you could swap in Flink or Dataflow execution engines if you're on Google Cloud, but.... We're not, how do I put it nicely? Provided people are running this stuff, they're burning CPU cycles, I don't really care if they're running Dataflow or Flink as the execution engine. Either way, it's a party for me, right? >> George: Okay. >> It's probably one of those, sort of, friendly competitions, where we both push each other to do better and add more features to the respective projects. >> Okay, 30 second question. >> Cool. >> Do you see people building stream processing applications with machine learning as part of it to extend existing apps or for ground-up new apps? >> Totally. I mostly see it as extending existing apps. This is obviously, possibly a bias, just from the people that I talk to. But going ground up with both streaming and machine learning at the same time, like, starting both of those projects fresh, is a really big hurdle to get over. >> George: For skills. >> For skills. It's really hard to pick up both of those at the same time. It's not impossible, but it's much more likely you'll build something ... maybe you'll build a batch machine learning system, realize you want to productionize your results more quickly. Or you'll build a streaming system, and then want to add some machine learning on top of it. Those are the two paths that I see. I don't see people jumping head first into both at the same time. But this could change. Batch has been king for a long time and streaming is getting its day in the sun. So, we could start seeing people becoming more adventurous and doing both at the same time. >> Holden, on that note, we'll have to call it a day. That was most informative. >> It's really good to see you again. >> Likewise. So this is George Gilbert. We're on the ground at Flink Forward, the Apache Flink user conference, sponsored by Data Artisans. And we will be back in a few minutes after this short break. (tech music)
Shuyi Chen, Uber | Flink Forward 2018
>> Announcer: Live from San Francisco, it's theCUBE covering Flink Forward, brought to you by data Artisans. (upbeat music) >> This is George Gilbert. We are at Flink Forward, the user conference for the Apache Flink community, sponsored by data Artisans, the company behind Flink. And we are here with Shuyi Chen from Uber, and Shuyi works on a very important project, which is the Calcite query optimizer, SQL query optimizer, that's used in Apache Flink as well as several other projects. Why don't we start with, Shuyi, tell us where Calcite's used and its role. >> Calcite is basically used in the Flink Table and SQL API, as the SQL parser and query optimizer and planner for Flink. >> OK. >> Yeah. >> So now let's go to Uber and talk about the pipeline or pipelines you guys have been building and then how you've been using Flink and Calcite to enable the SQL API and the Table API. What workloads are you putting on that platform, or on that pipeline? >> Yeah, so basically I'm the technical lead of the stream processing platform in Uber, and so we use Apache Flink as the stream processing engine for Uber. Basically we build two different platforms. One is called AthenaX, which uses Flink SQL. So basically it enables users to use SQL to compose the stream processing logic. And we have a UI, and with one click, they can just deploy the stream processing job in production. >> When you say UI, did you build a custom UI to, essentially, turn it into a business intelligence tool, so you have a visual way of constructing your queries? Is that what you're describing, or? >> Yeah, so it's similar to how you compose, write a SQL query to query a database. We have a UI for you to write your SQL query, with all the syntax highlighting and all the hints, to write a SQL query so that even the data scientists and also non-engineers in general can actually use that UI to compose stream processing logic. >> Okay, give us an example of some applications, 'cause this sounds like it's a high-level API, so it makes it more accessible to a wider audience. So what are some of the things they build? >> So for example, our Uber Eats team, they use the SQL API as the stream processing tool to build their Restaurant Manager dashboard. >> Okay. >> So basically, the data log lives in Kafka, gets real-time streamed into the Flink job, which is composed using the SQL API, and then that gets stored in our OLAP database, Pinot, and then when the restaurant owners open the Restaurant Manager, they will see the dashboard of their real-time earnings and everything. And with the SQL API, they no longer need to write the Flink job, they don't need to use Java or Scala code, or do any testing or debugging. It's all SQL, so they, yeah. >> And then what's the SQL coverage, the SQL semantics that are implemented in the current Calcite engine? >> So it's about basic transformation, projection, and hopping and tumbling windows, and also joins, and group by, and having, and also not to mention the event time and processing time support. >> And you can shuffle from anywhere, you don't have to have two partitions with the same join key on one node. You can have arbitrary, the data placement can be arbitrary for the partitions? >> Well, SQL is declarative, right? And so once the user composes the logic, the underlying planner will actually take care of the key by and group by, everything.
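For a flavor of what such a SQL-defined job looks like under the covers, here is a small Flink Table API sketch in Java that registers a stream as a table and runs a windowed aggregation in SQL, leaving the key-by, shuffle, and state handling to Calcite and the Flink planner. Table names, fields, and the exact registration methods (which have shifted across Flink versions) are assumptions for illustration, not AthenaX's actual code.

```java
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class OrdersPerRestaurant {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

    // Stand-in for a Kafka-backed stream of (restaurantId, amount, currency) order events.
    DataStream<Tuple3<String, Double, String>> orders =
        env.fromElements(Tuple3.of("r-1", 12.5, "USD"), Tuple3.of("r-1", 7.0, "USD"));

    // Register the stream as a table, appending a processing-time attribute for windowing.
    tableEnv.registerDataStream("Orders", orders, "restaurantId, amount, currency, proctime.proctime");

    // The kind of SQL a UI could submit: earnings per restaurant over one-minute tumbling windows.
    Table earnings = tableEnv.sqlQuery(
        "SELECT restaurantId, SUM(amount) AS earnings " +
        "FROM Orders " +
        "GROUP BY TUMBLE(proctime, INTERVAL '1' MINUTE), restaurantId");

    // Calcite and the Flink planner turn this into a keyed, windowed streaming job.
    tableEnv.toAppendStream(earnings, Row.class).print();
    env.execute("orders per restaurant");
  }
}
```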
>> Okay, 'cause the reason I ask is many of the early Hadoop-based MPP SQL engines had the limitation where you had to co-locate the partitions that you were going to join. >> That's the same thing for Flink. >> Oh. >> But the SQL layer just takes care of that. >> Okay. >> So you just describe what you want, and underneath it gets translated into a Flink program that actually will do all the co-location. >> Oh, it does it for you, okay. >> Yeah, yeah. So now they don't even need to learn Flink, they just need to learn the SQL, yeah. >> Now you said there's a second platform that Uber is building on top of Flink. >> Yeah, the second platform is, we call it the Flink as a service platform. So the motivation is, we found that SQL actually cannot satisfy all the advanced needs in Uber to build stream processing, due to the reason, like for example, they will need to call RPC services within their stream processing application, or even chain the RPC calls, which is hard to express in SQL, and also when they are having a complicated DAG, like a workflow, it's very difficult to debug individual stages, so they want the control to actually use the native Flink DataStream API or DataSet API to build their stream or batch job. >> Is the DataSet API the lowest level one? >> No, it's on the same level as the DataStream API, so it's one for streaming, one for batch. >> Okay, DataStream and then the other was Table? >> DataSet. >> Oh, DataSet, DataStream, DataSet. >> Yeah. >> And there's one lower than that, right? >> Yeah, there's one lower API, but usually most people don't use that API. >> So that's system programmers? >> Yeah, yeah. >> So then tell me, who is using, like what type of programmer uses the DataStream or the DataSet API, and what do they build at Uber? >> So for example, in one of the talks later, there's a marketplace team, marketplace dynamics team, that's actually using the platform to do online model update, machine learning model update, using Flink, and so basically they need to take in the model that is trained offline and do a few group bys by time and location and then apply the model, and then incrementally update the model. >> And so are they taking a window of updates and then updating the model and then somehow promoting it as the candidate, or... >> Yeah, yeah, yeah. Something similar, yeah. >> Okay, that's interesting. And what type of, so are these the data scientists who are using this API? >> Well, data scientists, not really, it's not designed for data scientists. >> Oh, so they're just preparing the models offline and then they're being updated in line on the stream processing platform. >> Yes. >> And so it's maybe data engineers who are essentially updating the features that get fed in and are continually training, or updating, the models. >> Basically it's an online model update. So as Kafka events come in, they continue to refine the model. >> Okay, and so as Uber looks out a couple of years, what sorts of things do you see adding to one of these, either of these pipelines, and do you see a shift away from the batch and request-response type workloads towards more continuous processing? >> Yes, actually we do see that trend. Actually, before joining the stream processing platform team in Uber, I was in marketplace as well, and at that point we always saw there's a shift, like people would love to use stream processing technology to actually replace some of the normal backend service applications. >> Tell me some examples.
Yeah, for example... So in our dispatch platform, we have the need to actually shard the workload by, for example, riders, to different hosts to process. For example, compute, say, ETA or compute some of the time averages, and this was before done in backend services, using our internal distributed systems tooling to do the sharding. But actually with Flink, this can just be done very easily, right. And so actually there's a shift, those people will also want to adopt stream processing technology, so long as this is not a request-response style application. >> So the key thing, just to make sure I understand, it's that Flink can take care of the distributed joins, whereas when it was a database-based workload, a DBA had to set up the sharding, and now it's sort of more transparent, like it's more automated? >> I think, it's... More of the support, so before, people writing backend services had to write everything: the state management, the sharding, and everything, they need to-- >> George: Oh, it's not even database-based-- >> Yeah, it's not a database, it's real time. >> So they have to do the physical data management, and Flink takes care of that now? >> Yeah, yeah. >> Oh, got it, got it. >> For some of the applications it's real time, so we don't really need to store the data all the time in the database. So it's usually kept in memory and somehow gets snapshotted. But for a normal backend service, the writer has to do everything. With Flink, there's already built-in support for state management and all the sharding, partitioning, and the time window and aggregation primitives; it's all built in, and they don't need to worry about re-implementing the logic and re-architecting the system again and again. >> So it's a new platform for real time. It gives you a whole lot of services, a higher abstraction for real time applications. >> Yeah, yeah. >> Okay. Alright, with that, Shuyi, we're going to have to call it a day. This was Shuyi Chen from Uber talking about how they're building more and more of their real time platforms on Apache Flink and using a whole bunch of services to complement it. We are at Flink Forward, the user conference of data Artisans for the Apache Flink community. We're in San Francisco, this is the second Flink Forward conference, and we'll be back in a couple minutes, thanks. (upbeat music)
>> Announcer: Live from San Francisco, it's theCUBE covering Flink Forward brought to you by Data Artisans. >> Hi this is George Gilbert. We are at Flink Forward on the ground in San Francisco. This is the user conference for the Apache Flink Community. It's the second one in the US and this is sponsored by Data Artisans. We have with us Greg Benson, who's Chief Scientist at Snap Logic and also professor of computer science at University of San Francisco. >> Yeah that's great, thanks for havin' me. >> Good to have you. So, Greg, tell us a little bit about how Snap Logic currently sets up its, well how it builds its current technology to connect different applications. And then talk about, a little bit, where you're headed and what you're trying to do. >> Sure, sure, so Snap Logic is a data and app integration Cloud platform. We provide a graphical interface that lets you drag and drop. You can open components that we call Snaps and you kind of put them together like Lego pieces to define relatively sophisticated tasks so that you don't have to write Java code. We use machine learning to help you build out these pipelines quickly so we can anticipate based on your data sources, what you are going to need next, and that lends itself to rapid building of these pipelines. We have a couple of different ways to execute these pipelines. You can think of it as sort of this specification of what the pipeline's supposed to do. We have a proprietary engine that we can execute on single notes, either in the Cloud or behind your firewall in your data center. We also have a mode which can translate these pipelines into Spark code and then execute those pipelines at scale. So, you can do sort of small, low latency processing to sort of larger, batch processing on very large data sets. >> Okay, and so you were telling me before that you're evaluating Flink or doing research with Flink as another option. Tell us what use cases that would address that the first two don't. >> Yeah, good question. I'd love to just back up a little bit. So, because I have this dual role of Chief Scientist and as a professor of Computer Science, I'm able to get graduate students to work on research projects for credit, and then eventually as interns at SnapLogic. A recent project that we've been working on since we started last fall so working on about six or seven months now is investigating Flink as a possible new back end for the SnapLogic platform. So this allows us to you know, to explore and prototype and just sort of figure out if there's going to be a good match between an emerging technology and our platform. So, to go back to your question. What would this address? Well, so, without going into too much of the technical differences between Flink and Spark which I imagine has come up in some of your conversations or it comes up here because they can solve similar use cases our experience with Flink is the code base is easy to work with both from taking our specification of pipelines and then converting them into Flink code that can run. But there's another benefit that we see from Flink and that is, whenever any product, whether it's our product or anybody else's product, that uses something like Spark or Flink as a back end, there's this challenge because you're converting something that your users understand into this target, right, this Spark API code or Flink API code. 
And the challenge there is, if something goes wrong, how do you propagate that back to the users so the user doesn't have to read log files or get into the nuts and bolts of how Spark really works? >> It's almost like you've compiled the code, and now if something doesn't work right, you need to work at the source level. >> That's exactly right, and that's what we don't want our users to do, right? >> Right. >> So one promising thing about Flink is that we're able to integrate the code base in such a way that we have a better understanding of what's happening in the failure conditions that occur. And we're working on ways to propagate those back to the user so they can take actionable steps to remedy those without having to understand the Flink API code itself. >> And what is it, then, about Flink or its API that gives you that feedback about errors or, you know, operational status, that gives you better visibility than you would get in something else like Spark? >> Yeah, so without getting too deep on the subject, what we have found is that one nice thing about the Flink code base is that the core is written in Scala, but all the IO and memory handling is written in Java, and that's where we need to do our primary interfacing, the core building blocks to get from, for example, something that you build with the DataSet API to execution. We have found it easier to follow the transformation steps that Flink takes to end up with the resulting sort of optimized Flink pipeline. Now by understanding that transformation, like you were saying, the compilation step, by understanding it, then we can work backwards and understand, when something happens, how to trace it back to what the user was originally trying to specify. >> The GUI specification. >> Yeah. Right. >> So, help me understand, though, it sounds like you're the one essentially building a compiler from a graphical specification language down to Spark as the, you know, sort of pseudo-compiled code, >> Yep. >> Or Flink. And, but if you're the one doing that compilation, I'm still struggling to understand why you would have better reverse engineering capabilities with one. >> It just is a matter of getting visibility into the steps that the underlying frameworks are taking, and so, I'm not saying this is impossible to do in Spark, but we have found that it's been easier for us to get into the transformation steps that Flink is taking. >> Almost like, for someone who's had as much programming as one semester in night school, like a variable inspector that's already there, >> Yeah, that's a good one, there you go, yeah, yeah, yeah. >> Okay, so you don't have to go try and add it yourself, and you don't have to then infer it from all this log data. >> Now, I should add, there's another potential benefit of Flink. You were asking about use cases and what Flink addresses. As you know, Flink is a streaming platform, in addition to being a batch platform, and Flink does streaming differently than how Spark does. Spark takes a microbatch approach. What we're also looking at in my research effort is how to take advantage of Flink's streaming approach to allow the SnapLogic GUI to be used to specify streaming Flink applications.
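One generic way to get the traceability Benson describes is to tag every generated operator with the identity of the GUI node it came from, so that runtime errors and metrics can be mapped back to the pipeline the user drew. The sketch below shows that pattern with Flink's public name() and uid() hooks plus a wrapping exception; the snap id, step name, and deliberately bad input record are assumptions made up for illustration, not SnapLogic's mechanism. The same uid also lets state from a savepoint be matched back to the right operator if the generated job later changes shape.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TraceableOperators {

    // Wraps a user-level transformation so a failure carries the GUI node's identity.
    static class TaggedDoubler implements MapFunction<String, String> {
        private final String snapId;  // hypothetical identifier of the originating GUI node

        TaggedDoubler(String snapId) {
            this.snapId = snapId;
        }

        @Override
        public String map(String value) {
            try {
                return Integer.toString(Integer.parseInt(value) * 2);
            } catch (NumberFormatException e) {
                // Re-throw with pipeline-level context attached, so the error a user
                // eventually sees points at the snap, not at Flink internals.
                throw new RuntimeException(
                        "Step 'double-value' (snap " + snapId + ") failed on input: " + value, e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("1", "2", "not-a-number")   // the last record triggers the wrapped error
           .map(new TaggedDoubler("snap-42"))
           .name("double-value")                     // shows up in the Flink UI, logs and metrics
           .uid("snap-42")                           // stable operator id across savepoint restores
           .print();

        env.execute("traceable-pipeline");
    }
}
```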
Initially we're just focused on the batch mode, but now we're also looking at the potential to convert these graphical pipelines into streaming Flink applications, which would be a great benefit to customers who want-- >> George: Real time integration. >> Want to do what Alibaba and all the other companies are doing, but take advantage of it without having to get into the nuts and bolts of the programming. Do it through the GUI. >> Wow, so it's almost like, it's like, Flink, Beam, in terms of abstraction layers, >> Sure. >> And then SnapLogic. >> Greg: Sure, yes. >> Not that you would compile to Beam, but the idea that you would have batch processing and a real-time pipeline. >> Yes. >> Okay. So that's actually interesting, so that would open up a whole new set of capabilities. >> Yeah and, you know, it follows our company's vision of allowing lots of users to do very sophisticated things without being, you know, Hadoop developers or Spark developers, or even Flink developers. We do a lot of the hard work of trying to give you a representation that's easier to work with, right, but also allow you to sort of evolve that and debug it and also eventually get the performance out of these systems. One of the challenges, of course, of Spark and Flink is that they have to be tuned, and so what we're trying to do, using some of our machine learning, is eventually gather information that can help us identify how to tune different types of workflows in different environments. And if we're able to do that in its entirety, then we take out a lot of the really hard work that goes into making a lot of these streaming applications both scalable and performing. >> Performing. So this would be, but you would have, to do that, you would probably have to collect, well, what's the term? I guess data from the operations of many customers, >> Right. >> Because, as training data, just as the developer alone, you won't really have enough. >> Absolutely, and that's, so you have to bootstrap that. For our machine learning that we currently use today, we leverage, you know, the thousands of pipelines, the trillions of documents that we now process on a monthly basis, and that allows us to provide good recommendations when you're building pipelines, because we have a lot of information. >> Oh, so you are serving the runtime, these runtime compilations. >> Yes. >> Oh, they're not all hosted on the customer premises. >> Oh, no no no, we do both. So it's interesting, we do both. So you can deploy completely in the cloud, we're a complete SaaS provider for you. Most of our customers though, you know, banks, healthcare, want to run our engine behind their firewalls. Even when we do that, though, we still have metadata, sort of anonymized, but we can get introspection into how things are behaving. >> Okay. That's very interesting. Alright, Greg, we're going to have to end it on that note, but, you know, I guess everyone stay tuned. That sounds like a big step forward in sort of specification of real-time pipelines at a graphical level. >> Yeah, well, I hope to be talking to you again soon with more results. >> Looking forward to it. With that, this is George Gilbert, we are at Flink Forward, the user conference for the Apache Flink user community, sponsored by Data Artisans, we will be back shortly. (upbeat music)
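For a sense of what the streaming target discussed above might look like at its simplest, here is a sketch of an unbounded Flink job: lines arrive continuously on a socket and are counted per word over five-second windows, with results emitted as the stream runs rather than at the end of a batch. The host, port, and window size are placeholders; this is a generic Flink streaming example, not generated SnapLogic output.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class StreamingCountJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // An unbounded source: lines arriving on a local socket (placeholder host/port).
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines.flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(Tuple2.of(word, 1L));
                        }
                    }
                })
             .returns(Types.TUPLE(Types.STRING, Types.LONG))   // type hint Flink needs for lambdas
             .keyBy(pair -> pair.f0)
             .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
             .sum(1)                                           // running five-second word counts
             .print();

        env.execute("streaming-count");
    }
}
```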
Kostas Tzoumas, data Artisans | Flink Forward 2018
(techno music) >> Announcer: Live, from San Francisco, it's theCUBE. Covering Flink Forward, brought to you by data Artisans. (techno music) >> Hello again everybody, this is George Gilbert, we're at the Flink Forward Conference, sponsored by data Artisans, the provider of both Apache Flink and the commercial distribution, the dA Platform, that supports the productionization and operationalization of Flink and makes it more accessible to mainstream enterprises. We're privileged to have Kostas Tzoumas, CEO of data Artisans, with us today. Welcome Kostas. >> Thank you. Thank you George. >> So, tell us, let's start with sort of an idealized application use case that is in the sweet spot of Flink, and then let's talk about how that's going to broaden over time. >> Yeah, so just a little bit of an umbrella above that. So what we see very, very consistently, we see it in modern tech companies, and we see it in traditional enterprises that are trying to move there, is a move towards a business that runs in real time. Runs 24/7, is data-driven, so decisions are made based on data, and is software operated. So increasingly decisions are made by AI, by software, rather than someone looking at something and making a decision, yeah. So for example, some of the largest users of Apache Flink are companies like Uber, Netflix, Alibaba, Lyft; they are all working in this way. >> Can you tell us about the size of their, you know, something in terms of records per day, or cluster size, or, >> Yeah, sure. So, latest I heard, Alibaba is powering Alibaba Search with more than a thousand nodes, terabytes of state; I'm pretty sure they will give us bigger numbers today. Netflix has reported doing about one trillion events per day. >> George: Wow. >> On Flink. So pretty big sizes. >> So and is Netflix, I think I read, is powering their real time recommendation updates. >> They are powering a bunch of things, a bunch of applications, there's a lot of routing events internally. I think they have a talk, they had a talk definitely at the last conference, where they talk about this. And it's really a variety of use cases. It's really about building a platform, internally, and offering it to all sorts of departments in the company, be that for recommendations, be that for BI, be that for running stateful microservices, you know, all sorts of things. And we also see the more traditional enterprise moving to this modus operandi. For example, ING is also one of our biggest partners, it's a global consumer bank based in the Netherlands, and their CEO is saying that ING is not a bank, it's a tech company that happens to have a banking license. It's a tech company that inherited a banking license. So that's how they want to operate. So what we see is stream processing is really the enabler for this kind of business, for this kind of modern business where they interact with the consumer in real time, they push notifications, they can change the pricing, et cetera, et cetera. So this is really the crux of stateful stream processing, for me. >> So okay, so tell us, for those who, you know, have a passing understanding of how Kafka's evolving, how Apache Spark and Structured Streaming's evolving, as distinct from, but also, Databricks. What is it about having state management that's sort of integrated, that for example, might make it easy to elastically change a cluster size by repartitioning.
What can you assume about managing state internally, that makes things easier? >> Yeah, so I think really the sweet spot of Flink is that if you are looking for a stream processing engine, and for a stateful stream processing engine for that matter, Flink is the definition of this. It's the definitive solution to this problem. It was created from scratch with this in mind, it was not sort of a bolt-on on top of something else, so it's streaming from the get-go. And we have done a lot of work to make state a first-class citizen. What this means is that in Flink programs, you can keep state that scales to terabytes, we have seen that, and you can manage this state together with your application. So Flink has this model based on checkpoints, where you take a checkpoint of your application and state together, and you can restart at any time from there. So it's really, the core of Flink is around state management. >> And you manage exactly-once semantics across the checkpointing? >> It's exactly once, it's application-level exactly once. We have also introduced end-to-end exactly once with Kafka. So Kafka-Flink-Kafka exactly once. So fully consistent. >> Okay so, let's drill down a little bit. What are some of the things that customers would do with an application running on, let's say, a big cluster or a couple clusters, where they want to operate both on the application logic and on the state, that having it integrated, you know, makes much easier?
>> So what we have in the dA Platform is a notion of a deployment which is, think of it as, I think of it as a cluster, but it's basically based on containers. So you have this notion of deployments that you can manage, (coughs) sorry, and then you have a notion of an application. And an application, is a Flink job that evolves over time. And then you have a very, you know, bird's-eye view on this. You can, when you update the code, this is the same application with updated code. You can travel through a history, you can visit the logs, and you can do common operational tasks, like as I said, rescaling, updating the code, rollbacks, replays, migrate to a new deployment target, et cetera. >> Let me ask you, outside of the big tech companies who have built much of the application management scaffolding themselves, you can democratize access to stream processing because the capabilities, you know, are not in the skill set of traditional, mainstream developers. So question, the first thing I hear from a lot of sort of newbies, or people who want to experiment, is, "Well, it's so easy to manage the state "in a shared database, even if I'm processing, "you know, continuously." Where should they make the trade-off? When is it appropriate to use a shared database? Maybe you know, for real OLTP work, and then when can you sort of scale it out and manage it integrally with the rest of the application? >> So when should we use a database and when should we use streaming, right? >> Yeah, and even if it's streaming with the embedded state. >> Yeah, that's a very good question. I think it really depends on the use case. So what we see in the market, is many enterprises start with with a use case that either doesn't scale, or it's not developer friendly enough to have these database application levels. Level separation. And then it quickly spreads out in the whole company and other teams start using it. So for example, in the work we did with ING, they started with a fraud detection application, where the idea was to load models dynamically in the application, as the data scientists are creating new models, and have a scalable fraud detection system that can handle their load. And then we have seen other teams in the company adopting processing after that. >> Okay, so that sounds like where the model becomes part of the application logic and it's a version of the application logic and then, >> The version of the model >> Is associated with the checkpoint >> Correct. >> So let me ask you then, what happens when you you're managing let's say terabytes of state across a cluster, and someone wants to query across that distributed state. Is there in Flink a query manager that, you know, knows about where all the shards are and the statistics around the shards to do a cost-based query? >> So there is a feature in Flink called queryable state that gives you the ability to do, very simple for now, queries on the state. This feature is evolving, it's in progress. And it will get more sophisticated and more production-ready over time. >> And that enables a different class of users. >> Exactly, I wouldn't, like to be frank, I wouldn't use it for complex data warehousing scenarios. That still needs a data warehouse, but you can do point queries and a few, you know, slightly more sophisticated queries. >> So this is different. This type of state would be different from like in Kafka where you can store you know the commit log for X amount of time and then replay it. 
This, it's in a database I assume, not in a log form and so, you have faster access. >> Exactly, and it's placed together with a log, so, you can think of the state in Flink as the materialized view of the log, at any given point in time, with various versions. >> Okay. >> And really, the way replay works is, roll back the state to a prior version and roll back the log, the input log, to that same logical time. >> Okay, so how do you see Flink spreading out, now that it's been proven in the most demanding customers, and now we have to accommodate skills, you know, where the developers and DevOps don't have quite the same distributed systems knowledge? >> Yeah, I mean we do a lot of work at data Artisans with financial services, insurance, very traditional companies, but it's definitely something that is work in progress in the sense that our product the dA Platform makes operation smarts easier. This was a common problem everywhere, this was something that tech companies solved for themselves, and we wanted to solve it for everyone else. Application development is yet another thing, and as we saw today in the last keynote, we are working together with Google and the BIM Community to bring Python, GOLD, all sorts of languages into Flink. >> Okay so that'll help at the developer level, and you're also doing work at the operations level with the platform. >> And of course there's SQL right? So Flink has Stream SQL which is standard SQL. >> And would you see, at some point, actually sort of managing the platform for customers, either on-prem or in the cloud? >> Yeah, so right now, the platform is running on Kubernetes, which means that typically the customer installs it in their clusters, in their Kubernetes clusters. Which can be either their own machines, or it can be a Kubernetes service from a cloud vendor. Moving forward I think it will be very interesting yes, to move to more hosted solutions. Make it even easier for people. >> Do you see a breakpoint or a transition between the most sophisticated customers who, either are comfortable on their own premises, or who were cloud, sort of native, from the beginning, and then sort of the rest of the mainstream? You know, what sort of applications might they move to the cloud or might coexist between on-prem and the cloud? >> Well I think it's clear that the cloud is, you know, every new business starts on the cloud, that's clear. There's a lot of enterprise that is not yet there, but there's big willingness to move there. And there's a lot of hybrid cloud solutions as well. >> Do you see mainstream customers rewriting applications because they would be so much more powerful in stream processing, or do you see them doing just new applications? >> Both, we see both. It's always easier to start with a new application, but we do see a lot of legacy applications in big companies that are not working anymore. And we see those rewritten. And very core applications, very core to the business. >> So could that be, could you be sort of the source and in an analytic processing for the continuous data and then that sort of feeds a transaction and some parameters that then feed a model? >> Yeah. >> Is that, is that a, >> Yeah. >> so in other words you could augment existing OLTP applications with analytics then inform them in real time essentially. >> Absolutely. >> Okay, 'cause that sounds like then something that people would build around what exists. >> Yeah, I mean you can do, you can think of stream processing, in a way, as transaction processing. 
It's not a dedicated OLTP store, but you can think of it in this flipped architecture, right? Like the log is essentially the redo log, you know, and then you create the materialized views, that's the write path, and then you have the read path, which is queryable state. This is this whole CQRS idea, right? >> Yeah, Command Query Responsibility Segregation. >> Exactly. >> So, this is actually interesting, and I guess this is critical, it's sort of like a new way of doing distributed databases. I know that's not the word you would choose, but it's like the derived data, managed by, sort of coming off of the state changes, then in the stream processor that goes through a single sort of append-only log, and then reading, and how do you manage consistency on the materialized views, that derived data? >> Yeah, so we have seen Flink users implement that. So we have seen, you know, companies really base the complete product on the CQRS pattern. I think this is a little bit further out. Consistency-wise, Flink gives you the exactly-once consistency on the write path, yeah. What we see a lot more is an architecture where there's a lot of transactional stores in the front end that are running, and then there needs to be some kind of global, single source of truth between all of them. And a very typical way to do that is to get these logs into a stream, and then have a Flink application that can actually scale to that, and create a single source of truth from all of these transactional stores. >> And by having, by feeding the transactional stores into this sort of hub, I presume, some cluster as a hub, and even if it's in the form of sort of a log, how can you replay it with sufficient throughput, I guess not to be a data warehouse, but to, you know, have low latency for updating the derived data? And is that derived data, I assume, in non-Flink products? >> Yeah, so the way it works is that, you know, you can get the change logs from the databases, you can use something like Kafka to buffer them up, and then you can use Flink for all the processing, and to do the reprocessing with Flink, this is really one of the core strengths of Flink. Basically what you do is you replay the Flink program together with the state, and you can get really, really high throughput reprocessing there. >> Where does the super high throughput come from? Is that because of the integration of state and logic? >> Yeah, that is because Flink is a true streaming engine. It is a high-performance streaming engine. And it manages the state, there's no tier, >> Crossing a boundary? >> no tier crossing and there's no boundary crossing when you access state. It's embedded in the Flink application. >> Okay, so that you can optimize the IO path? >> Correct. >> Okay, very, very interesting. So, it sounds like the Kafka guys, the Confluent folks, their aspirations, from the last time we talked to 'em, don't extend to analytics, you know, I don't know whether they want partners to do that, but it sounds like they have a similar topology, but I'm not clear how much of a first-class citizen state is, other than the log. How would you characterize the trade-offs between the two? >> Yeah, so, I mean, obviously I cannot comment on Confluent, but what I think is that the state and the log are two very different things. You can think of the log as storage, it's a kind of hot storage because it's the most recent data, but, you know, you cannot query it, it's not a materialized view, right. So for me the separation is between processing state and storage.
The log is a kind of storage, sort of a message queue. State is really the active data, the real-time active data that needs to have consistency guarantees, and that's a completely different thing. >> Okay, and that's the, you're managing, it's almost like you're managing under the covers a distributed database. >> Yes, kind of. Yeah, a distributed key-value store if you wish. >> Okay, okay, and then that's exposed through multiple interfaces, DataStream, Table. >> DataStream, Table API, SQL, other languages in the future, et cetera. >> Okay, so going further down the line, how do you see the sort of use cases that are going to get you across the chasm from the big tech companies into the mainstream? >> Yeah, so we are already seeing that a lot. So we're doing a lot of work with financial services, insurance companies, a lot of very traditional businesses. And it's really a lot about maintaining a single source of truth, becoming more real-time in the way they interact with the outside world and the customer; they do see the need to transform. If we take financial services and investment banks for example, there is a big push in this industry to modernize the IT infrastructure, to get rid of legacy, to adopt modern solutions, become more real-time, et cetera. >> And so they really needed this, like the application platform, the dA Platform, because operationalizing what Netflix did isn't going to be very easy, maybe, for non-tech companies. >> Yeah, I mean, you know, it's always a trade-off, right, and you know, some companies build, some companies buy, and for many companies it's much more sensible to buy. That's why we have software products. And really, our motivation was that we worked in the open-source Flink community with all the big tech companies. We saw their successes, we saw what they built, we saw, you know, their failures. We saw everything and we decided to build this for everybody else, for everyone that, you know, is not Netflix, is not Uber, cannot hire software developers so easily, or with such good quality. >> Okay, alright, on that note, Kostas, we're going to have to end it, and to be continued, one with Stefan next, apparently. >> Nice. >> And then hopefully next year as well. >> Nice. Thank you. >> Alright, thanks Kostas. >> Thank you George. Alright, we're with Kostas Tzoumas, CEO of data Artisans, the company behind Apache Flink and now the application platform that makes Flink run for mainstream enterprises. We will be back, after this short break. (techno music)
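As a closing illustration of the Kafka-Flink-Kafka exactly-once pattern discussed above, here is a hedged sketch using Flink's Kafka source and sink with transactional delivery tied to checkpoints. Connector APIs differ across Flink versions, and the broker address, topic names, and trivial map step are placeholders standing in for real change-log consolidation logic.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaFlinkKafkaJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // End-to-end exactly-once requires checkpointing; Kafka transactions commit on checkpoint.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Placeholder broker and topic names.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("orders-changelog")
                .setGroupId("flink-consolidator")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("orders-consolidated")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("orders-consolidator")   // required for exactly-once sinks
                .build();

        DataStream<String> changelog =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders-changelog");

        changelog
                .map(String::trim)                 // stand-in for real consolidation logic
                .returns(Types.STRING)
                .sinkTo(sink);

        env.execute("kafka-flink-kafka-exactly-once");
    }
}
```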