Greg Rokita, Edmunds.com & Joel Minnick, Databricks | AWS re:Invent 2021
>>We'll come back to the cubes coverage of AWS reinvent 2021, the industry's most important hybrid event. Very few hybrid events, of course, in the last two years. And the cube is excited to be here. Uh, this is our ninth year covering AWS reinvent this the 10th reinvent we're here with Joel Minnick, who the vice president of product and partner marketing at smoking hot company, Databricks and Greg Rokita, who is executive director of technology at Edmonds. If you're buying a car or leasing a car, you gotta go to Edmund's. We're gonna talk about busting data silos, guys. Great to see you again. >>Welcome. Welcome. Glad to be here. >>All right. So Joel, what the heck is a lake house? This is all over the place. Everybody's talking about lake house. What is it? >>And it did well in a nutshell, a Lakehouse is the ability to have one unified platform to handle all of your traditional analytics workloads. So your BI and reporting Trisha, the lake, the workloads that you would have for your data warehouse on the same platform as the workloads that you would have for data science and machine learning. And so if you think about kind of the way that, uh, most organizations have built their infrastructure in the cloud today, what we have is generally customers will land all their data in a data lake and a data lake is fantastic because it's low cost, it's open. It's able to handle lots of different kinds of data. Um, but the challenges that data lakes have is that they don't necessarily scale very well. It's very hard to govern data in a data lake house. It's very hard to manage that data in a data lake, sorry, in a, in a data lake. >>And so what happens is that customers then move the data out of a data lake into downstream systems and what they tend to move it into our data warehouses to handle those traditional reporting kinds of workloads that they have. And they do that because data warehouses are really great at being able to have really great scale, have really great performance. The challenge though, is that data warehouses really only work for structured data. And regardless of what kind of data warehouse you adopt, all data warehouse and platforms today are built on some kind of proprietary format. So once you've put that data into the data warehouse, that's, that is kind of what you're locked into. The promise of the data lake house was to say, look, what if we could strip away all of that complexity and having to move data back and forth between all these different systems and keep the data exactly where it is today and where it is today is in the data lake. >>And then being able to apply a transaction layer on top of that. And the Databricks case, we do that through a technology and open source technology called data lake, or sorry, Delta lake. And what Delta lake allows us to do is when you need it, apply that performance, that reliability, that quality, that scale that you would expect out of a data warehouse directly on your data lake. And if I can do that, then what I'm able to do now is operate from one single source of truth that handles all of my analytics workloads, both my traditional analytics workloads and my data science and machine learning workloads, and being able to have all of those workloads on one common platform. It means that now not only do I get much, much more simple in the way that my infrastructure works and therefore able to operate at much lower costs, able to get things to production much, much faster. >>Um, but I'm also able to now to leverage open source in a much bigger way being that lake house is inherently built on an open platform. Okay. So I'm no longer locked into any kind of data format. And finally, probably one of the most, uh, lasting benefits of a lake house is that all the roles that have to take that have to touch my data for my data engineers, to my data analyst, my data scientists, they're all working on the same data, which means that collaboration that has to happen to go answer really hard problems with data. I'm now able to do much, much more easy because those silos that traditionally exist inside of my environment no longer have to be there. And so Lakehouse is that is the promise to have one single source of truth, one unified platform for all of my data. Okay, >>Great. Thank you for that very cogent description of what a lake house is now. Let's I want to hear from the customer to see, okay, this is what he just said. True. So actually, let me ask you this, Greg, because the other problem that you, you didn't mention about the data lake is that with no schema on, right, it gets messy and Databricks, I think, correct me if I'm wrong, has begun to solve that problem, right? Through series of tooling and AI. That's what Delta liked us. It's a man, like it's a managed service. Everybody thought you were going to be like the cloud era of spark and Brittany Britain, a brilliant move to create a managed service. And it's worked great. Now everybody has a managed service, but so can you paint a picture at Edmonds as to what you're doing with, maybe take us through your journey the early days of a dupe, a data lake. Oh, that sounds good. Throw it in there, paint a picture as to how you guys are using data and then tie it into what y'all just said. >>As Joel said, that they'll the, it simplifies the architecture quite a bit. Um, in a modern enterprise, you have to deal with a variety of different data sources, structured semi-structured and unstructured in the form of images and videos. And with Delta lake and built a lake, you can have one system that handles all those data sources. So what that does is that basically removes the issue of multiple systems that you have to administer. It lowers the cost, and it provides consistency. If you have multiple systems that deal with data, you always arise as the issue as to which data has to be loaded into which system. And then you have issues with consistency. Once you have issues with consistency, business users, as analysts will stop trusting your data. So that was very critical for us to unify the system of data handling in the one place. >>Additionally, you have a massive scalability. So, um, I went to the talk with from apple saying that, you know, he can process two years worth of data. Instead of just two days in an Edmonds, we have this use case of backfilling the data. So often we changed the logic and went to new. We need to reprocess massive amounts of data with the lake house. We can reprocess months worth of data in, in a matter of minutes or hours. And additionally at the data lake houses based on open, uh, open standards, like parquet that allowed us, allowed us to basically hope open source and third-party tools on top of the Delta lake house. Um, for example, a Mattson, we use a Matson for data discovery, and finally, uh, the lake house approach allows us for different skillsets of people to work on the same source data. We have analysts, we have, uh, data engineers, we have statisticians and data scientists using their own programming languages, but working on the same core of data sets without worrying about duplicating data and consistency issues between the teams. >>So what, what is, what are the primary use cases where you're using house Lakehouse Delta? >>So, um, we work, uh, we have several use cases, one of them more interesting and important use cases as vehicle pricing, you have used the Edmonds. So, you know, you go to our website and you use it to research vehicles, but it turns out that pricing and knowing whether you're getting a good or bad deal is critical for our, uh, for our business. So with the lake house, we were able to develop a data pipeline that ingests the transactions, curates the transactions, cleans them, and then feeds that curated a curated feed into the machine learning model that is also deployed on the lake house. So you have one system that handles this huge complexity. And, um, as you know, it's very hard to find unicorns that know all those technologies, but because we have flexibility of using Scala, Java, uh, Python and SQL, we have different people working on different parts of that pipeline on the same system and on the same data. So, um, having Lakehouse really enabled us to be very agile and allowed us to deploy new sources easily when we, when they arrived and fine tune the model to decrease the error rates for the price prediction. So that process is ongoing and it's, it's a very agile process that kind of takes advantage of the, of the different skill sets of different people on one system. >>Because you know, you guys democratized by car buying, well, at least the data around car buying because as a consumer now, you know, I know what they're paying and I can go in of course, but they changed their algorithms as well. I mean, the, the dealers got really smart and then they got kickbacks from the manufacturer. So you had to get smarter. So it's, it's, it's a moving target, I guess. >>Great. The pricing is actually very complex. Like I, I don't have time to explain it to you, but knowing, especially in this crazy market inflationary market where used car prices are like 38% higher year over year, and new car prices are like 10% higher and they're changing rapidly. So having very responsive pricing model is, is extremely critical. Uh, you, I don't know if you're familiar with Zillow. I mean, they almost went out of business because they mispriced their, uh, their houses. So, so if you own their stock, you probably under shorthand of it, but, you know, >>No, but it's true because I, my lease came up in the middle of the pandemic and I went to Edmonds, say, what's this car worth? It was worth like $7,000. More than that. Then the buyout costs the residual value. I said, I'm taking it, can't pass up that deal. And so you have to be flexible. You're saying the premises though, that open source technology and Delta lake and lake house enabled that flexible. >>Yes, we are able to ingest new transactions daily recalculate our model within less than an hour and deploy the new model with new pricing, you know, almost real time. So, uh, in this environment, it's very critical that you kind of keep up to date and ingest their latest transactions as they prices change and recalculate your model that predicts the future prices. >>Because the business lines inside of Edmond interact with the data teams, you mentioned data engineers, data scientists, analysts, how do the business people get access to their data? >>Originally, we only had a core team that was using Lakehouse, but because the usage was so powerful and easy, we were able to democratize it across our units. So other teams within software engineering picked it up and then analysts picked it up. And then even business users started using the dashboarding and seeing, you know, how the price has changed over time and seeing other, other metrics within the, >>What did that do for data quality? Because I feel like if I'm a business person, I might have context of the data that an analyst might not have. If they're part of a team that's servicing all these lines of business, did you find that the data quality, the collaboration affected data? >>Th the biggest thing for us was the fact that we don't have multiple systems now. So you don't have to load the data. Whenever you have to load the data from one system to another, there is always a lag. There's always a delay. There is always a problematic job that didn't do the copy correctly. And the quality is uncertain. You don't know which system tells you the truth. Now we just have one layer of data. Whether you do reports, whether you're data processing or whether you do modeling, they all read the same data. And the second thing is that with the dashboarding capabilities, people that were not very technical that before we could only use Tableau and Tableau is not the easiest thing to use as if you're not technical. Now they can use it. So anyone can see how our pricing data looks, whether you're an executive, whether you're an analyst or a casual business users, >>But Hey, so many questions, you guys are gonna have to combat. I'm gonna run out of time, but you now allow a consumer to buy a car directly. Yes. Right? So that's a new service that you launched. I presume that required new data. We give, we >>Give consumers offers. Yes. And, and that offer you >>Offered to buy my league. >>Exactly. And that offer leverages the pricing that we develop on top of the lake house. So the most important thing is accurately giving you a very good offer price, right? So if we give you a price, that's not so good. You're going to go somewhere else. If we give you price, that's too high, we're going to go bankrupt like Zillow debt, right. >>It took to enable that you're working off the same dataset. Yes. You're going to have to spin up a, did you have to inject new data? Was there a new data source that we're working on? >>Once we curate the data sources and once we clean it, we see the directly to the model. And all of those components are running on the lake house, whether you're curating the data, cleaning it or running the model. The nice thing about lake house is that machine learning is the first class citizen. If you use something like snowflake, I'm not going to slam snowflake here, but you >>Have two different use case. You have >>To, you have to load it into a different system later. You have to load it into a different system. So like good luck doing machine learning on snowflake. Right. >>Whereas, whereas Databricks, that's kind of your raison d'etre >>So what are your, your, your data engineer? I feel like I should be a salesman or something. Yeah. I'm not, I'm not saying that. Just, just because, you know, I was told to, like, I'm saying it because of that's our use case, >>Your use case. So question for each of you, what, what business results did you see when you went to kind of pre lake house, post lake house? What are the, any metrics you can share? And then I wonder, Joel, if you could share a sort of broader what you're seeing across your customer base, but Greg, what can you tell us? Well, >>Uh, before their lake house, we had two different systems. We had one for processing, which was still data breaks. And the second one for serving and we iterated over Nateeza or Redshift, but we figured that maintaining two different system and loading data from one to the other was a huge overhead administration security costs. Um, the fact that you had to consistency issues. So the fact that you can have one system, um, with, uh, centralized data, solves all those issues. You have to have one security mechanism, one administrative mechanism, and you don't have to load the data from one system to the other. You don't have to make compromises. >>It's scale is not a problem because of the cloud, >>Because you can spend clusters at will for different use cases. So your clusters are independent. You have processing clusters that are not affecting your serving clusters. So, um, in the past, if you were running a serving, say on Nateeza or Redshift, if you were doing heavy processing, your reports would be affected, but now all those clusters are separated. So >>Consumer data consumer can take that data from the producer independ >>Using its own cluster. Okay. >>Yeah. I'll give you the final word, Joel. I know it's been, I said, you guys got to come back. This is what have you seen broadly? >>Yeah. Well, I mean, I think Greg's point about scale. It's an interesting one. So if you look at cross the entire Databricks platform, the platform is launching 9 million VMs every day. Um, and we're in total processing over nine exabytes a month. So in terms of just how much data the platform is able to flow through it, uh, and still maintain a extremely high performance is, is bar none out there. And then in terms of, if you look at just kind of the macro environment of what's happening out there, you know, I think what's been most exciting to watch or what customers are experiencing traditionally or, uh, on the traditional data warehouse and kinds of workloads, because I think that's where the promise of lake house really comes into its own is saying, yes, I can run these traditional data warehousing workloads that require a high concurrency high scale, high performance directly on my data lake. >>And, uh, I think probably the two most salient data points to raise up there is, uh, just last month, Databricks announced it's set the world record for the, for the, uh, TPC D S 100 terabyte benchmark. So that is a place where Databricks at the lake house architecture, that benchmark is built to measure data warehouse performance and the lake house beat data warehouse and sat their own game in terms of overall performance. And then what's that spends from a price performance standpoint, it's customers on Databricks right now are able to enjoy that level of performance at 12 X better price performance than what cloud data warehouses provide. So not only are we jumping on this extremely high scale and performance, but we're able to do it much, much more efficiently. >>We're gonna need a whole nother section second segment to talk about benchmarking that guys. Thanks so much, really interesting session and thank you and best of luck to both join the show. Thank you for having us. Very welcome. Okay. Keep it right there. Everybody you're watching the cube, the leader in high-tech coverage at AWS reinvent 2021
SUMMARY :
Great to see you again. Glad to be here. This is all over the place. and reporting Trisha, the lake, the workloads that you would have for your data warehouse on And regardless of what kind of data warehouse you adopt, And what Delta lake allows us to do is when you need it, that all the roles that have to take that have to touch my data for as to how you guys are using data and then tie it into what y'all just said. And with Delta lake and built a lake, you can have one system that handles all Additionally, you have a massive scalability. So you have one system that So you had to get smarter. So, so if you own their stock, And so you have to be flexible. less than an hour and deploy the new model with new pricing, you know, you know, how the price has changed over time and seeing other, other metrics within the, lines of business, did you find that the data quality, the collaboration affected data? So you don't have to load But Hey, so many questions, you guys are gonna have to combat. So the most important thing is accurately giving you a very good offer did you have to inject new data? I'm not going to slam snowflake here, but you You have To, you have to load it into a different system later. Just, just because, you know, I was told to, And then I wonder, Joel, if you could share a sort of broader what you're seeing across your customer base, but Greg, So the fact that you can have one system, So, um, in the past, if you were running a serving, Okay. This is what have you seen broadly? So if you look at cross the entire So not only are we jumping on this extremely high scale and performance, but we're able to do it much, Thanks so much, really interesting session and thank you and best of luck to both join the show.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Joel | PERSON | 0.99+ |
Greg | PERSON | 0.99+ |
Joel Minnick | PERSON | 0.99+ |
$7,000 | QUANTITY | 0.99+ |
Greg Rokita | PERSON | 0.99+ |
38% | QUANTITY | 0.99+ |
two days | QUANTITY | 0.99+ |
10% | QUANTITY | 0.99+ |
Java | TITLE | 0.99+ |
Databricks | ORGANIZATION | 0.99+ |
two years | QUANTITY | 0.99+ |
one system | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
Scala | TITLE | 0.99+ |
apple | ORGANIZATION | 0.99+ |
Python | TITLE | 0.99+ |
SQL | TITLE | 0.99+ |
ninth year | QUANTITY | 0.99+ |
last month | DATE | 0.99+ |
lake house | ORGANIZATION | 0.99+ |
two different systems | QUANTITY | 0.99+ |
Tableau | TITLE | 0.99+ |
2021 | DATE | 0.99+ |
9 million VMs | QUANTITY | 0.99+ |
second thing | QUANTITY | 0.99+ |
less than an hour | QUANTITY | 0.99+ |
Lakehouse | ORGANIZATION | 0.98+ |
12 X | QUANTITY | 0.98+ |
Delta | ORGANIZATION | 0.98+ |
Delta lake house | ORGANIZATION | 0.98+ |
one layer | QUANTITY | 0.98+ |
one common platform | QUANTITY | 0.98+ |
both | QUANTITY | 0.97+ |
AWS | ORGANIZATION | 0.97+ |
Zillow | ORGANIZATION | 0.97+ |
Brittany Britain | PERSON | 0.97+ |
Edmunds.com | ORGANIZATION | 0.97+ |
two different system | QUANTITY | 0.97+ |
Edmonds | ORGANIZATION | 0.97+ |
over nine exabytes a month | QUANTITY | 0.97+ |
today | DATE | 0.96+ |
Lakehouse Delta | ORGANIZATION | 0.96+ |
Delta lake | ORGANIZATION | 0.95+ |
Trisha | PERSON | 0.95+ |
data lake | ORGANIZATION | 0.94+ |
Mattson | ORGANIZATION | 0.92+ |
second segment | QUANTITY | 0.92+ |
each | QUANTITY | 0.92+ |
Matson | ORGANIZATION | 0.91+ |
two most salient data points | QUANTITY | 0.9+ |
Edmonds | LOCATION | 0.89+ |
100 terabyte | QUANTITY | 0.87+ |
one single source | QUANTITY | 0.86+ |
first class | QUANTITY | 0.85+ |
Nateeza | TITLE | 0.85+ |
one security | QUANTITY | 0.85+ |
Redshift | TITLE | 0.84+ |