Steven Mih, Ahana & Girish Baliga, Uber | CUBE Conversation
(bright music)
>> Hey everyone, welcome to this CUBE conversation featuring Ahana. I'm your host, Lisa Martin. I've got two guests here with me today. Steven Mih joins us, Presto Foundation governing board member and co-founder and CEO of Ahana, and Girish Baliga, Presto Foundation governing board chair and senior engineering manager at Uber. Guys, thanks for joining us.
>> Thanks for having us.
>> Thanks for having us.
>> So we're going to dig into and unpack Presto in the next few minutes or so, but Steven, let's go ahead and start with you. Talk to us about some of the challenges in the open data lakehouse market. What are some of those key challenges that organizations are facing?
>> Yeah, just pulling up the slide, you know, what we see is that many organizations are dealing with a lot more data and very different data types. Putting all of that into the data warehouse, which has traditionally been the workhorse for BI and analytics, becomes very, very expensive, and there's a lot of lock-in associated with that. And so what's happening is that people are putting the data, semistructured and unstructured data for example, in cloud data lakes or other data lakes, and they find that they can query it directly with a SQL query engine like Presto. And that lets you have a much more flexible approach to getting insights out of your data. That's what this is all about, and that's why companies are moving to a modern architecture. Girish, maybe you can share some of your thoughts on how Uber uses Presto for this.
>> Yeah, at Uber we use Presto in our internal deployments. So at Uber we have our own data centers, we store data locally in our data centers, but we have made the conscious choice to go with an open data stack. Our entire data stack is built around open source technologies like Hadoop, Hive, Spark and Presto. And so Presto is an invaluable engine that is able to connect to all these different storage systems and data formats, and allows us to have a single entry point for our users to run their SQL queries and get insights rather quickly compared to some of the other engines that we have at Uber.
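To make that single-entry-point idea concrete, here is a minimal sketch of what an analyst-side query might look like, using the open-source presto-python-client. The coordinator host, catalog, and the trips table and its columns are hypothetical placeholders, not Uber's actual deployment.

```python
# A minimal sketch of querying an open data lake through Presto from Python,
# via the open-source presto-python-client. The host, catalog, table, and
# columns below are illustrative assumptions, not a real deployment.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator
    port=8080,
    user="analyst",
    catalog="hive",     # lake files (e.g. Parquet/ORC) registered in a Hive metastore
    schema="default",
)
cur = conn.cursor()
# Standard ANSI SQL runs directly against the files in the lake.
cur.execute("""
    SELECT city_id, COUNT(*) AS trips
    FROM trips
    WHERE ds = '2022-03-01'
    GROUP BY city_id
    ORDER BY trips DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```

The point of the sketch is the architecture, not the API: the same SQL entry point fans out to whatever storage systems and formats the connectors support.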
>> So let's talk a little bit about Presto so that the audience gets a good overview of it. Steven, starting with you: you talked about the challenges of the traditional data warehouse. Talk to us about why the Presto open source project was founded. Give us that background information if you will.
>> Absolutely. So Presto was originally developed at the biggest hyperscaler out there, which is Facebook, now known as Meta. And they open sourced the project and donated it to the Linux Foundation. So Presto is a distributed SQL query engine that runs directly on open data lakes, so you can put your data into open formats like Parquet or ORC and get insights directly from that at a very good price-performance ratio. The Presto Foundation, which Girish and I are part of, is a consortium of companies that all want to see Presto continue to get bigger and bigger. Kind of like Kubernetes has an organization called CNCF, Presto has the Presto Foundation, all under the umbrella of the Linux Foundation. And so there are a lot of exciting things coming on the roadmap that make Presto very unique. You know, RaptorX is a multilevel caching system that's been fantastic, Aria optimizations are another area, and we at Ahana have developed some security features, donating the integrations with Apache Ranger. That's the type of thing we do to help the community. But maybe Girish can talk about some of the exciting items on the roadmap that he's looking forward to.
>> Absolutely. I think from Uber's point of view, it's the sheer scale of data and our volume of query traffic. We run about half a million Presto queries a day, right? And we have thousands of machines in our Presto deployments. So at that scale, in addition to functionality, you really want a system that can handle traffic reliably, that can scale, and that is backed by a strong community, which guarantees that if you pull in a new version of Presto, you won't break anything, right? All of those things are very important to us. So I think that's where we are relying on our partners, particularly folks like Facebook and Twitter and Ahana, to build and maintain this ecosystem that gives us those guarantees. That is on the reliability front. But on the roadmap side, we are also excited to see where Presto is extending. So in addition to the projects that Steven talked about, we are also looking at things like Presto on Spark, right? So take Presto SQL and run it as a Spark job, for instance. Or running Presto on real-time analytics applications, something that we built and contributed from the Uber side. So we are all taking it in very different directions, we all have different use cases to support, and that's the exciting thing about the foundation: it allows us all to work together to make Presto a bigger, better, and more flexible engine.
>> You guys mentioned Facebook, and I saw Twitter on the slide as well. Talk to me about some of the organizations that are leveraging the Presto engine and some of the business benefits. Steven, you talked about insights; obviously being able to get insights from data is critical for every business these days.
>> Yeah, a major, major use case is ad hoc and interactive queries, and being able to drive insights from doing so. As I mentioned, there's so much data being generated and stored, and being able to query that data in place with very, very high performance, meaning you can get answers back in seconds, lets you have the interactive ability to drill into data and innovate your business. And this is fantastic because it's been developed at hyperscalers like Uber, and you can pick that open source technology up, download it right from prestodb.io, and start to run with it and join the community. I think from an open source perspective, this project being under the governance of the Linux Foundation gives you the confidence that it's fully transparent, and you'll never see any licensing changes, by the Linux Foundation charter. And therefore the technology remains free forever, without limitations occurring later on that would perhaps favor commercialization by any one vendor. That's not the case. So maybe, Girish, your thoughts on how we've been able to attract industry giants to collaborate and innovate further.
>> Yeah, so one of the interesting things I've seen in this space is that there is a bifurcation of companies in this ecosystem.
There are these large internet-scale companies like Facebook, and Uber, and Twitter, which basically want to use something like Presto for their internal use cases. And then there is a second set of companies, enterprise companies like Ahana, which basically want to take Presto and provide it as a service for other companies to use, as an alternative to things like Snowflake and other systems, right? And the foundation is a great place for both sets of companies to come together and work. The internet-scale companies bring the scale, the reliability, the different kinds of ways in which you can challenge the system, optimize it, and so forth, and then companies like Ahana bring in the flexibility and the extensibility. So you can work with different clouds, different storage formats, different engines, and I think it's a great partnership that we can see happening, primarily through the foundation. You would be hard pressed to find that in a single vendor or, you know, a single-source system on the market today.
>> How long ago was the Presto Foundation initiated?
>> It's been over three years now and it's been going strong. We're over a dozen members, and it's open to everyone. And it's all governed like the Linux Foundation, so we use best practices from that, and you can just check it out at prestodb.io, where you can get the software or hear about how to join the foundation. It includes members like Intel and HPE as well, and we're really excited for new members to come, contribute, and participate.
>> Sounds like you've got good momentum there in the foundation. Steven, talk a little bit about the last two years. Have you seen an acceleration in use cases and in the number of users? We've been in such an interesting environment where real-time insights were essential for every business a couple of years ago just to survive, but now to really thrive. Have you seen that acceleration in Presto in that timeframe?
>> Absolutely, we see an acceleration of being more data-driven, and especially moving to cloud and having more data in the cloud. We think that digital innovation is happening very fast, and Presto is a major enabler of that, again, being able to drive insights from the data. This is not just your typical business data, it's now getting into clickstream data, knowing how customers are operating today. Uber is a great example of all the different types of innovations they can drive, whether it be, you know, knowing in real time what's happening with rides, or offering you a subscription for special deals to use the service more. So, you know, at Ahana we really love Presto, and we provide a SaaS managed service of the open source, provide free trials, and help people get up to speed who may not have the same type of skills as Uber or Facebook. And we work with all companies in that way.
>> Think about consumers these days, we're very demanding, right? I think one of the things that was in short supply during the last two years was patience. If I think of Uber as a great example, when I'm asking for a ride I want to know exactly, in real time, what's coming for me. Where is it now? How many more minutes is it going to take? I mean, that need to fulfill real-time insights is critical across every industry, but have you seen anything in the last couple years that's been more leading edge, like e-commerce or retail for example?
I'm just curious.
>> Girish, you want to take that one?
>> Yeah, sure. So I can speak from the Uber point of view. Real-time insights have really exploded as an area, particularly, as you mentioned, with this just-in-time economy, right? Just to talk about it a little bit from the Uber side: take some of the insights that you mentioned, about when your ride is coming and things of that nature, right? Look at it from the driver's point of view. Or, now that we have Uber Eats, look at it from the restaurant manager's point of view, right? They also want to know how their business is going. How many customer orders are coming in, for instance? What is the conversion rate? And so forth, right? And today these are all insights that are powered by a system which has Presto as a front-end interface at Uber. And these queries run fast: you have tens of thousands of queries every single second, and the queries run in about a second, and so forth. So you are really talking about production systems running on top of Presto, production serving systems. Coming to other use cases like e-commerce, we have definitely seen some of that uptake happen as well. In the broader community, for instance, we have companies like Stripe and other folks who are using a stack which is very similar to ours, based on another open source technology called Pinot, using Presto as an interface. And so we are seeing this whole open data lakehouse move from just being, you know, about interactive analytics to driving all different kinds of analytics, anything to do with data and insights in this space.
>> Yeah, sounds like the evolution has been kind of on a rocket ship the last couple years. Steven, one more time before we're out of time, can you mention that URL where folks can go to learn more?
>> Yeah, prestodb.io, and that's the Presto Foundation. And you know, I just want to say that we'll be sharing the use case at the Startup Showcase coming up with theCUBE. We're excited about that, and we really welcome everyone to join the community. It's a real vibrant, expanding community, and we look forward to seeing you online.
>> Sounds great, guys. Thank you so much for sharing with us what the Presto Foundation is doing, all of the things that it is catalyzing, great stuff. We look forward to hearing that customer use case. Thanks for your time.
>> Thank you.
>> Thanks Lisa, thank you.
>> Thanks everyone.
>> For Steven and Girish, I'm Lisa Martin. You're watching theCUBE, the leader in live tech coverage. (bright music)
Vertica @ Uber Scale
>> Sue: Hi, everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. This breakout session is entitled "Vertica @ Uber Scale." My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Girish Baliga, Engineering Manager of Big Data at Uber. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can also visit the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. And as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded, and you'll be able to view it on demand this week. We'll send you a notification as soon as it's ready. So let's get started. Girish, over to you.
>> Girish: Thanks a lot, Sue. Good afternoon, everyone. Thanks a lot for joining this session. My name is Girish Baliga, and as Sue mentioned, I manage the interactive and real-time analytics teams at Uber. Vertica is one of the main platforms that we support, and Vertica powers a lot of core business use cases. In today's talk, I want to cover two main things. First, how Vertica is powering critical business use cases across a variety of orgs in the company. And second, how we are able to do this at scale and with reliability, using some of the additional functionality and systems that we have built into the Vertica ecosystem at Uber. And towards the end, I also have a little extra bonus for all of you: I will be sharing an easy way for you to take advantage of many of the ideas and solutions that I'm going to present today, which you can apply to your own Vertica deployments in your companies. So stick around, put on your seat belts, and let's start the ride. At Uber, our mission is to ignite opportunity by setting the world in motion. So we are focused on solving mobility problems, and enabling people all over the world to solve their local problems, their local needs, their local issues, in a manner that's efficient, fast, and reliable. As our CEO Dara has said, we want to become the mobile operating system of local cities and communities throughout the world. As of today, Uber is operational in over 10,000 cities around the world. Across our various business lines, we have over 110 million monthly users who use our Rides services, Eats services, and a whole bunch of other services that we provide at Uber. And just to give you a sense of the scale of our daily operations: in the Rides business, we have over 20 million trips per day, and the Eats business is also catching up, particularly during the recent times that we've been having. So I hope these numbers give you a sense of the amount of data that we process each and every day to support our users in their analytical and business reporting needs. So who are these users at Uber? Let's take a quick look. Uber, to describe it very briefly, is a lot like Amazon: we are largely an operations and logistics company, and our employee base reflects that.
So over 70% of our employees work in teams which come under the umbrella of Community Operations and Centers of Excellence. These are all folks working in the various cities and towns that we operate in around the world, running the Uber businesses as somewhat local businesses responding to local needs, local market conditions, local regulation, and so forth. And Vertica is one of the most important tools that these folks use in their day-to-day business activities. They use Vertica to get insights into how their businesses are going, to dig deeply into any issues that they want to triage, to generate reports, to plan for the future, a whole lot of use cases. The second big class of users is in our marketplace team. Marketplace is the engineering team that backs our ride-sharing business. And as part of running this business, a key problem they have to solve is how to determine what prices to set for particular rides, so that we have a good match between supply and demand. Now obviously, the real-time pricing decisions are made by serving systems, with very detailed and well-crafted machine learning models. However, the training data that goes into these models, the historical trends, the insights that go into building these models, a lot of these things are powered by the data that we store and serve out of Vertica. Similarly, in the Eats business, we have use cases spanning all the way from engineering and back-end systems to support operations, incentives, growth, and a whole bunch of other domains. The big class of applications that we support across a lot of these business lines is dashboards and reporting. We have a lot of dashboards, built by core data analyst teams and shared with a whole bunch of our operations and other teams. These are dashboards and reports that run periodically, say once a week or even once a day, depending on the frequency of data that they need. And many of these are powered by the data and the analytics support that we provide on our Vertica platform. Another big category of use cases is growth marketing. This is to understand historical trends, figure out how various business lines, customer segments, and geographical areas are doing in terms of growth, and where it is necessary for us to reinvest or provide some additional incentives or marketing support, and so forth. The analysis that backs a lot of these decisions is powered by queries running on Vertica. And finally, the heart and soul of Uber is data science. Data science is how we provide best-in-class algorithms, pricing, and matching. And a lot of the analysis that goes into figuring out how to build these systems, how to build the models, how to build the various coefficients and parameters that go into making real-time decisions, is based on analysis that data scientists run on Vertica systems. So as you can see, Vertica usage spans a whole bunch of organizations and users, all across the different Uber teams and ecosystems. Just to give you some quick numbers: we have over 5,000 weekly active users, people who run queries at least once a week to solve some critical business problem in their day-to-day operations. So next, let's see how Vertica fits into the Uber data ecosystem. When users open up their apps and request a ride, or order food delivery on the Eats platform, the apps are talking to our serving systems.
And the serving systems use online storage systems to store the data as the trips and Eats orders are getting processed in real time. For this, we primarily use an in-house-built key-value storage system called Schemaless, and an open source system called Cassandra. We also have other systems like MySQL and Redis, which we use for storing various bits of data to support the serving systems. All of these operations generate a lot of data that we then want to process, analyze, and use for our operational improvements. We have ingestion systems that periodically pull in data from our serving systems and land it in our data lake. At Uber, the data lake is powered by Hadoop, with files stored on HDFS clusters. Once the raw data lands on the data lake, we have ETL jobs that process these raw datasets and generate modeled and customized datasets, which we then use for further analysis. Once these modeled datasets are available, we load them into our data warehouse, which is entirely powered by Vertica. Then we have a business intelligence layer, with internal tools like QueryBuilder, which is a UI to write queries and look at results on the front end, and Dashbuilder, which is a dashboard building and report management tool. These are all tools that we have built within Uber, and they can talk to Vertica and run SQL queries to power whatever dashboards and reports they are supporting. So this is what the data ecosystem looks like at Uber. So why Vertica, and what does it really do for us? It powers the insights that we show on dashboards, and it also powers the reports that we run periodically. But more importantly, Vertica provides some core properties and feature sets that allow us to support many of these use cases very well and at scale. So let me take a brief tour of what these are. As I mentioned, Vertica powers Uber's data warehouse. What this means is that we load our core fact and dimension tables onto Vertica. The core fact tables are all the trips, all the Eats orders, and all the other line items for the various businesses at Uber, stored as partitioned tables, so think of having one partition per day. We also have dimension tables like cities, users, riders, driver partners, and so forth. So we have both these kinds of datasets, which we load into Vertica, and we have full historical data, all the way from when we launched these businesses to today. That way, folks can do deeper longitudinal analysis: they can look at patterns like how the business has grown from month to month, year to year, the same month over a year, over multiple years, and so forth. And the really powerful thing about Vertica is that most of these queries, even the deep longitudinal queries, run very, very fast. And that's really why we love Vertica. We see query latency P90s, that is, the 90th percentile of all queries that we run on our platform, typically finish in under a minute. That's very important for us, because Vertica is used primarily for interactive analytics use cases, and providing SQL query execution times under a minute is critical for our users and business owners to get the most out of analytics and Big Data platforms.
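As a minimal sketch of the daily-partitioned layout just described, here is what such a fact table might look like when created and queried through the open-source vertica-python client. The connection details, table name, and columns are illustrative assumptions, not Uber's actual schema.

```python
# A sketch of a daily-partitioned fact table in Vertica, issued through
# the open-source vertica-python client. Hosts, credentials, and the
# trips table/columns are placeholders assumed for illustration.
import vertica_python

conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "...", "database": "warehouse"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # One partition per day lets longitudinal queries prune to only the
    # date range they actually touch.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS trips (
            trip_id   VARCHAR(64),
            city_id   INT,
            trip_date DATE,
            fare_usd  NUMERIC(10, 2)
        )
        PARTITION BY trip_date
    """)
    # A typical month-over-month longitudinal query; the partition
    # predicate lets the engine skip all other days entirely.
    cur.execute("""
        SELECT DATE_TRUNC('month', trip_date) AS month, COUNT(*) AS trips
        FROM trips
        WHERE trip_date >= '2019-01-01'
        GROUP BY 1
        ORDER BY 1
    """)
    print(cur.fetchall())
```

Partition pruning is what keeps the deep longitudinal queries fast: a query over one month never has to scan years of history.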
Vertica also provides a few advanced features that we use very heavily. As you might imagine, at Uber one of the most important sets of use cases is around geospatial analytics. In particular, we have some critical internal dashboards that rely very heavily on being able to restrict datasets by geographic areas, cities, source-destination pairs, heat maps, and so forth, and Vertica has a rich array of functions that we use very heavily for this. We also use Vertica's support for custom projections, and this really helps us get very good performance on critical datasets. For instance, on some of our core fact tables, we have done a lot of query analysis to figure out how users run their queries, what kinds of columns they use, what combinations of columns they use, and what joins they do in typical queries. And then we have laid out our custom projections to maximize performance on those particular dimensions. The ability to do that with Vertica is very valuable for us. We've also had some very successful collaborations with the Vertica engineering team. About a year and a half back, we open-sourced a Python client that we had built in house to talk to Vertica. We were using this Python client in the business intelligence layer that I showed on the previous slide, and we open-sourced it after working closely with the Vertica engineering team. Now Vertica formally supports the Python client as an open-source project, which you can download and integrate into your systems. Another more recent example of collaboration is Vertica Eon mode on GCP. As most, or at least some, of you know, Vertica Eon mode is formally supported on AWS. At Uber, we were also looking to see if we could run our data infrastructure on GCP. So the Vertica team hustled on this and provided us an early preview version, which we've been testing out to see how performance is impacted by running in the cloud, and on GCP. So far things are going pretty well, and we should have some numbers about this very soon. Here I have a visualization of an internal dashboard that is powered solely by data and queries running on Vertica. This GIF cycles through a sequence of different visualizations supported by this tool. For instance, here you see a heat map of sources of trip demand for ride shares, and then a bunch of arrows showing source-destination pairs and the trip lines, so you can see how demand moves around. As it cycles through the various animations, you can basically see all the different kinds of insights and query shapes that we send to Vertica, which powers this critical business dashboard for our operations teams. All right, so how do we do all of this at scale? We started off with a single Vertica cluster a few years back. We had our data lake, and the data would land into Vertica; these are the core fact and dimension tables that I just spoke about. Then Vertica powers queries at our business intelligence layer. So this is a very simple and effective architecture for most use cases. But at Uber scale, we ran into a few problems. The first issue is that Uber is a pretty big company at this point, with a lot of users sending millions of queries every week. And at that scale, what we began to see was that a single cluster was not able to handle all the query traffic. For those of you who have taken an introductory course on queueing theory, you will realize that even though you could have all the queries processed through a single serving system, you will tend to see larger and larger queue wait times as the queries pile up.
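To make the queueing point concrete, here is a purely illustrative back-of-the-envelope M/M/1 model (the 30-second service time is an assumption, not an Uber number): the expected time in the system grows without bound as a single cluster approaches saturation, even though execution time per query stays constant.

```python
# Illustration only: an M/M/1 model of a single query-serving system.
# Expected time in system W = 1 / (service_rate - arrival_rate), so
# queue wait dominates as utilization approaches 100%.
def time_in_system(arrival_rate: float, service_rate: float) -> float:
    assert arrival_rate < service_rate, "queue is unstable at saturation"
    return 1.0 / (service_rate - arrival_rate)

service_rate = 1 / 30.0  # assume the cluster completes one query per 30s
for utilization in (0.5, 0.8, 0.95, 0.99):
    w = time_in_system(utilization * service_rate, service_rate)
    print(f"utilization {utilization:.0%}: ~{w:.0f}s perceived latency")
```

Even with sub-minute execution, perceived latency climbs into the tens of minutes near saturation, which is exactly the effect described next.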
And what this means in practice for end users is that they just see longer and longer query latencies. Even though the actual query execution time on Vertica itself is probably less than a minute, their query sits in the queue for a bunch of minutes, and that's the end-user-perceived latency. So this was a huge problem for us. The second problem was that the cluster becomes a single point of failure. Now, Vertica can handle single-node failures very gracefully, and it can probably also handle two or three node failures, depending on your cluster size and your application. But very soon you will see that once you get beyond a certain number of failures or nodes in maintenance, your cluster will probably need to be restarted, or you will start seeing some downtime due to other issues. Another example of why you would have downtime is when you're upgrading software in your clusters. Essentially, we're a global company, and we have users all around the world; we really cannot afford to have downtime, even for a one-hour slot. So that turned out to be a big problem for us. And as I mentioned, we could have hardware issues. We might need to upgrade our machines, or replace storage or memory due to issues with the hardware, due to normal wear and tear, or due to abnormal issues. And so, because of all of these things, having a single point of failure, having a single cluster, was not really practical for us. So the next thing we did was set up multiple clusters, right? We had a bunch of identical clusters, all of which have the same datasets. We would load data using ingestion pipelines from our data lake onto each of these clusters, and then the business intelligence layer would be able to query any of these clusters. This actually solved most of the issues that I pointed out on the previous slide. We no longer had a single point of failure. Anytime we had to do version upgrades, we would just take one cluster offline and upgrade the software on it. If we had node failures, we would take out one cluster if we had to, or we would just have some spare nodes which would rotate into our production clusters, and so forth. However, having multiple clusters led to a new set of issues. The first problem was that with multiple clusters, you end up with inconsistent schemas. One of the things to understand about our platform is that we are an infrastructure team, so we don't actually own or manage any of the data that is served on the Vertica clusters. We have dataset owners and publishers who manage their own datasets. Now, exposing multiple clusters to these dataset owners turns out not to be a great idea, right? Because they are not really aware of the importance of having consistent schemas and datasets across different clusters. So over time, what we saw was that the schemas for the same tables would get out of sync, because the updates were not consistently applied on all clusters. Or maybe someone was experimenting with some new columns or some new tables in one cluster but forgot to delete them; whatever the case might be, we ended up in a situation where we saw a lot of inconsistent schemas, even across some of our core tables, in our different clusters.
A second issue was that since we had ingestion pipelines ingesting data independently into all these clusters, these pipelines could fail independently as well. What this meant is that if, for instance, the ingestion pipeline into cluster B failed, then the data there would be older than on clusters A and C. So when a query comes in from the BI layer and happens to hit B, you would probably see different results than you would if it went to A or C. This was obviously not an ideal situation for our end users, because they would end up seeing slightly inconsistent, slightly different counts. And that would lead to a bad situation where they would not be able to fully trust the data, the results, and the insights being returned by the SQL queries and Vertica systems. And then the third problem was that we had a lot of extra replication. The 20/80 rule, or maybe even the 90/10 rule, applies to the datasets on our clusters as well: less than 10% of our datasets serve, for instance, 90% of the queries, right? So it doesn't really make sense for us to replicate all of our data on all the clusters, and having a setup where we had to do that was obviously very suboptimal for us. So then what we did was build some additional systems to solve these problems. This brings us to the Vertica ecosystem that we have in production today. On the ingestion side, we built a system called Vertica Data Manager, which manages all the ingestion into the various clusters. At this point, the people who manage datasets, the dataset owners and publishers, no longer have to be aware of individual clusters. They just set up their ingestion pipelines with an endpoint in Vertica Data Manager, and the Vertica Data Manager ensures that all the schemas and data are consistent across all our clusters. And on the query side, we built a proxy layer. What this ensures is that when queries come in from the BI layer, the query is forwarded smartly, with knowledge about which clusters are up, which clusters are down, which clusters are available, which clusters are loaded, and so forth. So with these two layers of abstraction between our ingestion and our queries, we were able to have a very consistent, almost single-system view of our entire Vertica deployment. And the third bit we put in place was the data manifest, which is the communication mechanism between ingestion and proxy. The data manifest is basically a listing of which tables are available on which clusters, which clusters are up to date, and so forth. With this ecosystem in place, we were also able to solve the extra replication problem. So now we have some big clusters where all the tables are served: any query that hits the long tail of less-queried tables goes to the big clusters, while most of the queries, which hit the 10% of heavily queried, important tables, can also be served by many other small clusters, for a much more efficient use of resources. So this is the view that we have today of Vertica within Uber. External to our team, folks just have an endpoint where they set up their ingestion jobs, and another endpoint where they forward their Vertica SQL queries, which go to the proxy layer.
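To make that routing idea concrete, here is a hedged sketch (hypothetical data shapes and function names, not Uber's actual proxy code) of how a manifest lookup plus cluster health and load can drive the decision:

```python
# A hypothetical sketch of manifest-based query routing, assuming the
# manifest maps each table to the clusters currently serving it.
from typing import Dict, Set

def route_query(tables: Set[str],
                manifest: Dict[str, Set[str]],  # table -> serving clusters
                healthy: Set[str],              # clusters currently healthy
                load: Dict[str, int]) -> str:   # outstanding queries per cluster
    candidates = set(healthy)
    for table in tables:
        candidates &= manifest.get(table, set())  # must serve every table
    if not candidates:
        raise RuntimeError("no healthy cluster serves all requested tables")
    # Simple load balancing: send the query to the least-loaded candidate.
    return min(candidates, key=lambda c: load.get(c, 0))

manifest = {"trips": {"big1", "big2"}, "cities": {"big1", "big2", "small1"}}
print(route_query({"trips", "cities"}, manifest,
                  healthy={"big1", "small1"}, load={"big1": 12, "small1": 3}))
# -> "big1": small1 is less loaded but does not serve the trips table
```

So let's get a little more into the details of each of these layers.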
On the data management side, as I mentioned, we have two kinds of tables. First, dimension tables: these are updated every cycle, so the list of cities, the list of drivers, the list of users, and so forth. These change not so frequently, maybe once a day or so, and since these datasets are not very big, we basically swap them out on every single cycle. The fact tables, on the other hand, which have information about our trips or Eats orders and so forth, are partitioned. We have roughly one partition per day for the last couple of years, and then more of a hierarchical partitioning setup for older data. What we do is load the partitions for the last three days on every cycle. The reason we do that is that not all our data comes in at the same time. We have updates for trips going back over the past two or three days, for instance, where people add ratings to their trips, or provide feedback for drivers, and so forth. We want to capture all of these in the row corresponding to that particular trip, so we update the partitions for the last few days to make sure we capture all those updates. We also update older partitions if, for instance, records were deleted for retention purposes, GDPR purposes, or other regulatory reasons. We do this less frequently, but these partitions are also updated if necessary. There are endpoints which allow dataset owners to specify which partitions they want to update. And as I mentioned, data is typically managed using a hierarchical partitioning scheme. In this way, we take advantage of the data being clustered by day, so that we don't have to update all the data at once. When we are recovering from a cluster event, like a version or software upgrade, a hardware fix, failure handling, or even when we are adding a new cluster to the system, the data manager takes care of updating the tables and copying all the new partitions, making sure the schemas are all right. We then validate data and schema consistency, and make sure everything is up to date, before we add the cluster to our serving pool and the proxy starts sending traffic to it. The second thing that the data manager provides is consistency. The main thing we do here is atomic updates of our tables and partitions for fact tables, using a two-phase commit scheme. What we do is load all the new data into temp tables, on all the clusters, in phase one. And then, when all the clusters give us success signals, we promote the temp tables to primary and set them as the main serving tables for incoming queries. We also optimize the load using Vertica Data Copy. What this means is that earlier, in the parallel-pipelines scheme, we had to ingest data individually from HDFS clusters into each of the Vertica clusters, which took a lot of HDFS bandwidth. But using this nice feature that Vertica provides called Vertica Data Copy, we just load the data into one cluster and then much more efficiently copy it to the other clusters. This has significantly reduced our ingestion overheads and sped up our load process. And as I mentioned, as the second phase of the commit, all data is promoted at the same time. Finally, we make sure that all the data is up to date by doing some checks around the number of rows and various other key signals for freshness and correctness, which we compare with the data in the data lake.
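Here is a hedged sketch of that two-phase load-and-promote pattern. The cluster objects and their helper methods below are assumptions for illustration; the Vertica Data Manager's real interface is internal to Uber.

```python
# A hypothetical sketch of two-phase load-and-promote across replicas.
# The cluster objects are assumed to expose copy_from_lake, verify,
# drop, and swap_partition helpers; none of this is VDM's real API.
def ingest_partition(clusters, table: str, partition: str) -> None:
    staged = []
    # Phase 1: load the partition into a staging table on every cluster.
    for cluster in clusters:
        tmp = f"{table}__staging_{partition}"
        cluster.copy_from_lake(tmp, partition)
        staged.append((cluster, tmp))
    # Promote only if every cluster checks out (e.g. row counts match
    # the data lake); otherwise clean up and leave serving tables as-is.
    if not all(cluster.verify(tmp, partition) for cluster, tmp in staged):
        for cluster, tmp in staged:
            cluster.drop(tmp)
        raise RuntimeError("staging failed; serving tables left untouched")
    # Phase 2: swap the staged partition into the serving table on all
    # clusters at once, so readers never observe a half-loaded state.
    for cluster, tmp in staged:
        cluster.swap_partition(table, tmp, partition)
```

The design choice worth noting is that failure in phase one costs nothing visible: queries keep hitting the old, consistent data until every replica has confirmed the staged load.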
In terms of schema changes, VDM automatically applies these consistently across all the clusters. First, we stage the changes to make sure that they are correct; this catches errors where someone is trying to make an incompatible update, like changing a column type or something like that. So we make sure that schema changes are validated, and then we apply them to all clusters atomically, again for consistency, and provide an overall consistent view of our data to all our users. On the proxy side, we have transparent support for replicated clusters for all our users. The way we handle that is, as I mentioned, the cluster-to-table mapping is maintained in the manifest database, and when a query comes in, the proxy is able to see which cluster has all the tables in that query, and route the query to the appropriate cluster based on the manifest information. The proxy is also aware of the health of individual clusters. If for some reason a cluster is down for maintenance or upgrades, the proxy is aware of this, and it also does monitoring based on query response and execution times. It uses this information to route queries to healthy clusters and do some load balancing, to ensure that we avoid hotspots on the various clusters. So the key takeaways I have from this talk are primarily these: we started off with a single-cluster setup on Vertica, and we ran into a bunch of issues around scaling and availability due to cluster downtime. We then set up a bunch of replicated clusters to handle the scaling and availability issues, but ran into issues around schema consistency, data staleness, and data replication. So we built an entire ecosystem around Vertica, with abstraction layers for data management and ingestion, and a proxy. And with this setup, we were able to enforce consistency and improve storage utilization. So hopefully this gives you all a brief idea of how we have been able to scale Vertica usage at Uber, and power some of our most business-critical and important use cases. As I mentioned at the beginning, I have an interesting and simple extra update for you. An easy way for all of you to take advantage of many of the features that we have built into our ecosystem is to use Vertica Eon mode. Vertica Eon mode allows you to set up multiple clusters with consistent data updates, at various different sizes to handle different query loads, and it automatically handles many of the issues that I mentioned in our ecosystem. So do check it out. We've also been trying it out on GCP, and initial results look very, very promising. So thank you all for joining me in this talk today. I hope you learned something new, and hopefully you took away something that you can apply to your own systems. We have a little more time for some questions, so I'll pause for now and take any questions.