Jeff Healey, Vertica at Micro Focus | CUBEConversations, March 2020
>> Narrator: From theCUBE studios in Palo Alto and Boston, connecting with top leaders all around the world, this is theCUBE Conversation. >> Hi everybody, I'm Dave Vellante, and welcome to the Vertica Big Data Conference virtual. This is our digital presentation, wall-to-wall coverage actually, of the Vertica Big Data Conference. And with me is Jeff Healey, who directs product marketing at Vertica. Jeff, good to see you. >> Good to see you, Dave. Thanks for the opportunity to chat. >> You're very welcome. Now, I'm excited about the products that you guys announced, and you're hardcore into product marketing, but we're going to talk about the Vertica Big Data Conference. It's been a while since you guys had this. Obviously, new owner, new company, some changes, but that new company, Micro Focus, has announced that it's investing, I think the number was $70 million, into two areas. One was security and the other, of course, was Vertica. So we're really excited to be back at the virtual Big Data Conference. And let's hear it from you, what are your thoughts? >> Yeah, Dave, thanks. And we love having theCUBE at all of these events. We're thrilled to have the next Vertica Big Data Conference. It was originally a physical event; we're moving it online. We know it's going to be a big hit, because we've been doing this for some time, particularly with two of the webcast series we have every month. One is the Under the Hood webcast series, which is led by our engineers, and the other is what we call the Data Disruptors webcast series, which is led entirely by customers. So we're really confident this is going to be a big hit, and we've seen the registration spike. We just hit 1,000, and we were planning on having about 1,000 at the physical event. It's growing and growing. We're going to see those big numbers, and it's not going to be a one-time thing. We're going to keep the conversation going, and make sure there's plenty of best-practices learning throughout the year. >> We've been at all the big BDCs, and the first ones were really in the heart of the big data movement, a really exciting time, and the interesting thing about this event is it was always sort of customers talking to customers. There weren't a lot of commercials; it was an intimate event. Of course I loved it because it was in our hometown. But I think you're trying to carry that theme obviously into the digital sphere. Maybe you can talk about that a little bit. >> Yeah, Dave, absolutely right. Of course, nothing replaces face to face, but everything that you just mentioned is what makes the Big Data Conference special, and you know, you guys have been there throughout and shown great support in talking to so many customers and leaders and what have you. We're doing the same thing, all right. So we had about 40-plus sessions planned for the physical event. We're going to run half of those, and we're not going to lose anything though, that's the key point. What makes the Vertica Big Data Conference really special is that the only presenters that are allowed to present are either engineers, Vertica engineers or best-practices engineers, and then customers. Customers that actually use the product. There are no sales or marketing pitches or anything like that. And I'll tell you, as far as the customer lineup that we have, we've got five or six already lined up as part of those 20 sessions, customers like Uber, customers like the Trade Desk, customers like Philips talking about predictive maintenance, so the list goes on and on.
You won't want to miss it if you're on the fence, or if you're trying to figure out whether you want to register for this event. Best part about it, it's all free, and there will be a live Q&A chat on every single one of those sessions; we promise we'll answer every question, even if we don't get to it live, as we always do. And if you can't attend live, they'll all be available on demand. So there's no reason not to register and attend, or watch later. >> Thinking about the content over the years, in the early days of the Big Data Conference, of course, Vertica started before the whole big data meme really took off, and then as it took off, plugged right into it. But back then the discussion was a lot of, what do I do with big data, Gartner's three Vs, how do I wrangle it all, what's the best approach, and, you know, Hadoop is really complicated. Of course, Vertica was an alternative to RDBMSs that really couldn't scale or give that type of performance for analytical workloads, so you had your foot in that door. But now the conversation is different, and that's reflected in your theme, "win big with data." Of course, the physical event was going to be at the Encore, which is the new casino in Boston. But my point is, the conversation is no longer about how to wrangle all this data, how to lower the cost of storing this data, how to make it go faster and actually make it work. It's really about how to turn data into insights, transform your organization, and, quote unquote, win big with data. >> That's right. Yeah, that's a great point, Dave. And that's why we chose the title, really, because it's about our customers and what they're able to do with our platform. And we know it's not just one platform, it's all of the ecosystem, all of our incredible partners. It's funny, when I started with the organization about seven years ago, we were closing lots of deals, and I was following up on case studies, and it was like, okay, why did you choose Vertica? Well, the queries went fast. Okay, so what does that mean for your business? We knew we were kind of in the early-adopter stage, and we were disrupting the data warehouse market. Now we're talking to customers whose volumes are growing, growing, and growing, and they really have these analytical use cases, and, again, they can talk to the value the entire organization is gaining from it. That's the difference between now and a few years ago, just like you were saying, when Vertica disrupted the database market, and also the data warehouse market. Now you can speak to our customers and they can tell you exactly what's happening, how it's moving the needle or really advancing the entire organization, regardless of the analytical use case; whether it's Internet of Things around predictive maintenance, or customer behavior analytics, they can speak confidently about it, more than just, hey, our queries went faster. >> You know, I've mentioned before the Micro Focus investment, and I want to drill into that a bit, because the Vertica brand stands alone. It's a Micro Focus company, but Vertica has its own sort of brand awareness. The reason I mention that is because if you go back to the early days of MPP databases, there was a spate of companies, startups, that formed. And many if not all of those got acquired; some lived on, with their codebases going into the cloud, but generally speaking, many of those brands have gone away. Vertica stays.
And so my point is that we've seen Vertica have staying power throughout. I think it's a function of the architecture that Stonebraker originally envisioned; you guys were early to the market, had a lot of good customer traction, and you've been very responsive to a lot of the trends. Colin Mahony will talk about how you adopted and really embraced cloud, for example, and different data formats. And so you've really been able to participate in a lot of the new emerging waves that have come out to the market. And I would imagine some of that's cultural. I wonder if you could just address that in the context of the BDC. >> Oh, yeah, absolutely. You hit on all the key points here, Dave. So, a lot of changes in the industry. We're in the hottest industry, the tech industry, right now. There's lots of competition. But one of the things we'll say in terms of, hey, who do you compete with? You compete with these players in the cloud, open source alternatives, traditional enterprise data warehouses. That's true, right. And one of the things we've stayed true to, and Colin has really kind of led the charge for the organization here, is that we know who we are, right. We're an analytical database platform. And we're constantly working on that one sole source code base, to make sure that we don't provide a bunch of different technologies and databases that need to be stitched together. This platform just has unbelievable universal capabilities, everything from running analytics at scale, to in-database machine learning with a different approach, to all the different types of deployment models that are supported, right. We don't go to companies and say, yeah, we'll take care of all your problems, but you have to stitch together all these different types of technologies. It's all based on that core Vertica engine, and we've expanded it to meet all these market needs. So what Colin believes, and what he tells the team we lead with, is that one core platform that can address all these analytical initiatives. So we know who we are, and we continue to improve on it, regardless of the pivots and the drastic measures that some of the other competitors have taken. >> You know, I've got to ask you, we're in the middle of this global pandemic with coronavirus and COVID-19, and things change daily, by the hour, sometimes by the minute. I mean, every day you wake up to something new. So you see a lot of forecasts, you see a lot of probability models, best case, worst case, likely case, even though nobody really knows what that likely case looks like. So there's a lot of analytics going on, and a lot of data that people are crunching, with new data sources coming in every day. Are you guys participating directly in that, specifically your customers? Are they using your technology? You can't use a traditional data warehouse for this; it's just, you know, too slow, too asynchronous, the process is cumbersome. What are you seeing in the customer base as it relates to this crisis? >> Sure, well, I mean, naturally we have a lot of customers that are healthcare technology companies, companies like Cerner, companies like Philips, right, that are kind of leading the charge here. And of course, our whole motto has always been, don't throw away any of the data, there's value in that data, and you don't have to with Vertica, right. So you've got petabyte-scale types of analytics across many of our customers. Just a few years ago, we called those customers the petabyte club.
Now a majority of our large enterprise customers are approaching those petabyte volumes. So it's important to be able to run those analytics at that scale and that volume. The other thing we've been seeing from some of our partners is really putting that analytics to use with visualizations. So one of the customers that's going to be presenting as part of the Vertica Big Data Conference is Domo. Domo has a really nice demo around being able to track the coronavirus outbreak, and how people are getting care, and things like that, in a visual manner, and you're seeing more of those. Well, Domo embeds Vertica, right, so that's another customer of ours. So think of Vertica as that embedded analytical engine to support those visualizations, so that anyone in the world can track this. And hopefully, as we see over time, cases go down and we overcome this. >> Talk a little bit more about that. Because again, the BDC has always been engineers presenting to audiences. You just mentioned the demo by Domo, and you have a lot of brand names that we've interviewed on theCUBE before, but maybe you could talk a little bit more about some of the customers that are going to be speaking at the virtual event, and what people can expect. >> Sure, yeah, absolutely. So we've got Uber presenting; just a quick fact around Uber. Really, their analytical data warehouse is all Vertica, right, and it works very closely with open source or what have you. Just a quick stat on Uber: 14 million rides per day. What Uber is able to do is connect the riders with the drivers so that they can determine the appropriate pricing. So Uber is going to be a great session that everyone will want to tune in on. Others like the Trade Desk, right, a massive ad tech company: 10 billion ad auctions daily, and it may even be per second or per minute. The amount of scale and analytical volume that they are running queries across can really only be accomplished with a few platforms in the world, and that's Vertica. So that's another hot one, the Trade Desk. Philips is going to be presenting IoT analytical workloads; we're seeing more and more of those, across not only telematics, which you would expect within automotive, but predictive maintenance, which cuts across all the original equipment manufacturers. Philips has got a long history of being able to handle sensor data and apply it to those business cases where you can improve customer satisfaction and lower costs related to services. So around their MRI machines and their predictive maintenance initiative, again, Vertica is kind of that heartbeat, that analytical platform, that's driving those initiatives. So the list goes on and on. Again, the conversation is going to continue with the Data Disruptors and the Under the Hood webcast series. Any customers that weren't able to present, and we had a few that just weren't able to do it, have already signed up for future months. So we're already booked out six months, and you're going to hear more and more customer stories from Vertica.com. >> Awesome, and we're going to be sharing some of those on theCUBE as well. The BDC has always been an intimate event, one of my favorites, with a lot of substance, and I'm sure the online version, the virtual digital version, is going to be the same. Jeff Healey, thanks so much for coming on theCUBE and giving us a little preview of what we can expect at the Vertica BDC 2020. >> You bet. >> Thank you. >> Yeah, Dave, thanks to you and the whole CUBE team.
Appreciate it. >> Alright, and thank you for watching, everybody. Keep it right here for all the coverage of the virtual Big Data Conference 2020. You're watching theCUBE. I'm Dave Vellante, and we'll see you soon.
Vertica Database Designer - Today and Tomorrow
>> Jeff: Hello everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is titled "Vertica Database Designer Today and Tomorrow." I'm Jeff Healey of Vertica Product Marketing, and I'll be your host for this breakout session. Joining me today is Yuanzhe Bei, Senior Technical Manager from Vertica Engineering. But before we begin, (clearing throat) I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation. We'll answer as many questions as we're able to during that time; any questions we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica forums at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double-arrow button at the lower-right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We will send you a notification as soon as it's ready. Now let's get started. Over to you, Yuanzhe. >> Yuanzhe: Thanks, Jeff. Hi everyone, my name is Yuanzhe Bei, I'm a Senior Technical Manager in the Vertica Server R&D group. I run the query optimizer, catalog, and disaggregated engine teams. I'm very glad to be here today to talk about "Vertica Database Designer Today and Tomorrow." This presentation will be organized as follows: I will first refresh some knowledge about Vertica fundamentals, such as tables and projections, which will bring us to the questions "What is Database Designer?" and "Why do we need this tool?" Then I will take you through a deep dive into Database Designer, or as we call it, DBD, and see how DBD's internals work. After that, I'll show you some exciting DBD improvements we have planned for the 10.0 release, and lastly, I will share with you some of the DBD future roadmap we have planned. As most of you should already know, Vertica is built on a columnar architecture. That means data is stored column-wise. Here we can see a very simple example of a table with four columns, and as many of you may also know, a table in Vertica is a virtual concept. It's just a logical representation of data, which means users can write SQL queries that reference the table name and columns, just like in other relational database management systems, but the actual physical storage of data is called a projection. A projection can reference a subset or all of the columns of its anchor table, and must be sorted by at least one column. Each table needs at least one superprojection, which references all the columns of the table. If you load data into a table with no projection, an automatic superprojection will be created, which will be arbitrarily sorted by the first couple of columns in the table. As you can imagine, even though such an auto projection can be used to answer any query, the performance is not optimized in most cases. A common practice in Vertica is to create multiple projections on the same table, containing different sets of columns and sorted in different ways. When a query is sent to the server, the optimizer will pick the projection that can answer the query in the most efficient way.
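To make the table-versus-projection distinction concrete, here is a minimal sketch in Vertica SQL. The table, columns, and projection definition are hypothetical, not from the talk; they simply illustrate a projection whose sort order matches a common query shape.

```sql
-- Hypothetical anchor table.
CREATE TABLE sales (
    order_id  INT,
    region    VARCHAR(32),
    customer  VARCHAR(64),
    amount    FLOAT
);

-- A projection carrying all columns, sorted for queries that
-- filter on region and group by customer. It is populated on the
-- next data load or by refreshing the projection.
CREATE PROJECTION sales_by_region
AS SELECT order_id, region, customer, amount
   FROM sales
   ORDER BY region, customer
   SEGMENTED BY HASH(order_id) ALL NODES;

-- A query of this shape can read sales_by_region and avoid an explicit sort.
SELECT customer, SUM(amount)
FROM   sales
WHERE  region = 'EMEA'
GROUP  BY customer;
```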
Back to the slide: let's say you have a query that selects columns B, D, and C, and sorts by B and D; the third projection shown will be ideal, because the data is already sorted, so you can save the sorting cost while executing the query. Basically, when you choose the design of a projection, you need to consider four things. First and foremost, of course, the sort order. Data already sorted in the right way can benefit quite a lot of query operations, like Order By, Group By, analytics, Merge, Join, predicates, and so on. The selected column group is also important, because the projection must contain all the columns referenced by your workload queries. If the projection is missing even one column, it cannot be used for that particular query. In addition, Vertica is a distributed database and allows a projection to be segmented based on the hash of a set of columns, which is beneficial if the segmentation matches the join keys or group keys. And finally, the encoding of each column is also part of the design, because data sorted in a different way may have a completely different optimal encoding for each column. This example only shows the benefit of the first two, but you can imagine the other two are also important. Even so, it doesn't sound that hard, right? Well, I hope you'll change your mind when you see this; at least I did. These machine-generated queries really beat me. It would probably take an experienced DBA hours to figure out which projections can benefit these queries, not to mention there could be hundreds of such queries in regular workloads in the real world. So what can we do? That's why we need DBD. DBD is a tool integrated into the Vertica server that can help DBAs perform an analysis of their workload queries, table schemas, and data, and then automatically figure out the most optimized projection design for their workload. In addition, DBD is also a sophisticated tool that can be customized by the user, by setting a lot of parameters, objectives, and so on. And lastly, DBD has access to the optimizer, so DBD knows what kind of attributes a projection needs to have in order for the optimizer to benefit from it. DBD has been around for years, and I'm sure there are plenty of materials available online to show you how DBD can be used in different scenarios: whether to achieve a query-optimized or load-optimized design, whether it's a comprehensive design or an incremental design, whether you dump the deployment script and deploy manually later or let DBD do the auto-deployment for you, and many other options. I'm not planning to talk about those today; instead, I will take the opportunity to open up this DBD black box and show you exactly what hides inside. DBD is a complex tool, and I have tried my best to summarize the DBD design process into seven steps: Extract, Permute, Prune, Build, Score, Identify, and Encode. What do they mean? Don't worry, I will show you step by step. The first step is Extract: extract interesting columns. In this step, DBD parses the design queries and figures out the operations that can be benefited by a potential projection design, and extracts the corresponding columns as interesting columns. So predicates, Group By, Order By, join conditions, and analytics are all interesting columns to DBD. As you can see from these three simple sample queries, DBD extracts the interesting column sets on the right. Some of these column sets are unordered.
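Since the slide with the three sample queries isn't visible here, the sketch below reconstructs workload queries of the same shape, with comments showing the interesting column sets DBD would extract. The table and column names (t, a1, b1, a2, b2, d1, x) follow the examples used in the talk; the exact queries are illustrative.

```sql
-- Group By: unordered interesting set {a1, b1}
SELECT a1, b1, COUNT(*) FROM t GROUP BY a1, b1;

-- Order By: ordered interesting set (a2, b2)
SELECT x FROM t ORDER BY a2, b2;

-- Predicate: interesting set {d1}
SELECT x FROM t WHERE d1 = 5;
```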
For example, for the Group By on a1 and b1, DBD extracts the interesting column set and puts it in an unordered set, because data sorted either by a1 first or by b1 first can benefit this Group By operation. Some of the other sets are ordered, and the best example is the Order By clause on a2 and b2; obviously you cannot sort by b2 and then a2. These interesting column sets will be used as building blocks to be extended into the actual projection sort order candidates. The next step is Permute. Once DBD has extracted all the column sets, it will enumerate sort orders using them. How does DBD do that? Let's start with a very simple example. Here, DBD can enumerate two sort orders by extending d1 with the unordered set {a1, b1}, deriving two sort order candidates: (d1, a1, b1) and (d1, b1, a1). These sort orders can benefit queries with a predicate on d1, and also benefit queries with a Group By on a1, b1 when d1 is constant. With the same idea, DBD will try to extend the other sets with each other and populate more sort order permutations. You can imagine there could be many of these candidates, depending on how many queries you have in the design, which can mean hundreds of sort order candidates. That brings us to the third step, which is Prune. This step limits the candidate sort orders so that the design won't run forever. DBD uses a very simple capping mechanism: all the candidates are ranked by length, and only a certain number of the sort orders with the longest length will move forward to the next step. Now we have all the sort order candidates that we want to try, but to know whether a sort order candidate will actually benefit the optimizer, DBD needs to ask the optimizer. So before that happens, the next step has to build those projection candidates in the catalog. This step generates the projection DDLs around the sort orders and creates these projections in the catalog. These projections won't be loaded with real data, because that takes a lot of time; instead, DBD copies over the statistics from existing projections to these projection candidates, so that the optimizer can use them. The next step is Score: scoring with the optimizer. Now that the projection candidates are built in the catalog, DBD can send the workload queries to the optimizer to generate query plans. The optimizer returns the query plan, and DBD goes through the plan and investigates whether certain benefits are being achieved. The benefits list has been growing over time as the optimizer adds more optimizations. Let's say in this case, because the projection candidate is sorted by b1 and a1, it is eligible for the Group By pipeline benefit. Each benefit has a preset score. The overall benefit score across all design queries is then aggregated and recorded for each projection candidate. We are almost there. Now we have the total benefit score for each of the projection candidates we derived from the workload queries. Now the job is easy: you can just pick the sort order with the highest score as the winner. Here we have the winner (d1, b1, a1). Sometimes you need to find more winners, because the chosen winner may only benefit a subset of the workload queries you provided to DBD. So in order for the rest of the queries to also benefit, you need more projections.
In that case, DBD will go into the next iteration, and let's say it finds another winner, (d1, c1), to benefit the workload queries that cannot be benefited by (d1, b1, a1). The number of iterations, and thus the winners produced, really depends on the design objective the user sets. It can be load-optimized, which means only one superprojection winner will be selected, or query-optimized, where DBD tries to create as many projections as needed to cover most of the workload queries, or a balanced objective somewhere in the middle. The last step is to decide the encoding for each projection column of the projection winners. Because the data is sorted differently, the encoding benefits can be very different from those of the existing projections. So choosing the right projection encoding design can reduce the disk footprint by a significant factor, and it's worth the effort to find the best encoding. DBD picks the encoding by actually sampling the data and measuring the storage footprint. For example, in this case, the projection winner has three columns, and say each column has a few encoding options. DBD will write the sample data in the way this projection is sorted, and you can see that with different encodings, the disk footprint is different. DBD then compares the disk footprints of the different options for each column and picks the best encoding option, based on the one that has the smallest storage footprint. Nothing magical here, but it just works pretty well. And that's basically how DBD works internally. Of course, I have simplified quite a lot; for example, I didn't mention how DBD handles segmentation, but the idea is similar to how it analyzes the sort order. I hope this section gave you some basic idea about DBD today. So now let's talk about tomorrow, and here comes the exciting part. In version 10.0, we significantly improved DBD in many ways. In this talk I will highlight four issues in the old DBD and describe how the new DBD in 10.0 addresses those issues. The first issue is that the DBD API is too complex. In most situations, what the user really wants is very simple: my queries were slow yesterday; will a new or different projection help speed them up? However, to answer a simple question like this using DBD, the user will very likely have the documentation open on the side, because they have to go through the whole complex flow: creating a design, running the design, getting the output, and then deploying the design at the end. And that's not all; for each step, there are several functions the user needs to call in order. Adding these up, the user needs to write quite a long script with dozens of function calls; it's just too complicated, and most of you may find it annoying. They either manually tune the projections themselves, or simply live with the performance and come back when it gets really slow again, and of course, in most situations, they never come back to use DBD. In 10.0, Vertica supports a new, simplified API to run DBD easily. There is just one function, designer_single_run, and one argument: the interval during which you think your queries were slow. In this case, the user complained about it yesterday, so all the user needs to do is specify one day as the argument and run it. The user doesn't need to provide anything else, because DBD will look up the query history within that time window and automatically populate the design, run the design, export the projection design, and clean up; no user intervention needed.
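As a rough sketch of the contrast, the classic flow chains several designer functions, while the new 10.0 entry point is a single call. The function names below are the classic DBD meta-functions as I recall them; argument order, the optional parameters, and the exact interval format for designer_single_run are assumptions to verify against your version's documentation.

```sql
-- Classic multi-step flow (abbreviated; the design name and file paths are placeholders).
SELECT DESIGNER_CREATE_DESIGN('my_design');
SELECT DESIGNER_ADD_DESIGN_TABLES('my_design', 'public.*');
SELECT DESIGNER_ADD_DESIGN_QUERIES('my_design', '/tmp/queries.sql', TRUE);
SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY(
           'my_design', '/tmp/design.sql', '/tmp/deploy.sql');
SELECT DESIGNER_DROP_DESIGN('my_design');

-- New 10.0 single call: design against the last day of query history.
SELECT DESIGNER_SINGLE_RUN('1 day');
```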
No need to have the documentation on the side and carefully write and debug a script; just one function call. That's it. Very simple. So that must be pretty impressive, right? Now here comes another issue. To fully utilize this single-run function, users are encouraged to run DBD on the production cluster. However, Vertica used to recommend not running a design on a production cluster. One of the reasons is that DBD takes massive locks, both table locks and catalog locks, which will badly interfere with the running workload on a production cluster. As of 10.0, we eliminated all the table and catalog locks from DBD. Yes, we eliminated 100% of them; a simple improvement, a clear win. The third issue, which users may not be aware of, is that DBD writes intermediate results into real Vertica tables. The reason DBD has to do that is that DBD is a background task, so users need a way to monitor the intermediate results, the progress of DBD, from a concurrent session. For a complex design, the intermediate results can be quite massive, and as a result, many files will be created and written to disk, which stresses both the catalog and the disk and can slow down the design. For Eon Mode, it's even worse, because the tables are shared on communal storage, so writing to regular tables means the data has to be uploaded to communal storage, which is even more expensive and disruptive. In 10.0, we significantly restructured the intermediate results buffer and made it a shared in-memory data structure. Monitoring queries now directly look up this in-memory data structure through a system table and return the results. No intermediate result files are written anymore. Another expensive use of local disk by DBD is the encoding design. As I mentioned earlier in the deep dive, to determine which encoding works best for the new projection design, there's no magic way: DBD needs to actually write the sample data to disk using the different encoding options, find out which ones have the smallest footprint, and pick those as the best choice. This written sample data is useless afterwards and is wiped out right away, and you can imagine this is a huge waste of system resources. In 10.0, we improved this process. Instead of writing the differently encoded data to disk and then reading the file sizes, DBD aggregates the data block sizes on the fly. The data blocks are not written to disk, so the overall encoding design is more efficient and non-disruptive. Of course, this is just the start. The reason we put a significant amount of resources into improving DBD in 10.0 is that the Vertica DBD is an essential component of our out-of-the-box performance design campaign. To simply illustrate the timeline, we are now on the second step, where we significantly reduced the running overhead of DBD, so that users will no longer fear running DBD on their production cluster. Please note that as of 10.0, we haven't really started changing how the DBD design algorithm works, so what we discussed in the deep dive today still holds. For the next phase of DBD, we will make the design process smarter, and this will include a better enumeration mechanism, so that the pruning is more intelligent rather than brute force; that will result in better design quality and also faster designs. The longer-term goal is for DBD to achieve automation.
What does automation entail? What I really mean is that, instead of having the user decide when to use DBD, once their queries are slow, Vertica will detect this event, have DBD run automatically for the user, and suggest a better projection design if the existing projections are not good enough. Of course, there is a lot of work that needs to be done before we can fully achieve that automation, but we are working on it. At the end of the day, what the user really wants is a fast database, right? Thank you for listening to my presentation; I hope you found it useful. Now let's get ready for the Q&A.
Sizing and Configuring Vertica in Eon Mode for Different Use Cases
>> Jeff: Hello everybody, and thank you for joining us today in the virtual Vertica BDC 2020. Today's breakout session is entitled "Sizing and Configuring Vertica in Eon Mode for Different Use Cases." I'm Jeff Healey, and I lead Vertica Marketing. I'll be your host for this breakout session. Joining me are Sumeet Keswani and Shirang Kamat, Vertica Product Technology Engineers and key leads on Vertica customer success. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation; we will answer as many questions as we're able to during that time, and any questions we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica forums at forum.vertica.com and post your question there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double-arrow button in the lower-right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Shirang. >> Shirang: Thanks, Jeff. So, for today's presentation, we have picked Eon Mode concepts; we are going to go over sizing guidelines for Eon Mode and some of the use cases where you can benefit from using Eon Mode. And lastly, we are going to talk about some tips and tricks that can help you configure and manage your cluster. Okay. So, as you know, Vertica has two modes of operation, Eon Mode and Enterprise Mode. The question that you may have is, which mode should I implement? Let's look at what's there in Enterprise Mode. In Enterprise Mode, you have a cluster with general-purpose compute nodes that have locally attached storage. Because of this tight integration of compute and storage, you get fast and reliable performance all the time. Now, the amount of data that you can store in an Enterprise Mode cluster depends on the total disk capacity of the cluster. Enterprise Mode is suitable for both on-premises and cloud deployments. Now, let's look at Eon Mode. To take advantage of cloud economics, Vertica implemented Eon Mode, which is getting very popular among our customers. In Eon Mode, compute and storage are separated by introducing an S3 bucket, or S3-compliant storage. Because of this separation of compute and storage, you get advantages like dynamic scale-out and scale-in, isolation of your workloads, and the ability to load data into your cluster without having to worry about the total disk capacity of your local nodes. Obviously, Eon Mode is well suited for cloud deployments. Some of our customers who take advantage of the features of Eon Mode are also deploying it on premises, by introducing S3-compliant storage. Okay? So, let's look at some of the terminology used in Eon Mode. The four things that I want to talk about are: communal storage, which is shared storage, or an S3-compliant shared storage bucket, that is accessible from all the nodes in your cluster; a shard, which is a segment of data stored on the communal storage; a subscription, which is the binding between nodes and shards; and last, the depot.
The depot is a local copy, or a local cache, that helps improve query performance. So, a shard is a segment of data stored in communal storage. When you create an Eon Mode cluster, you have to specify the shard count. The shard count decides the maximum number of nodes that will participate in your query. Vertica will also introduce a shard called the replica shard, which will hold the data for replicated projections. Subscriptions, as I said before, are the bindings between nodes and shards. Each node subscribes to one or more shards, and a shard has at least two nodes that subscribe to it, for K-safety. Subscribing nodes are responsible for writing to and reading from the shard data. A subscriber node also holds up-to-date metadata for the catalog of files that are present in the shard. So, when you connect to a Vertica node, Vertica will automatically assign a set of nodes and subscriptions that will process your query. There are two important system tables, NODE_SUBSCRIPTIONS and SESSION_SUBSCRIPTIONS, that can help you understand this a little bit more. So let's look at what's on the local disk of your Eon Mode cluster. On local disk, you have the depot. The depot is a local file-system cache that can hold a subset, or a copy, of the data in communal storage. The other things that are there are temp storage, which is used for storing data belonging to temporary tables and the data that spills to disk when you are processing queries, and last, the catalog. The catalog is a persistent copy of the Vertica catalog that is written to disk; the writes happen at every commit. You only need the persistent copy at node startup. There is also a copy of the Vertica catalog stored in communal storage, for durability. The local copy is synced to the copy in communal storage via a service, at an interval of five minutes. So, let's look at the depot. Now, as I said before, the depot is your file-system cache. It helps to reduce network traffic and improve the performance of your queries. We make the assumption that when you load data into Vertica, that's the data you will most frequently query. So, all data loaded into Vertica first enters the depot, and then, as part of the same transaction, is also synced to communal storage for durability. When you run a query against Vertica, your query is also going to look for the files in the depot first, and if the files are not found there, the query will access the files from communal storage. Now, whether new files should first enter the depot or skip the depot can be changed by configuration parameters that let you skip the depot when writing. When the files are not found in the depot, we make the assumption that you may need those files for future runs of your query, which means we will fetch them asynchronously into the depot, so that you have those files for future runs. If that's not the behavior you intend, you can change the read configuration to tell Vertica not to fetch them when you run your query. These configuration parameters can be set at the database level, session level, and query level, and we are also introducing a user-level parameter where you can change this behavior. Because the depot is going to be limited in size compared to the amount of data that you may store in your Eon cluster, at some point in time your depot will be full, or hit its capacity. To make space for new data that is coming in, Vertica will evict some of the files that are least frequently used.
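The talk does not name these configuration parameters, so the sketch below is an assumption: in my experience the relevant knobs are along the lines of UseDepotForReads and UseDepotForWrites, settable at the database or session level. Treat both the parameter names and the exact syntax as things to verify against your version's documentation.

```sql
-- Assumed parameter names; verify before use.
-- Database-wide: do not fetch missing files into the depot on reads.
SELECT SET_CONFIG_PARAMETER('UseDepotForReads', 0);

-- Session-level override: skip the depot on the write path for this session only.
ALTER SESSION SET UseDepotForWrites = 0;
```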
Hence, the depot is going to be your query performance enhancer, and you want to shape the contents of your depot; that is, decide what should be in it. Vertica provides policies, called pinning policies, that can help you pin a table or a partition of a table into the depot, at the subcluster level or at the database level, and Sumeet will talk about this a bit more in his slides later. Now let's look at some of the system tables that can help you understand the size of the depot, what's in your depot, which files were evicted, and which files were recently fetched into the depot. One of the important system tables that I have listed here is DC_FILE_READS. DC_FILE_READS can be used to figure out whether your transaction or query fetched data from the depot or from communal storage. One of the important features of Eon Mode is the subcluster. Vertica lets you divide your cluster into smaller execution groups. Each of the execution groups has a set of nodes that together subscribe to all the shards and can process your queries independently. So when you connect to one node in a subcluster, that node, along with the other nodes in the subcluster, will be the only ones processing your query. Because of that, we can achieve isolation, as well as scale-out and scale-in, without impacting what's happening on the rest of the cluster. The good thing about subclusters is that all the subclusters have access to the communal storage, and because of this, if you load data in one subcluster, it's accessible to the queries that are running in other subclusters. When we introduced subclusters, we knew that our customers would really love these features, and there were some things we had to consider. We knew that our customers would dynamically scale out and in, that they would add and remove lots of subclusters on demand, and we had to provide the ability to add and remove subclusters in a fast and reliable way. We knew that during off-peak hours, our customers would shut down many of their subclusters, meaning more than half of the nodes could be down, and we had to make adjustments to our quorum policy, which requires at least half of the nodes to be up for the database to stay up. We were also aware that customers would add hundreds of nodes to the cluster, which means we had to make adjustments to the catalog and commit policy. To take care of all three of these requirements, we introduced two types of subclusters: primary subclusters and secondary subclusters. The primary subcluster is the one that you get by default when you create your first Eon cluster. The nodes in the primary subcluster are always up; that means they stay up and participate in the quorum. The nodes in the primary subcluster are responsible for processing commits and also maintain a persistent copy of the catalog on disk. This is the subcluster that you would use to process all your ETL jobs, because the Tuple Mover also runs on the nodes in the primary subcluster. If at this point you want another subcluster where you would like to run queries, and also scale that subcluster up and down depending on the demand or the workload, you would create a new subcluster, and this subcluster will be secondary in nature. Now, secondary subclusters have nodes that don't participate in the quorum, so if these nodes are down, there is no impact on Vertica.
These nodes are also not responsible for processing commits, though they maintain up-to-date copies of the catalog in memory; they don't store the catalog on disk. And these are subclusters that you can add and remove very quickly, without impacting what is running on the other subclusters. We have customers running subclusters with hundreds of nodes, and subclusters of sizes like 64 nodes, and they can bring these subclusters up and down, or add and remove them, within a few minutes. So before I go into the sizing of Eon Mode, I just want to say one more thing here. We are working very closely with some of our customers who are running Eon Mode and getting feedback from them on a regular basis. Based on that feedback, we are making lots of improvements and fixes in every hotfix that we put out. So if you are running Eon Mode and want to be part of this group, I suggest that you keep your cluster current with the latest hotfixes and work with us to give us feedback, and get the improvements that you need to be successful. So let's look at what we need to size Eon clusters. Sizing Eon clusters is very different from sizing Enterprise Mode clusters. When you are sizing a Vertica cluster running Enterprise Mode, you need to take into account the amount of data that you want to store and the configuration of your nodes. Based on that, you decide how many nodes you will need, and then start the cluster. In Eon Mode, to size a cluster, you need a few things, such as what your shard count should be. Now, the shard count decides the maximum number of nodes that will participate in your query, and we'll talk about this a little bit more in the next slide. You will decide on the number of nodes that you will need within a subcluster, the instance type you will pick for running each subcluster, how many subclusters you will need, how many of them should be running all the time, and how many should be running in a dynamic mode. When it comes to shard count, you have to pick it up front, and you can't change it once your database is up and running. So you need to pick the shard count based on the number of nodes that you will need to process a query. Now, one thing to remember here is that this is not the amount of data that you have in the database, but the amount of data your queries will process. So, you may have data for six years, but if your queries process the last month of data on most occasions, or if your dashboards are processing up to six weeks, or ten minutes, based on whatever your needs are, you will pick the shard count and the number of nodes based on how much data your queries process. Looking at most of our customers, we think that 12 is a good number that should work for most of them, and that means the maximum number of nodes in a subcluster that will process queries is going to be 12. If you feel that you need more than 12 nodes to process your queries, you can pick other numbers like 24 or 48. If you pick a higher number, like 48, and you go with three nodes in your subcluster, each node subscribes to 16 primary and 16 secondary shard subscriptions, which totals 32 subscriptions per node. That will leave your catalog in a broken state.
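One way to sanity-check a shard-count and node-count combination on a running cluster is to look at the NODE_SUBSCRIPTIONS system table mentioned earlier and confirm subscriptions are spread evenly. A minimal sketch; column names may differ slightly by version.

```sql
-- Subscriptions per node: with 12 shards on 12 nodes (plus the replica shard),
-- you would expect a small, even count on every node.
SELECT node_name, COUNT(*) AS subscription_count
FROM   node_subscriptions
GROUP  BY node_name
ORDER  BY node_name;
```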
So, pick the shard count appropriately and don't pick prime numbers. We suggest 12, which should work for most of our customers; if you think your queries process more than the regular volume, say terabytes of data, then pick a number like 24. Don't pick a prime number. Okay? We are also coming up with features in Vertica, like crunch scaling, that will help you run queries on more nodes than the number of shards that you pick, and that feature will be coming out soon. So if you have picked a smaller shard count, it's not the end of the story. Now, the next thing is, you need to pick how many nodes you need within your subclusters to process your queries. The ideal number would be a node count equal to the shard count, or, if you want to pick a number that is less, pick a node count such that each of the nodes has a balanced distribution of subscriptions. So over here, you have the option of having 12 nodes and 12 shards, or two subclusters with six nodes and 12 shards. Depending on your workload, you can pick either of the two options. The first option, where you have 12 nodes and 12 shards, is more suitable for batch applications, whereas two subclusters with six nodes each are more suitable for dashboard-type applications. Picking subclusters depends on your workload; you can add or remove nodes for workload isolation or elastic throughput scaling. Different subclusters can have nodes of different sizes, but you need to make sure that the nodes within a subcluster are homogeneous. So this is my last slide before I hand over to Sumeet, and I think this is a very important slide that I want you to pay attention to. When you pick an instance type, you are going to pick it based on workload and query budget. I want to make it clear here that we want you to pay attention to the local disk, because you have the depot on your local disk, which is going to be your query performance enhancer for all kinds of deployments, in the cloud as well as on premises. So despite what you may have read or heard, depots still play a very important role in every Eon deployment, and they act as performance enhancers. Most of our customers choose Vertica because they love the performance we offer, and we don't want you to compromise on that performance. So pick nodes with some amount of local disk; at least two terabytes is what we suggest. i3 instances in Amazon come with good local disks that are very helpful, and some of our customers are benefiting from them. With that said, I want to pass it over to Sumeet. >> Sumeet: So, hi everyone, my name is Sumeet Keswani, and I'm a Product Technology Engineer at Vertica. I will be discussing the various use cases that customers deploy in Eon Mode. After that, I will go into some technical details of SQL in Eon Mode, and then I'll blend that into the best practices for Eon Mode. And finally, we'll go through some tips and tricks. So let's get started with the use cases. A very basic use case that users will encounter when they start Eon Mode for the first time is that they will have two subclusters. The first subcluster will be the primary subcluster, used for ETL, as Shirang mentioned. This subcluster will be mostly on, or always on. And there will be another subcluster used purely for queries. This subcluster is the secondary subcluster, and it will be on only some of the time, depending on the use case.
Maybe from nine to five, or Monday to Friday, depending on what application is running on it or what users are doing on it. So this is the most basic use case, something users get started with to get their feet wet. Now, as the use of Eon Mode with subclusters increases, users graduate to the second use case, which is the next level of deployment. In this situation, they still have the primary subcluster used for ETL, typically a larger subcluster where heavier ETL is running, pretty much non-stop. Then they have the usual query subcluster used for queries, but they may add another secondary subcluster for ad hoc workloads. The motivation for this subcluster is to isolate the unpredictable workload from the predictable workload, so as not to impact other workloads. So you may have ad hoc queries, or users that are running larger queries or bad workloads that occur once in a while, run on a different secondary subcluster, so as not to impact the more predictable workload running on the first subcluster. Now, there is no reason why these two subclusters need to have the same instances; they can have a different number of nodes, different instance types, different depot configurations; everything can be different. Another benefit is that they can be metered differently, they can be costed differently, so that the appropriate user or tenant can be billed the cost of compute. Now, as the use increases even further, this is what we see as the final state of a very advanced Eon Mode deployment. As you see, there is the primary subcluster, of course, used for ETL, very heavy ETL, and that's always on. There are numerous secondary subclusters, some for predictable applications that have a very fine-tuned workload that needs a definite level of performance. There are other subclusters that have different usages: some for ad hoc queries, others for demanding tenants, and there could be still more subclusters for different departments, like Finance, that need them maybe at the end of the quarter. So very, very different applications, and this is the full and final promise of Eon, where there is workload isolation, there is different metering, and each app runs in its own compute space. Okay, so let's talk about a very interesting feature in Eon Mode, which we call Hibernate and Revive. So what is Hibernate? Hibernating a Vertica database is the act of dissociating all the compute from the database and shutting it down. At this point, you shut down all compute. You still pay for storage, because your data is in the S3 bucket, but all the compute has been shut down, and you do not pay for compute anymore. If you have reserved instances, or any other instances, you can use them for different applications while your Vertica database is shut down. So this is very similar to stopping a database; in Eon Mode, you're stopping all compute. The benefit, of course, is that you pay nothing anymore for compute. So what is Revive, then? Revive is the opposite of Hibernate, where you now associate compute with your S3 bucket, or your storage, and start up the database. There is one limitation here that you should be aware of: you must revive the database at the same size it had when you hibernated it. So if you have a 12-node primary subcluster when hibernating, you need to provision 12 nodes in order to revive.
So one best practice comes down to this: you should shrink your database to the smallest size possible before you hibernate, so that you can revive it at that same small size and don't have to spin up a ton of compute in order to revive. Basically, when you have decided to hibernate, we ask you to remove all your secondary subclusters and shrink your primary subcluster down to the bare minimum before you hibernate it. The benefit is that when you do revive, you will be able to do so with the minimum number of nodes. And of course, before you hibernate, you must cleanly shut down the database, so that all the data can be synced to S3. Finally, let's talk about backups and replication. Backups and replication are still supported in Eon Mode. We sometimes get the question, "We're on S3, and S3 has nine nines of durability, do we really need a backup?" Yes, we highly recommend backups. You can back up your database to another bucket using the vbr script, and you can also copy the bucket and revive a different instance of your database from it. This is very useful because many times people want staging or development databases that need some of the data from production, and this is a nice way to get that. It also makes sure that if you accidentally delete something, you will be able to get your data back. Okay, so let's go into best practices now. I will start with the depot, which is the biggest performance enhancer that we see for queries. I want to state very clearly that reading from S3, or any remote object store like S3, is slow, because data has to go over the network, and it's expensive, because you pay an access cost: every time you access the data, there is an API access cost levied, and this is where S3 is not so cheap. The depot is a performance-enhancing feature that improves query performance by keeping a local cache of the data that is most frequently used. It also reduces the cost of accessing the data, because you no longer have to go to the remote object store to get it, since it's available on a local volume. Hence depot shaping is a very important aspect of performance tuning in an Eon database. What we ask you to do is this: if you are going to use a specific table or partition frequently, you can choose to pin it in the depot, so that if your depot is under pressure or highly utilized, the objects that are most frequently used are kept there. Depot shaping, therefore, is the act of setting eviction policies: you prevent the eviction of files that you believe you need to keep. For example, you may keep the most recent year's data, or the most recent partition, in the depot, and thereby all queries running on those partitions will be faster. At this time, we allow you to pin any table or partition in the depot, but it is not subcluster-based; future versions of Vertica will allow you to fine-tune the depot for each subcluster.
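As a sketch of the pinning Sumeet describes, Vertica exposes depot pin policies through SQL meta-functions. This is a minimal example, assuming the SET_DEPOT_PIN_POLICY_TABLE and SET_DEPOT_PIN_POLICY_PARTITION functions and the depot_pin_policies system table available in recent versions; the schema, table, and partition key values are hypothetical:

```sql
-- Pin a small, hot dimension table in the depot.
SELECT SET_DEPOT_PIN_POLICY_TABLE('store.store_dimension');

-- Pin only the most recent partition range of a large fact table,
-- e.g. the 2020 partition keys of a table partitioned by year.
SELECT SET_DEPOT_PIN_POLICY_PARTITION('store.store_sales', '2020', '2020');

-- Review or remove policies later.
SELECT * FROM depot_pin_policies;
SELECT CLEAR_DEPOT_PIN_POLICY_TABLE('store.store_dimension');
```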
So, let's now go and understand a little bit of the internals of how a SQL query works in Eon Mode. Once I explain this, we will blend it into the best practices, and it will become much clearer why we recommend certain things. S3 is our layer of durability, where data is persisted in an Eon database. When you run an insert query, like INSERT INTO table VALUES (1), or something similar, the data is written synchronously into S3: a copy of the data is first stored in the local depot and then uploaded to S3, and only then do we hand control back to the client. This ensures that if something bad were to happen, the data will be persistent. The second type of SQL transaction is what we call DDLs, which are catalog operations. For example, you create a table, or you add a column; these operations work with metadata. Now, as you may know, S3 does not offer mutable storage: the storage in S3 is immutable, and you can never append to a file in S3. But transaction logs work by appending: when you modify the metadata, you are actually appending to a transaction log. So this poses an interesting challenge, which we resolve by appending to the transaction log locally in the catalog, and then a service syncs the catalog to S3 every five minutes. This creates an interesting situation: if you were to destroy or delete an instance abruptly, you could lose the commits that happened in the last five minutes. I'll speak to this more in the subsequent slides. Now, finally, let's look at drops and truncates in Eon. A drop or a truncate is really a combination of the first two things we just spoke about. When you drop a table, you are making a metadata change: you are telling Vertica that this table no longer exists, so we append to the transaction log that this table has been removed, and this log, of course, is synced to S3 every five minutes, like we said. There is also the secondary operation of deleting all the files that held the data in this table. Those files are on S3, and we could delete them synchronously, but that would take a lot of time, and we do not want to hold up the client for that duration. So we do not delete the files synchronously; we put the files that need to be removed in a reaper queue and return control to the client. This has the performance benefit that drops appear to occur really fast. It also has a cost benefit: batching deletes in big batches is more performant and less costly. For example, on Amazon you can delete 1,000 files at a time in a single request, so if you batch your deletes, you can delete them very quickly. The disadvantage is that if you were to terminate a Vertica cluster abruptly, you could leak files in S3, because the reaper queue would not have had the chance to delete them. Okay, so after understanding those technical details, let's go into best practices. As I said, reading and writing to S3 is slow and costly, so the first thing you can do is avoid as many round trips to S3 as possible: the bigger the batches of data you load, the better the performance you get per commit. The second thing is, don't read and write from S3 if you can avoid it. A lot of our customers have intermediate data processing in which they transform the data temporarily before finally committing it. There is no reason to use regular tables for this kind of intermediate data; we recommend using local temporary tables, which have the benefit of never having to upload data to S3.
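For that intermediate-data recommendation, here is a minimal sketch of what it looks like in Vertica SQL; the table and column names are just placeholders:

```sql
-- Intermediate results stay local to the session and are never uploaded to S3.
CREATE LOCAL TEMPORARY TABLE stage_clicks (
    user_id  INT,
    click_ts TIMESTAMP,
    url      VARCHAR(2048)
) ON COMMIT PRESERVE ROWS;

-- Transform in place, then write only the final result to a regular table.
INSERT INTO web_clicks_daily
SELECT user_id, click_ts::DATE, COUNT(*)
FROM stage_clicks
GROUP BY user_id, click_ts::DATE;
```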
Finally, there is another optimization you can make. Vertica has the concept of active partitions and inactive partitions. Active partitions are the ones where you have recently loaded data, and Vertica is lazy about merging these partitions into a single ROS container. Inactive partitions are historical partitions, like last year's data, or the year before that; those partitions are aggressively merged into a single container. How do we know how many partitions are active and inactive? That's based on a configuration parameter. If you load into an inactive partition, Vertica is very aggressive about merging those containers, so we download the entire partition, merge in the records that you loaded, and upload it back again. This creates a lot of network traffic, and as I said, accessing data on S3 is slow and costly. So we recommend that you not load into inactive partitions; you should load into the most recent, active partitions, and if you do happen to load into older partitions, set your active partition count accordingly. Okay, let's talk about the reaper queue. Depending on the velocity of your ETL, you can pile up a lot of files that need to be deleted asynchronously. If you were to terminate a Vertica cluster without allowing enough time for these files to get deleted, you could leak files in S3. Of course, if you use local temporary tables this problem does not occur, because the files were never created in S3, but if you are using regular tables, you must allow Vertica enough time to delete these files. You can change the interval at which we delete, and how much time we allow for deletes during shutdown, by editing the configuration parameters that I have mentioned here. Okay, so let's talk a little bit about the catalog at this point. The catalog is synced every five minutes onto S3 for persistence, and the catalog truncation version is the minimum viable version of the catalog to which we can revive. So, for instance, if somebody destroyed the entire Vertica cluster, the catalog truncation version is the minimum viable version you will be able to revive to. In order to make sure that the catalog truncation version is up to date, you must always shut down your Vertica cluster cleanly; this allows the catalog to be synced to S3. Here are some SQL commands you can use to see what the catalog truncation version is on S3. For the most part, you don't have to worry about this if you're shutting down cleanly; this matters only in cases of disaster, or some event where all nodes were terminated without the user's permission. And finally, let's talk about backups one more time: we highly recommend you take backups. S3 is designed for 99.9% availability, so there could be an occasional outage, and making sure you have backups will also help you if you accidentally drop a table. S3 will not protect you against data that was deleted by accident, so having a backup helps you there. And why not back up? Storage is cheap. You can replicate the entire bucket and keep that as a backup, or run DR in a different region, which also serves as a backup. So we highly recommend that you make backups. With this I would like to end my presentation, and we're ready for any questions if you have them. Thank you very much.
UNLIST TILL 4/2 - The Shortest Path to Vertica – Best Practices for Data Warehouse Migration and ETL
Hello everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "The Shortest Path to Vertica: Best Practices for Data Warehouse Migration and ETL." I'm Jeff Healey, I lead Vertica marketing, and I'll be your host for this breakout session. Joining me today are Marco and Mauricio, Vertica product engineers joining us from the EMEA region. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation, and we will answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica forums at forum.vertica.com to post your questions there after the session; our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week; we'll send you a notification as soon as it's ready. Now let's get started, over to you, Marco. >> Marco: Hello everybody, this is Marco speaking, a sales engineer from EMEA, and I'll just get going. This is the agenda: part one will be done by me, part two will be done by Mauricio. The agenda is, as you can see: big bang or piece by piece, the migration of the DDL, migration of the physical data model, migration of ETL and BI functionality, what to do with stored procedures, what to do with any existing user-defined functions, and the migration of the data, which will be covered by Mauricio. Mauricio, do you want to say a few words? >> Mauricio: Yeah, hello everybody, my name is Mauricio Felicia, and I'm a Vertica pre-sales engineer. Like Marco said, I'm going to talk about how to optimize data warehouses using some specific Vertica techniques like table flattening and live aggregate projections. So let me start with a quick overview of the data warehouse migration process we are going to talk about today. Normally we suggest starting by migrating the current data warehouse from the old system with limited or minimal changes in the overall architecture. Clearly we will have to port the DDL and redirect the data access tools to the new platform, but we should minimize the amount of changes in this initial phase in order to go live as soon as possible. In the second phase we can start optimizing the data warehouse, again with no or minimal changes in the architecture as such. During this optimization phase we can create, for example, projections for some specific queries, optimize encoding, or change some of the resource pools; this is something that we do only if and when needed. Finally, and again only if and when needed, we go through an architectural redesign using full Vertica techniques in order to take advantage of all the features we have in Vertica. This is normally an iterative approach, so we go back to some of the specific features before moving back to the architecture and design. We are going through this process in the next few slides. >> Marco: Okay. In order to encourage everyone to keep using their common sense when migrating to a new database management system, because people are often afraid of it, it's often useful to use the analogy of a house move.
In your old home you might have developed solutions for your everyday life that make perfect sense there. For example, if your old Saint Bernard dog can't walk anymore, you might be using a forklift to heave him in through the window of the old home. Well, in the new home, consider the elevator, and don't complain that the window is too small to fit the dog through. It is very much the same with Vertica. But start by making the transition gentle. Again, to remain in my analogy with the house move: picture your new house as your new holiday home. Begin to install everything you miss and everything you like from your old home, and once you have everything you need in your new house, you can shut down the old one. So move bit by bit and go for quick wins to make your audience happy. You do big bang only if they are going to retire the platform you are sitting on, where you're really on a sinking ship; otherwise, identify quick wins, implement and publish them quickly in Vertica, reap the benefits, enjoy the applause, and use the gained reputation for further funding. And if you find that nobody is using the old platform anymore, you can shut it down. If you really have to do a big bang in one go, do it only if you absolutely have to; otherwise migrate by subject area and group all similar areas. Right, having said that, you start off by migrating objects, the objects in the database; that's one of the very first steps. It consists of migrating the containers, the places where you can put the other objects into, that is owners and locations, which are usually schemas. Then you extract tables and views, convert the object definitions, and deploy them to Vertica. And I think you shouldn't do it manually: never type what you can generate, automate whatever you can. For users and roles there is usually a system table in the old database that contains all the roles; you can export those to a file, reformat them, and then you have create-role and create-user scripts that you can apply to Vertica. If LDAP or Active Directory was used for authentication in the old database, Vertica supports anything within the LDAP standard. Catalogs and schemas should be relatively straightforward, with maybe sometimes a difference: Vertica does not restrict you by defining a schema as a collection of all objects owned by a user, but it supports and emulates that for old times' sake. Vertica does not need the catalog, or if you absolutely need the catalog for the old tools that you use, it is always set to the name of the database in the case of Vertica. Having now the schemas, the catalogs, the users and roles in place, move the data definition language of the tables. If you are allowed to, it's best to use a tool that translates the data types in the DDL it generates. You will see a mention of the odb tool, by the way, several times in this presentation; we are very happy to have it. It can actually export the old database table definitions because it works over ODBC: it gets what the old database ODBC driver translates to ODBC, and then it has internal translation tables to several target DBMS flavors, the most important of which is obviously Vertica. If they force you to use something else, there are always tools like SQL*Plus in Oracle, the SHOW TABLE command in Teradata, and so on; each DBMS should have a set of tools to extract the object definitions to be deployed in another instance of the same DBMS. If I talk about views, there is usually a view definition in the old database catalog as well. One thing that might need a bit of special care is synonyms; they get emulated in different ways depending on the specific needs, sometimes with a view on the table to be referred to. But something that is really neat, and that other databases don't have, is the search path. It works very much like the PATH environment variable in Windows or Linux: you specify an object name without the schema name, and it is searched first in the first entry of the search path, then in the second, then in the third, which makes synonyms completely unneeded. When you generate the DDL, to remain in the analogy of moving house: dust off and clean your stuff before placing it in the new house. If you see a table like the one here at the bottom, this is usually the corpse of a bad migration in the past already: an ID is usually an integer and not a floating-point data type, a first name hardly ever has 256 characters, and if a column is called hire_dt, it's not necessarily needed to store the second when somebody was hired. So take good care in dusting off your stuff while you are moving, and use better data types. The same applies especially to strings: how many bytes does a string column contain? The Euro sign is not one byte, it's actually three bytes in UTF-8, in the way that Vertica encodes strings, while ASCII characters take one byte. That means that when you have a single-byte character set as a source, you very often have to pay attention and oversize the columns first, because otherwise data gets rejected or truncated, and only then carefully check what the best size is. The most promising approach is to initially dimension strings in multiples of the original length; odb, with the option you see on the slide, will double the length of what would otherwise be single-byte characters, and multiply accordingly for wide characters in traditional databases. Then load a representative sample of your data and profile it, using the tools that we use ourselves, to find the actual longest values, and then make the columns shorter. Notice that having too long and too big data types has an impact on projection design, and we live and die with our projections. You might remember the rules on how default projections come to exist. The way we do it initially would be, just like for the profiling, to load a representative sample of the data, collect a representative set of already known queries, and run the Vertica Database Designer. You don't have to decide immediately, you can always amend things; otherwise follow the laws of physics: avoid moving data back and forth across nodes, avoid heavy I/O, and if you can, design your projections initially by hand. Encoding matters. You know that the Database Designer is a very tight-fisted thing: it will optimize to use as little space as possible. You have to keep in mind that if you compress very well, you might end up using more time reading the data back. This is a test we ran once using several encoding types, and you see that RLE, run-length encoding, if the data is sorted, is not even visible in the chart, while the others are considerably slower. You can download the slides and look at the details later.
Now, about BI migrations: usually you can expect 80% of everything to be able to be lifted and shifted. You don't need most of the pre-aggregated tables, because we have live aggregate projections; many BI tools have specialized query objects for the dimensions and the facts, and we have the possibility to use flattened tables, which are going to be talked about later, though you might have to write those by hand. You will be able to switch off caching, because Vertica speeds up everything with live aggregate projections, and if you have worked with MOLAP cubes before, you very probably won't need them at all. For ETL tools, what you will have to do is this: if you do row-by-row processing in the old database, consider changing everything to very big transactions, and if you use INSERT statements with parameter markers, consider writing to named pipes and using Vertica's COPY command instead of mass inserts. As to custom functionality, you can see on this slide that Vertica has the biggest number of functions in the database; we compare them regularly, and it's by far more than any other database. You might find that many of the functions you have written won't be needed in the new database, so look at the Vertica catalog instead of trying to migrate a function that you don't need. Stored procedures are very often used in the old database to overcome shortcomings that Vertica doesn't have. Very rarely will you actually have to write a procedure that involves a loop; it's really, in our experience, very rare, and usually you can just switch to standard scripting. And since this is basically repeating what Mauricio said, in the interest of time I will skip ahead and look at this one here: most of the data warehouse migration work should be automated. You can automate DDL migration using odb, which is crucial; data profiling is not crucial but game-changing; the same goes for encoding, which you can automate using our Database Designer; the physical data model optimization in general is game-changing, and you have the Database Designer for it; use the provisioning, use the old platform's tools to generate the SQL; having no objects without their owners is crucial; and as for functions and procedures, they are only crucial if they embody the company's intellectual property, otherwise you can almost always replace them with something else. That's it from me for now. >> Mauricio: Thank you, Marco. So we will now continue our presentation by talking about some of the Vertica optimization techniques that we can implement in order to improve the general efficiency of the data warehouse. Let me start with a few simple messages. The first one is that you are supposed to optimize only if and when this is needed: in most cases, just a lift and shift from the old data warehouse to Vertica will provide the performance you were looking for, or even better, so in that case it's probably not really needed to optimize anything. In case you want to optimize, or you need to optimize, then keep in mind some of the Vertica peculiarities, for example: implement deletes and updates in the Vertica way, use live aggregate projections in order to avoid, or better, limit the GROUP BY executed at query time, use table flattening in order to avoid or limit joins, and then you can also use some Vertica-specific extensions, for example time series analysis or machine learning on top of your data. We will now start by reviewing the first of these bullets: optimize if and when needed. Well, if, when you migrate from the old data warehouse to Vertica without any optimization, the performance level is already okay, then probably your job is done. But if this is not the case, one very easy optimization technique that you can use is to ask Vertica itself to optimize the physical data model, using the Vertica Database Designer. DBD, which is the Vertica Database Designer, has several interfaces; here I'm going to use what we call the DBD programmatic API, so basically SQL functions. With other databases you might need to hire experts to look at your data, your queries, your table definitions, creating indexes or whatever; in Vertica, all you need is to run something like these six simple SQL statements to get a very well optimized physical data model. You see that we start by creating a new design, then we add to the design the tables and the queries that we want to optimize, then we set our target: in this case we are tuning the physical data model in order to maximize query performance, which is why we use the query design objective; other possible choices would be to tune in order to reduce storage, or a mix between storage and query tuning. And finally we ask Vertica to produce and deploy this optimized design. In a matter of literally minutes, what you can get is a fully optimized physical data model, and this is something very, very easy to implement.
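The exact statement sequence lives on the slide rather than in the transcript, so here is a hedged reconstruction of what such a Database Designer API session typically looks like. The design name, schema scope, and file paths are hypothetical, and the parameter lists of these functions vary somewhat between Vertica versions (some accept extra boolean options), so check the documentation for your release before running this:

```sql
-- 1. Create a design, 2. add tables, 3. add the queries to tune,
-- 4. set the objective, 5. populate and deploy, 6. clean up the workspace.
SELECT DESIGNER_CREATE_DESIGN('my_design');
SELECT DESIGNER_ADD_DESIGN_TABLES('my_design', 'public.*');
SELECT DESIGNER_ADD_DESIGN_QUERIES('my_design', '/home/dbadmin/queries.sql', TRUE);
SELECT DESIGNER_SET_OPTIMIZATION_OBJECTIVE('my_design', 'QUERY');
SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('my_design',
       '/home/dbadmin/my_design_projections.sql',
       '/home/dbadmin/my_design_deploy.sql');
SELECT DESIGNER_DROP_DESIGN('my_design');
```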
The second message was to keep in mind some of Vertica's peculiarities, and the first of those is about how data is organized. Vertica is very well tuned for load and query operations, and it writes data into ROS containers on disk; a ROS container is a group of files, and we never, ever change the content of those files once they are written. The fact that ROS container files are never modified is one of the Vertica peculiarities, and this approach lets us use minimal locking: we can run multiple load operations in parallel against the very same table, and assuming we don't have a primary key or unique constraint enforced on the target table, they run fully in parallel because they end up in different ROS containers. A SELECT in read committed isolation requires no locks and can run concurrently with an INSERT...SELECT, because the SELECT works on a snapshot of the catalog taken when the transaction starts; this is what we call snapshot isolation. And recovery, because we never change the ROS files, is very simple and robust. So we get a huge number of advantages from the fact that we never change the content of the ROS containers, but on the other side, deletes and updates require a little attention. What about deletes, first? When you delete in Vertica, you basically create a new object called a delete vector, stored on disk or in memory, and this vector points to the data being deleted, so that when a query is executed, Vertica will simply ignore the rows listed in the delete vector. And it's not just about deletes: an update in Vertica consists of two operations, a delete and an insert, and a merge consists of either an insert or an update, which in turn is a delete plus an insert. So basically, if we tune how deletes work, we will also have tuned updates and merges. So what should we do in order to optimize deletes? Well, remember what we said: every time we delete, we create a new object, a delete vector. So avoid committing deletes and updates too often; this reduces the work for the mergeout and delete-cleanup activities that run afterwards. Also be sure that all the interested projections contain the columns used in the delete predicate; this lets Vertica access the projection directly, without having to go through the super projection in order to build the delete vector, and the delete will be much, much faster. Finally, another very interesting optimization technique is to segregate the update and delete operations from the query workload in order to reduce lock contention, and this can be done using partition operations. This is exactly what I want to talk about now. Here you have a typical data warehouse architecture: we have data arriving in a landing zone, where the data is loaded from the data sources, then we have a transformation layer writing into a staging area that in turn feeds the partitioned blocks of data in the green data structures at the end, the ones used by the data access tools when they run their queries. Sometimes we might need to change old data, for example because we have late-arriving records, or because we want to fix some errors that originated in the source systems. What we do in this case is copy the partition we want to adjust from the green query area at the end back to the staging area, which is a very fast operation. Then we run our updates, or our adjustment procedure, or whatever we need in order to fix the errors in the data, in the staging area, and at the very same time people continue to query the green data structures at the end, so we never have contention between the two operations. When the update in the staging area is complete, all we have to do is run a swap partition between the tables, in order to swap the data that we just finished adjusting in the staging zone into the query area, the green one at the end. This swap partition is very fast, it is an atomic operation, and basically all that happens is that we exchange pointers to the data. This is a very, very effective technique, and a lot of customers use it.
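A sketch of that staging-and-swap pattern in Vertica SQL follows; the table names and partition key values are hypothetical, and it assumes both tables share the same column definitions and partition expression:

```sql
-- Copy the partition to be fixed from the query table into a staging table.
SELECT COPY_PARTITIONS_TO_TABLE('dw.sales', '2020-02', '2020-02', 'dw.sales_staging');

-- Run the corrections against the staging copy while queries keep
-- hitting dw.sales undisturbed.
UPDATE dw.sales_staging SET amount = amount * 1.1 WHERE region = 'EMEA';
COMMIT;

-- Atomically exchange the corrected partition back into the query table.
SELECT SWAP_PARTITIONS_BETWEEN_TABLES('dw.sales_staging', '2020-02', '2020-02', 'dw.sales');
```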
So why flattened tables and live aggregate projections? Well, basically we use flattened tables and live aggregate projections to minimize or avoid joins, which is what flattened tables are for, and GROUP BYs, which is what live aggregate projections are for. Now, compared to traditional data warehouses, Vertica can store, process, aggregate and join orders of magnitude more data, because it is a true columnar database, so joins and GROUP BYs are normally not a problem at all; they run faster than in any traditional data warehouse. But there are still scenarios where the data sets are so big, and we are talking about petabytes of data, and the queries so demanding, that we need something in order to boost GROUP BY and join performance. This is why you can use live aggregate projections to perform aggregations at load time and limit the need for GROUP BY at query time, and flattened tables to combine information from different entities at load time and avoid running joins at query time. Okay, so, live aggregate projections. At this point in time we can define live aggregate projections using four built-in aggregate functions: SUM, MIN, MAX and COUNT. Let's see how this works. Suppose you have a normal table, in this case a table unit_sold with three columns, pid, date_time and quantity, which has been segmented in a given way. On top of this base table, which we call the anchor table, we create a projection, and you see that we create the projection using a SELECT that aggregates the data: we take the pid, the date portion of date_time, and the sum of quantity from the base table, grouping on the first two columns, pid and the date portion of date_time. What happens in this case when we load data into the base table? All we have to do is load data into the base table, and, assuming we are running with K-safety equal to one, we fill the two regular projections with all the detailed data we are loading, so pid, date_time and quantity. But at the very same time, without having to do anything in particular or run any ETL procedure, we also automatically get, in the live aggregate projection, the data pre-aggregated by pid and the date portion of date_time, with the sum of quantity in the column named total_quantity. This is something we get for free, without having to run any specific procedure, and it is very, very efficient. So the key concept is that the load operation, from the DML point of view, is executed against the base table; we do not explicitly aggregate data, and we don't have any procedure doing the aggregation: it is automatic, and Vertica feeds the live aggregate projection every time we load into the base table. You see the two SELECTs we have on this slide, and those two SELECTs produce exactly the same result: running SELECT pid, date, SUM(quantity) against the base table, or running SELECT * from the live aggregate projection, results in exactly the same data. This is of course very useful, but what is much more useful, and we can observe this if we run an EXPLAIN, is this: if we run the SELECT against the base table asking for the grouped data, what happens behind the scenes is that Vertica sees that there is a live aggregate projection with the data already aggregated during the loading phase, and rewrites the query to use the live aggregate projection. This happens automatically. You see, this is a query that ran a GROUP BY against unit_sold, and Vertica decided to rewrite it as something to be executed against the live aggregate projection, because this saves a huge amount of time and effort at query time. And it is not just limited to the information you chose to aggregate; for example, another query like a SELECT COUNT will also take advantage of the live aggregate projection, so GROUP BYs in general benefit. Again, this happens automatically; you don't have to do anything to get this.
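A hedged sketch of what such a projection definition can look like in SQL. It simplifies the slide's example, grouping only on pid (the slide also groups on the date portion of date_time, but live aggregate projections restrict the expressions allowed in their GROUP BY, so you may need an explicit date column for that); table, column, and projection names are illustrative:

```sql
CREATE TABLE unit_sold (
    pid       INT,
    date_time TIMESTAMP,
    quantity  INT
);

-- The aggregation is computed while data is loaded into unit_sold;
-- queries grouping by pid can be rewritten to read this projection.
CREATE PROJECTION unit_sold_agg AS
SELECT pid, SUM(quantity) AS total_quantity
FROM unit_sold
GROUP BY pid;

-- These two return the same result; the optimizer can rewrite the
-- first one to use unit_sold_agg.
SELECT pid, SUM(quantity) FROM unit_sold GROUP BY pid;
SELECT * FROM unit_sold_agg;
```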
One thing that we have to keep very clear in mind is that what we store in the live aggregate projection is partially aggregated data. In this example we have two inserts: the first insert loads four rows, and the second insert loads five rows. For each of these inserts we get a partial aggregation, because Vertica can never know, after the first insert, that a second one is coming; it calculates the aggregation of the data every time you run an insert. This is a key concept, and it also means that you can maximize the effectiveness of this technique by inserting large chunks of data. If you insert data row by row, live aggregate projections are not very useful, because for every row that you insert you will have one aggregation, so the live aggregate projection ends up containing the same number of rows that you have in the base table. But if you insert a large chunk of data every time, the number of aggregated rows in the live aggregate structure is much smaller than the base data. You can see how this works by counting the number of rows in the live aggregate projection: if you run the SELECT COUNT(*) against the live aggregate projection, the query on the left side, you get four rows, but if you EXPLAIN this query you will see that it was actually reading six rows, because each of those two inserts created three rows in the live aggregate projection. So this is the key concept: live aggregate projections keep partially aggregated data, and the final aggregation always happens at runtime. Okay, another feature which is very similar to live aggregate projections is what we call top-K projections. Here we do not actually aggregate anything; we just keep the last rows, or limit the number of rows that we keep, using a LIMIT ... OVER (PARTITION BY ... ORDER BY ...) clause. In this case we create, on top of the base table, two top-K projections, one to keep the last quantity that has been sold and the other to keep the max quantity. In both cases it is just a matter of ordering the data, in the first case by the date_time column, in the second case by quantity, and in both cases we fill the projection with just the last row. And again, this is something that happens automatically when we insert data into the base table. If, after the insert, we run our SELECT against either the max-quantity or the last-quantity projection, we get just the latest values, and you can see that we have far fewer rows in the top-K projections. We said at the beginning that we can use four built-in functions, SUM, MAX, MIN and COUNT. What if I want to create my own specific aggregation on top of this, because our customers sometimes have very specific needs in terms of live aggregate projections? Well, in this case you can code your own live aggregate projection using user-defined functions: you can create a user-defined transform function to implement any sort of complex aggregation while loading data. After you have implemented this UDTF, you can deploy it using a pre-pass approach, which basically means the data is aggregated at loading time, during the data ingestion, or using a batch approach, which means the aggregation runs on top of the data afterwards. Things to remember about live aggregate projections: they are limited to the built-in functions, SUM, MAX, MIN and COUNT, but you can call your own UDTFs, so you can do whatever you want; they can reference only one table; and for Vertica versions before 9.3 it was impossible to update or delete on the anchor table, a limit that has been removed in 9.3, so you can now update and delete data from the anchor table. Live aggregate projections follow the segmentation of the GROUP BY expression, and in some cases the optimizer can decide to pick the live aggregate projection or not, depending on whether the aggregation is convenient. Remember that if we insert and commit every single row to the anchor table, then we end up with a live aggregate projection that contains exactly the same number of rows as the base table, and in that case using the live aggregate projection or the base table would be the same.
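Going back to the top-K projections just mentioned, here is a minimal sketch of the syntax, again on the hypothetical unit_sold table; the projection name is made up:

```sql
-- Keep only the most recent sale per product, maintained at load time.
CREATE PROJECTION unit_sold_last AS
SELECT pid, date_time, quantity
FROM unit_sold
LIMIT 1 OVER (PARTITION BY pid ORDER BY date_time DESC);
```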
So this is the first of the two fantastic techniques that we can implement in Vertica: live aggregate projections are basically there to avoid or limit GROUP BYs. The other one, which we are going to talk about now, is flattened tables, and they are used in order to avoid the need for joins. Remember that Vertica is very fast at running joins, but when we scale up to petabytes of data we may need a boost, and this is what we have in order to get that problem fixed regardless of the amount of data we are dealing with. So, what about flattened tables? Let me start with normalized schemas. Everybody knows what a normalized schema is, no need to explain it on this slide; the main purpose of a normalized schema is to reduce data redundancy, and the fact that we reduce redundancy is a good thing because we obtain fast writes: we only have to write small chunks of data into the right tables. The problem with normalized schemas is that when you run your queries, you have to put together the information that comes from the different tables, and that requires running joins. Again, Vertica is normally very good at running joins, but sometimes the amount of data makes joins hard to deal with, and joins are sometimes not easy to tune. What happens in the normal, let's say traditional, data warehouse is that we denormalize the schemas, either manually or using an ETL tool. So basically we have on one side, on the left in this slide, the normalized schemas, where we get very fast writes, and on the other side the wide table, where all the joins and pre-aggregations have already been run in order to prepare the data for the queries. So we have fast writes on the left and fast reads on the right side of the slide; the problem is in the middle, because we push all the complexity into the middle, into the ETL that has to transform the normalized schema into the wide table. The way we normally implement this in a traditional data warehouse, either manually using procedures or using an ETL tool, is to code an ETL layer that runs an INSERT...SELECT reading from the normalized schema and writing into the wide table at the end, the one used by the data access tools to run the queries. This approach is costly, because someone has to code the ETL; it is slow, because someone has to execute those batches, normally overnight after loading the data, and maybe someone has to check the following morning that everything went okay with the batch; it is resource intensive, and also people intensive, because of the staff that has to code and check the results; it is error prone, because it can fail; and it introduces latency, because there is a gap on the time axis between time t0, when you load the data into the normalized schema, and time t1, when the data is finally ready to be queried. What Vertica does to facilitate this process is to let you create a flattened table. With the flattened table, first, you avoid data redundancy, because you don't need both the wide table and the normalized schema on the left side; second, it is fully automatic: you don't have to do anything, you just insert the data into the wide table, and the ETL that you would have coded is transformed into an INSERT...SELECT by Vertica automatically.
You don't have to do anything else, it's robust, and the latency is zero: as soon as you load the data into the wide table, you get all the joins executed for you. So let's have a look at how it works. In this case we have the table we are going to flatten, and basically we have to focus on two different clauses. You see that there is one column here, the dimension value, which can be defined either as DEFAULT with a SELECT, or as SET USING with a SELECT. The difference between DEFAULT and SET USING is when the data is populated: if we use DEFAULT, the data is populated as soon as we load the data into the base table; if we use SET USING, we have to run a refresh. But everything is there: you don't need an ETL, you don't need to code any transformation, because everything is in the table definition itself, it comes for free, and with DEFAULT the latency is zero, so as soon as you load the other columns you have the dimension values as well. Let's see an example. Here we have a dimension table, the customer dimension, on the left side, and we have a fact table on the right. You see that the fact table uses columns like o_name and o_city, which are basically the result of a SELECT on top of the customer dimension. This is where the join is executed: as soon as we load data into the fact table, directly into the fact table, without of course loading the data that comes from the dimension, all the data from the dimension is populated automatically. So suppose that we are running this INSERT: as you can see, we are inserting directly into the fact table, and we are loading o_id, customer_id and total. We are not loading name, we are not loading city; name and city will be automatically populated by Vertica for you, because of the definition of the flattened table. You see, this is all you need in order to have your wide table, your flattened table, built for you, and it means that at runtime you won't need any join between the base fact table and the customer dimension that we used in order to calculate name and city, because the data is already there. This was using DEFAULT; the other option is using SET USING. The concept is absolutely the same; you see that in this case, on the right side, we have basically replaced o_name DEFAULT with o_name SET USING, and the same is true for city. The concept, as I said, is the same, but in this case, with SET USING, we have to refresh: you see that we run this SELECT REFRESH_COLUMNS with the name of the table, and in this case all columns are refreshed, or you can specify only certain columns, and this brings in the values for name and city, reading from the customer dimension. So this technique is extremely useful. The most important difference between DEFAULT and SET USING, just to summarize: DEFAULT populates your target when you load, SET USING when you refresh. And in some cases you might want to use them both: in this example we define o_name using both DEFAULT and SET USING, and this means that the data is populated either when we load the data into the base table or when we run the refresh. This is a summary of the techniques that we can implement in Vertica in order to make our data warehouse even more efficient.
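A hedged sketch of the flattened-table pattern just described; the table and column names are illustrative rather than the exact ones on the slide:

```sql
-- Dimension and flattened fact table: o_name is filled at load time
-- (DEFAULT), o_city on demand (SET USING).
CREATE TABLE customer_dimension (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(64),
    city        VARCHAR(64)
);

CREATE TABLE fact_orders (
    o_id        INT,
    customer_id INT,
    total       NUMERIC(12,2),
    o_name VARCHAR(64) DEFAULT   (SELECT name FROM customer_dimension c
                                  WHERE c.customer_id = fact_orders.customer_id),
    o_city VARCHAR(64) SET USING (SELECT city FROM customer_dimension c
                                  WHERE c.customer_id = fact_orders.customer_id)
);

-- Load only the fact columns; o_name is resolved automatically at load time.
INSERT INTO fact_orders (o_id, customer_id, total) VALUES (1, 42, 99.90);

-- Populate or refresh the SET USING column when convenient.
SELECT REFRESH_COLUMNS('fact_orders', 'o_city', 'REBUILD');
```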
Well, basically this is the end of our presentation. Thank you for listening, and now we are ready for the Q&A session.
UNLIST TILL 4/2 - A Deep Dive into the Vertica Management Console Enhancements and Roadmap
>> Jeff: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "A Deep Dive "into the Vertica Mangement Console Enhancements and Roadmap." I'm Jeff Healey of Vertica Marketing. I'll be your host for this breakout session. Joining me are Bhavik Gandhi and Natalia Stavisky from Vertica engineering. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer them offline. Alternatively visit Vertica Forums at forum.vertica.com. Post your question there after the session. Our engineering team is planning to join the forums to keep the conversation going well after the event. Also, a reminder that you can maximize the screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to you on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Bhavik. >> Bhavik: All right. So hello, and welcome, everybody doing this presentation of "Deep Dive into the Vertica Management Console Enhancements and Roadmap." Myself, Bhavik, and my team member, Natalia Stavisky, will go over a few useful announcements on Vertica Management Console, discussing a few real scenarios. All right. So today we will go forward with the brief introduction about the Management Console, then we will discuss the benefits of using Management Console by going over a couple of user scenarios for the query taking too long to run and receiving email alerts from Management Console. Then we will go over a few MC features for what we call Eon Mode databases, like provisioning and reviving the Eon Mode databases from MC, managing the subcluster and understanding the Depot. Then we will go over some of the future announcements on MC that we are planning. All right, so let's get started. All right. So, do you want to know about how to provision a new Vertica cluster from MC? How to analyze and understand a database workload by monitoring the queries on the database? How do you balance the resource pools and use alerts and thresholds on MC? So, the Management Console is basically our answer and we'll talk about its capabilities and new announcements in this presentation. So just to give a brief overview of the Management Console, who uses Management Console, it's generally used by IT administrators and DB admins. Management Console can be used to monitor both Eon Mode and Enterprise Mode databases. Why to use Management Console? You can use Management Console for provisioning Vertica databases and cluster. You can manage the already existing Vertica databases and cluster you have, and you can use various tools on Management Console like query execution, Database Designer, Workload Analyzer, and set up alerts and thresholds to get notified by some of your activities on the MC. So let's go over a few benefits of using Management Console. Okay. So using Management Console, you can view and optimize resource pool usage. Management Console helps you to identify some critical conditions on your Vertica cluster. 
Additionally, you can set up various thresholds thresholds in MC and get other data if those thresholds are triggered on the database. So now let's dig into the couple of scenarios. So for the first scenario, we will discuss about queries taking too long and using workload analyzer to possibly help to solve the problem. In the second scenario, we will go over alert email that you received from your Management Console and analyzing the problem and taking required actions to solve the problem. So let's go over the scenario where queries are taking too long to run. So in this example, we have this one query that we are running using the query execution on MC. And for some reason we notice that it's taking about 14.8 seconds seconds to execute this query, which is higher than the expected run time of the query. The query that we are running happens to be the query used by MC during the extended monitoring. Notice that the table name and the schema name which is ds_requests_issued, and, is the schema used for extended monitoring. Now in 10.0 MC we have redesigned the Workload Analyzer and Recommendations feature to show the recommendations and allow you to execute those recommendations. In our example, we have taken the table name and figured the tuning descriptions to see if there are any tuning recommendations related to this table. As we see over here, there are three tuning recommendations available for that table. So now in 10.0 MC, you can select those recommendations and then run them. So let's run the recommendations. All right. So once recommendations are run successfully, you can go and see all the processed recommendations that you have run previously. Over here we see that there are three recommendations that we had selected earlier have successfully processed. Now we take the same query and run it on the query execution on MC and hey, it's running really faster and we see that it takes only 0.3 seconds to run the query and, which is about like 98% decrease in original runtime of the query. So in this example we saw that using a Workload Analyzer tool on MC you can possibly triage and solve issue for your queries which are taking to long to execute. All right. So now let's go over another user scenario where DB admin's received some alert email messages from MC and would like to understand and analyze the problem. So to know more about what's going on on the database and proactively react to the problems, DB admins using the Management Console can create set of thresholds and get alerted about the conditions on the database if the threshold values is reached and then respond to the problem thereafter. Now as a DB admin, I see some email message notifications from MC and upon checking the emails, I see that there are a couple of email alerts received from MC on my email. So one of the messages that I received was for Query Resource Rejections greater than 5, pool, midpool7. And then around the same time, I received another email from the MC for the Failed Queries greater than 5, and in this case I see there are 80 failed queries. So now let's go on the MC and investigate the problem. So before going into the deep investigation about failures, let's review the threshold settings on MC. So as we see, we have set up the thresholds under the database settings page for failed queries in the last 10 minutes greater than 5 and MC should send an email to the individual if the threshold is triggered. 
And also we have a threshold set up for queries and resource rejections in the last five minutes for midpool7 set to greater than 5. There are various other thresholds on this page that you can set if you desire to. Now let's go and triage those email alerts about the failed queries and resource rejections that we had received. To analyze the failed queries, let's take a look at the query statistics page on the database Overview page on MC. Let's take a look at the Resource Pools graph and especially for the failed queries for each resource pools. And over to the right under the failed query section, I see about like, in the last 24 hours, there are about 6,000 failed queries for midpool7. And now I switch to view to see the statistics for each user and on this page I see for User MaryLee on the right hand side there are a high number of failed queries in last 24 hours. And to know more about the failed queries for this user, I can click on the graph for this user and get the reasons behind it. So let's click on the graph and see what's going on. And so clicking on this graph, it takes me to the failed queries view on the Query Monitoring page for database, on Database activities tab. And over here, I see there are a high number of failed queries for this user, MaryLee, with the reasons stated as, exceeding high limit. To drill down more and to know more reasons behind it, I can click on the plus icon on the left hand side for each failed queries to get the failure reason for each node on the database. So let's do that. And clicking the plus icon, I see for the two nodes that are listed, over here it says there are insufficient resources like memory and file handles for midpool7. Now let's go and analyze the midpool7 configurations and activities on it. So to do so, I will go over to the Resource Pool Monitoring view and select midpool7. I see the resource allocations for this resource pool is very low. For example, the max memory is just 1MB and the max concurrency is set to 0. Hmm, that's very odd configuration for this resource pool. Also in the bottom right graph for the resource rejections for midpool7, the graph shows very high values for resource rejection. All right. So since we saw some odd configurations and odd resource allocations for midpool7, I would like to see when this resource, when the settings were changed on the resource pools. So to do this, I can preview the audit logs on, are available on the Management Console. So I can go onto the Vertica Audit Logs and see the logs for the resource pool. So I just (mumbles) for the logs and figuring the logs for midpool7. I see on February 17th, the memory and other attributes for midpool7 were modified. So now let's analyze the resource activity for midpool7 around the time when the configurations were changed. So in our case we are using extended monitoring on MC for this database, so we can go back in time and see the statistics over the larger time range for midpool7. So viewing the activities for midpool7 around February 17th, around the time when these configurations were changed, we see a decrease in resource pool usage. Also, on the bottom right, we see the resource rejections for this midpool7 have an increase, linear increase, after the configurations were changed. I can select a point on the graph to get the more details about the resource rejections. Now to analyze the effects of the modifications on midpool7. Let's go over to the Query Monitoring page. 
Now, to analyze the effects of the modifications on midpool7, let's go over to the Query Monitoring page. All right, I will adjust the time range to around the time when the configurations were changed for midpool7 and look at the completed queries for user MaryLee. And I see there are no completed queries for this user. Now I take a look at the Failed Queries tab, again adjusting the time range to around the time when the configurations were changed. I can do so because we are using extended monitoring. So again, adjusting the time, I can see there are a high number of failed queries for this user, about 10,000 after the configurations were changed on this resource pool. So now let's go and modify the settings, since we know that after the configurations were changed this user was not able to run her queries. You can change the resource pool settings using the Management Console's database settings page, under the Resource Pools tab. Selecting midpool7, I see the same odd configuration for this resource pool that we saw earlier. So now let's go and modify the settings. I will increase the max memory and adjust the other settings for midpool7 so that it has adequate resources to run the queries for the user, and hit Apply at the top right to save the settings. Now let's validate the change after modifying the resource pool attributes. Let's go over to the same Query Monitoring page and see if user MaryLee is able to run her queries on midpool7. We see that after we changed the configuration for midpool7, the user can run her queries successfully, and the count of Completed Queries has increased. Also, viewing the Resource Pool Monitoring page, we can validate that the new configuration for midpool7 has been applied and that resource pool usage has increased since the change. And on the bottom right graph, we can see that the resource rejections for midpool7 have decreased over time after we modified the settings. And since we are using extended monitoring for this database, I can see the trend in the data for this resource pool, the before and after effects of modifying the settings. So initially, when the settings were first changed, there were high resource rejections, and after we modified the settings again, the resource rejections went down.
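The same fix can be made in SQL instead of the MC settings page. This is a rough sketch; the memory and concurrency values are placeholders rather than the exact ones used in the demo.

    -- Give the starved pool a workable memory budget and concurrency limit
    -- (the demo showed it misconfigured at 1MB memory and 0 concurrency)
    ALTER RESOURCE POOL midpool7 MEMORYSIZE '2G' MAXMEMORYSIZE '8G' MAXCONCURRENCY 10;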
Right. So now let's go and work with provisioning and reviving an Eon Mode Vertica database cluster using the Management Console on different platforms. Management Console supports provisioning and reviving of Eon Mode databases in various cloud environments like AWS, Google Cloud Platform, and on Pure Storage. On Google Cloud Platform, you can provision the Vertica Management Console using a launch template; in an AWS environment, you can use the CloudFormation templates available for different OS's. Once you have provisioned the Vertica Management Console, you can provision Vertica clusters and databases from MC itself. To provision a Vertica cluster, you can select the Create new database button available on the homepage. This opens the wizard to create a new database and cluster. In this example we are using Google Cloud Platform, so the wizard will ask me for various authentication parameters for GCP. If you're on AWS, it'll ask you for the authentication parameters for the AWS environment. Going forward in the wizard, it'll ask me to select the instance type. I will select the instance type for the new Vertica cluster, and also provide the communal location URL for my Eon Mode database and all the other preferences related to the new cluster. Once I have selected all the preferences for my new cluster, I can preview the settings and hit Create if everything looks okay. If I hit Create, MC will create new GCP instances, because we are in the GCP environment in this example, it will create a cluster on those instances, and it will create a Vertica Eon Mode database on that cluster. Additionally, you can load test data onto it if you like. Now let's go over reviving an existing Eon Mode database from the communal location. You can do this from the Management Console as well, by selecting the Revive Eon Mode database button on the homepage. This again opens a wizard, this time for reviving the Eon Mode database. Again, since we are using the GCP platform in this example, it will ask me for the Google Cloud Storage authentication attributes. For reviving, it will ask me for the communal location, so I enter the Google Storage bucket and my folder, and it discovers all the Eon Mode databases located under that folder. I can select the database that I would like to revive. It will ask me for other Vertica preferences for this database revival, and once I have entered and reviewed all the preferences, I can hit the Revive the database button on the wizard. After I hit Revive database, it will create the GCP instances; the number of GCP instances created will match the number of hosts in the original Vertica cluster. It will install Vertica on these instances, revive the database, and then start the database. After the database starts, it is imported into MC so you can start monitoring it. So in this example, we saw that you can provision and revive a Vertica database on the GCP platform. Additionally, you can use an AWS environment to provision and revive. So now, since we have the Eon Mode database in MC, Natalia will go over some Eon Mode features in MC, like managing subclusters and Depot activity monitoring. Over to you, Natalia. >> Natalia: Okay, thank you. Hello, my name is Natalia Stavisky. I am also a member of the Vertica Management Console team. I will talk today about the work I did to allow users to manage subclusters using the Management Console, and also the work I did to help users understand what's going on in their Depot in a Vertica Eon Mode database. So let's look at the picture of the subclusters. On the Manage page of the Vertica Management Console, you can see a page with blue tabs, and the tab that's active is Subclusters. You can see that there are two subclusters available in this database. For each of the subclusters, you can see the subcluster properties, including whether it is the primary subcluster or a secondary one. In this case, the primary is the default subcluster; it's indicated by a star. You can see which nodes belong to each subcluster, and you can see the node state and node statistics. You can also easily add a new subcluster, and we're quickly going to do this. Once you click on the button, you'll launch a wizard that takes you through the steps. You'll enter the name of the subcluster and indicate whether this is a secondary or primary subcluster. I should mention that Vertica recommends having only one primary subcluster, but we have both options available here.
You will enter the number of nodes for your subcluster, and once the subcluster has been created, you can manage it. What other options do we have here for managing subclusters? You can scale up an existing subcluster, and that's a similar approach: you launch the wizard and select the nodes you want to add to your existing subcluster. You can scale down a subcluster, and MC validates the requirements for maintaining a minimal number of nodes to prevent a database shutdown. So if you cannot remove any nodes from a subcluster, this option will not be available. You can stop a subcluster, and depending on whether it is a primary or secondary subcluster, this option may or may not be available. In this picture, we can see that for the default subcluster this option is not available, and that's because shutting down the default subcluster would cause the database to shut down as well. You can terminate a subcluster, and again, MC warns you not to terminate the primary subcluster and validates the requirements for maintaining a minimal number of nodes to prevent a database shutdown. So now we are going to talk a little more about how MC helps you understand what's going on in your Depot. The Depot is one of the core pieces of an Eon Mode database. And what are the frequently asked questions about the Depot? Is the Depot size sufficient? Are a subset of users putting a high load on the database? What tables are fetched and evicted repeatedly, which we call "re-fetched," in the Depot? Here on the Depot Activity Monitoring page, we now have four tabs that allow you to answer those questions. We'll go through each of them in a little more detail, but I'll just mention what they are for now. At a Glance shows you the basic Depot configuration and also shows you query execution. Depot Efficiency, we'll talk more about that and the other tabs shortly. Depot Content shows you what tables are currently in your Depot. And Depot Pinning allows you to see what pinning policies have been created and to create new pinning policies. Now let's go through a scenario: monitoring the performance of workloads on one subcluster. As you know, an Eon Mode database allows you to have multiple subclusters, and we'll explore how this feature is useful and how you can use the Management Console to decide whether you would like to have multiple subclusters. So here, in my setup, we have a single subcluster called default_subcluster. It has two users that are running queries accessing tables, mostly in the schema public. The queries started executing, and we can see that after fetching tables from Communal, which is the red line, the rest of the time the queries are executing in the Depot; the green line indicates queries running in the Depot. The Depot across all nodes is about 88% full, there's a steady flow, and the Depot size seems to be sufficient for query execution from the Depot only. That's the good case scenario. Now, at around 17:15, user Sherry got an urgent request to generate a report, and she started running her queries. We can see that the picture is quite different now. The tables Sherry is querying are in a different schema and are much larger. Now we can see multiple lines in different colors. We can see a bunch of fetches and evictions, which are indicated by the blue and purple bars, and a lot of queries are now spilling into Communal; these are the red and orange lines. The orange line indicates a query running partially in the Depot and partially getting fetched from Communal.
And the red line is data fetched from Communal storage. Let's click on one of the lines. Each data point on the line takes you to the Query Details page, where you can see more about what's going on. This is the page that shows us what queries have been run in this particular time interval, which is shown at the top of the page in orange. That's about a one minute time interval, and now we can see user Sherry among the users that are running queries. Sherry's queries involve large tables and are running against a different schema; we can see the clickstream schema in part of the query request. So what is happening is that there is not enough Depot space for both the schema that's already in use and the one Sherry needs. As a result, evictions and fetches have started occurring. What other questions can we ask ourselves to help us understand what's going on? How about: what tables are most frequently re-fetched? For that, we will go to the Depot Efficiency page and look at the middle chart. We can see a larger version of this chart if we expand it. Now we have the 10 tables listed that are most frequently being re-fetched. We can see the clickstream schema and other schemas, so all of those tables are being used in queries and fetched, and then, because there is not enough space in the Depot, they get evicted and re-fetched again. So what can be done to enable all queries to run in the Depot? Option one is to increase the Depot size. We can do this by running queries that take the node, the storage location, and the new Depot size, and I should mention that we can run those queries from the Management Console's query execution page. So that would help us increase the Depot size. What other options do we have, for example when increasing the Depot size is not an option? We can also provision a second subcluster to isolate workloads like Sherry's. So we are going to do this now, and we will provision a second subcluster using the Manage page. Here we're creating a subcluster for Sherry, or for workloads like hers, and we're creating it as a secondary subcluster. So Sherry's subcluster has been created; we can see it here, added to the list of subclusters. It's a secondary subcluster. Sherry has been instructed to use the new SherrySubcluster for her work. Now let's see what happened. We'll go again to the Depot Activity page and look at the At a Glance tab. We can see that around 18:07, Sherry switched to running her queries on SherrySubcluster. At the top of this page, you can see the subcluster selected. So we currently have two subclusters, and I'm looking at what happened to SherrySubcluster once it was provisioned. Sherry started using it, and after the initial fetching from Communal, which is the red line, all of Sherry's queries fit in the Depot, which is indicated by the green line. The Depot is pretty full on those nodes, about 90% full, but the queries are processed efficiently and there is no spilling into Communal. So that's a good case scenario. Let's now go back and take a look at the original subcluster, the default subcluster. On the left portion of the chart we can see multiple lines; that was the activity before Sherry switched to her own designated subcluster. At around 18:07, after Sherry switched to her designated subcluster, she is no longer using the default subcluster and is not putting a load on it. The lines after that turn green, which means the queries still running in the default subcluster are all running in the Depot. We can also see that the Depot fetches and evictions bars, those purple and blue bars, are no longer showing significant numbers. We can also check the second chart, which shows Communal Storage Access, and we can see that those bars have dropped as well, so there is no significant access to Communal storage. So this problem has been solved: each of the subclusters is serving queries from the Depot, and that's our most efficient scenario.
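For reference, both mitigations that came up in this scenario, growing the depot and pinning a hot table (the Depot Pinning tab discussed below manages the latter), can be expressed with Vertica meta-functions. This is a rough sketch: the node name, size, and table name are placeholders, and the exact function names and signatures are worth confirming against your version's documentation.

    -- Option one: grow the depot on a node (node name and new size are placeholders)
    SELECT ALTER_LOCATION_SIZE('depot', 'v_mydb_node0001', '2TB');

    -- Alternative when resizing is not an option: pin a frequently re-fetched table
    SELECT SET_DEPOT_PIN_POLICY_TABLE('clickstream.page_views');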
Let's also look at the other tabs that we have for Depot monitoring. Let's look at the Depot Efficiency tab. It has six charts, and I'll go through each of them quickly. Files Reads by Location gives an indication of where the majority of query execution took place, in the Depot or in Communal. Top 10 Re-Fetches into Depot, as in the chart earlier in our use case, shows the tables that are most frequently fetched, evicted, and then fetched again; these are good candidates to get pinned if increasing the Depot size is not an option. Note that both of these charts have an option to select a time interval using a calendar widget, so you can get information about the activity that happened during that interval. Depot Pinning shows what portion of your Depot is pinned, both by byte count and by table count. And the three tables at the bottom show the Depot structure: how long tables stay in the Depot (we would like tables to be fetched into the Depot and stay there for a long time), how often they are accessed (we would like to see the tables in the Depot accessed frequently), and the size range of tables in the Depot. Depot Content: this tab allows us to search for tables that are currently in the Depot and to see stats like table size in the Depot, how often tables are accessed, and when they were last accessed. The same information that's available for tables in the Depot is also available at the projection and partition level for those tables. Depot Pinning: this tab allows users to see what policies currently exist; you can do this by clicking on the first little button and clicking search, which shows you all the existing policies that have already been created. The second option allows you to search for a table and create a policy. You can also use the action column to modify existing policies or delete them. And the third option provides details about the most frequently re-fetched tables, including fetch count, total access count, and number of re-fetched bytes. So all this information can help you make decisions about pinning specific tables. So that's about it for the Depot. I should mention that the server team also has a very good webinar presentation on Eon Mode database Depot management and subcluster management, and I strongly recommend attending it or downloading the slide presentation. Let's talk quickly about the Management Console roadmap, what we are planning to do in the future. We are going to continue focusing on subcluster management; there is still a lot we can do here: promoting and demoting subclusters, load balancing across subclusters, scheduling subcluster actions, and support for large cluster mode. We'll continue working on Workload Analyzer enhancements and recommendations, on backup and restore from the MC, on building custom thresholds, and on Eon on HDFS support. Okay, so we are ready now to take any questions you may have. Thank you.
Migrating Your Vertica Cluster to the Cloud
>> Jeff: Hello everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's break-out session is titled "Migrating Your Vertica Cluster to the Cloud." I'm Jeff Healey, and I'm in Vertica marketing. I'll be your host for this break-out session. Joining me here are Sumeet Keswani and Chris Daly, Vertica product technology engineers and key members of our customer success team. Before we begin, I encourage you to submit questions and comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Sumeet. >> Sumeet: Thank you, Jeff. Hello everyone, my name is Sumeet Keswani, and I will be talking about planning to deploy or migrate your Vertica cluster to the Cloud. You may be moving an on-prem cluster or setting up a new cluster in the Cloud, and there are several design and operational considerations that will come into play. Some of these are cost, which industry you are in, and which expertise you have in which Cloud platform; there may be a personal preference too. After that, there will be some operational considerations: VM and cluster sizing, which Vertica mode you want to deploy, Eon or Enterprise, which depends on your use cases, what DevOps skills are available, what elasticity and workload separation you need, what your backup and DR strategy is, and what you want in terms of high availability. You will also have to think about how much data you have and where it's going to live. And in order to understand the cost and benefit of a deployment, you will have to understand the access patterns and how you are moving data to and from the Cloud. So, things to consider before you move a Vertica deployment to the Cloud. One thing to keep in mind is that virtual CPUs, or vCPUs, in the Cloud are not the same as the CPUs you've been familiar with in your data center; a vCPU is effectively half of a physical core because of hyperthreading. There is definitely a noisy neighbor effect: depending on what else is hosted in the Cloud environment, you may occasionally see performance issues. There are I/O limitations on the instance that you provision, which really means you can't always scale up; you may have to scale out, basically adding more instances rather than getting bigger or right-sized instances. Finally, there is an important distinction here: virtualization is not free. There can be significant overhead to virtualization, as much as 30%, so when you size and scale your clusters, you must keep that in mind.
Now the other important aspect is where you put the Vertica cluster. The choice of region matters: how far it is from your various office locations, and where the data will live with respect to the cluster. And remember, popular locations can fill up, so if you want to scale out, additional capacity may or may not be available. These are things you have to keep in mind when choosing your Cloud platform and your deployment. At this point, I want to make a plug for Eon mode. Eon mode is the latest mode, the Cloud mode, from Vertica. It has been designed with Cloud economics in mind. It uses shared storage, which is durable, available, and very cheap, like S3 storage or Google Cloud storage. It has been designed for quick scaling, like scale-out, and highly elastic deployments. It has also been designed for high workload isolation, where each application or user group can be isolated from the others, so they can be paid for and monitored separately, without affecting each other. But there are some disadvantages, or perhaps a cost, to using Eon mode. Access to S3 storage is neither cheap nor efficient: there is high I/O latency when accessing data from S3, and there are API and data access costs associated with accessing your data in S3. Vertica in Eon mode has a pay-as-you-go model, which works for some people and does not work for others, so it is important to keep that in mind. And performance can be a little bit variable here, because it depends on the depot, which is a local cache, and it is not as predictable as EE mode, so that's another trade-off. Let's spend about a minute and see what a Vertica cluster in Eon mode looks like. A Vertica cluster in Eon mode has S3 as the durability layer, where all the data sits. There are subclusters, which are essentially just aggregation groups of separated compute that service different workloads. In this example, you may have two subclusters, one servicing an ETL workload and the other one servicing (mic interference obscures speaking). These subclusters are isolated and do not affect each other's performance. This allows you to scale them independently and isolate workloads. So this is the new Vertica Eon mode, which has been specifically designed by us for use in the Cloud. Beyond this, you can use EE mode or Eon mode in the Cloud; it really depends on what your use case is. Both of these are possible, and we highly recommend Eon mode wherever possible. Okay, let's talk a little bit about what we mean by Vertica support in the Cloud. As you know, a Cloud is a shared data center. Performance in the Cloud can vary: between regions, availability zones, time of day, choice of instance type, what concurrency you use, and of course the noisy neighbor effect. We at Vertica performance, load, and stress test our product before every release. We have a bunch of use cases, we go through all of them, make sure that we haven't regressed any performance, and make sure that it works up to standards and gives you the high performance that you've come to expect. However, your solution or your workload is unique to you, and it is still your responsibility to make sure that it is tuned appropriately. To do this, one of the easiest things you can do is pick a tested operating system and allocate virtual machines with enough resources.
It's something that we recommend, because we have tested it thoroughly, and it goes a long way in giving you predictability. After this, I would like to go through the various Cloud platforms that Vertica works on. I'll start with AWS, and my colleague Chris will speak about Azure and GCP and our thoughts going forward. So without further ado, let's start with the Amazon Web Services platform. As you are probably all aware, Amazon Web Services is the market leader in this space, and indeed really our biggest provider by far, and we have been there for a very long time. Vertica has a deep integration with Amazon Web Services. We provide a marketplace offering with both a pay-as-you-go and a bring-your-own-license model. We have many knowledge base articles, best practices, scripts, and resources that help you configure and use a Vertica database in the Cloud. We have had customers in the Cloud for many, many years now, and we have managed and console-based point-and-click deployments for ease of use in the Cloud. So Vertica has a deep integration in the Amazon space and has been there for quite a while now, and we have accumulated a lot of experience here. So let's talk about sizing on AWS. Sizing on any platform comes down to four or five things: picking the right instance type, picking the right disk volume and type, tuning and optimizing your networking, and finally some operational concerns like security, maintainability, and backup. Let's go into each one of these in the AWS ecosystem. The choice of instance type is one of the important choices you will make. In Eon mode, you don't really need persistent disk; you should probably choose ephemeral disk because it gives you extra speed with the instance type. We highly recommend the i3.4x instance types, which are very economical and have a big, 4 terabyte depot, or cache, per node. The i3.metal is similar to the i3.4x but has significantly better performance, for those subclusters that need that extra oomph. The i3.2 is good for scaling out small ad hoc clusters; it has a smaller cache and lower performance, but is cheap enough to use very indiscriminately. If you are in EE mode, we don't use S3 as the layer of durability; your local volumes are where we persist the data. Hence you do need EBS volumes in EE mode, and in order to make sure the deployment is manageable, you might have to use some sort of software RAID array over the EBS volumes. The most common instance types you see in EE mode are the r4.4x, the c4, and the m4 instance types. And of course, for temp space and the depot we always recommend instance volumes; they're just much faster. Okay. So let's talk about optimizing, or tuning, your network. The best thing you can do about tuning your network, especially in Eon mode but in other modes too, is to get a VPC S3 endpoint. This is essentially a route table entry that makes sure all traffic between your cluster and S3 goes over the internal fabric. This makes it much faster and you don't pay egress cost, especially if you're doing external tables or using communal storage, but you do need to create it. Many times people forget to do it, so you really do have to create it. And best of all, it's free.
It doesn't cost you anything extra. You just have to create it during cluster creation time, and there's a significant performance difference when using it. The next thing about tuning your network is sizing it correctly. Pick the geographical region closest to where you'll consume the data, and pick the right availability zone. We highly recommend using cluster placement groups; in fact, they are required for the stability of the cluster. A cluster placement group essentially operates on the notion of a rack: nodes in a cluster placement group are physically closer to each other than they would otherwise be, and this allows a 10 Gbps, bidirectional TCP/IP flow between the nodes, which makes sure you get a high amount of throughput. As you are probably all aware, the Cloud does not support UDP broadcast, hence you must use point-to-point UDP for spread in AWS. Beyond that, point-to-point UDP does not scale very well beyond 20 nodes, so as your cluster sizes increase, you must switch over to large cluster mode. And finally, use instances with enhanced networking or SR-IOV support. Again, it's free; it comes with the choice of instance type and operating system. We highly recommend it, and it makes a big difference in how your workload will perform. Let's talk a little bit about security, configuration, and orchestration. As I said, we provide CloudFormation scripts for ease of deployment, and you can use the MC's point and click. With regard to security, Vertica supports instance profiles out of the box in Amazon, and we recommend you use them; this is highly desirable so that you're not passing access keys and secret keys around. If you use our marketplace image, we have picked the latest operating systems and patched them, and Amazon validates everything on the marketplace and scans it for security vulnerabilities, so you get that for free. We do some basic configuration: we disable root ssh access, we disallow any password access, we turn on encryption, and we run a basic set of security checks to make sure that the image is secure. Of course, it could be made more secure, but we try to balance security, performance, and convenience. And finally, let's talk about backups. Especially in Eon mode I get the question, "Do we really need to back up our system, since the data is in S3?" And the answer is yes, you do. S3 is not going to protect you against an accidental drop table; S3 has a finite amount of reliability, durability, and availability; and you may want to be able to restore data differently. Also, backups are important if you're doing DR or if you have an additional cluster in a different region; the other cluster can be considered a backup. And finally, why not create a backup or a disaster recovery cluster, since storage is cheap in the Cloud. So we highly recommend you use it. With this, I would like to hand it over to my colleague Christopher Daly, who will talk about the other two platforms that we support, Google and Azure. Over to you, Chris, thank you. >> Chris: Thanks, Sumeet, and hi everyone. So while there's no argument that we here at Vertica have a long history of running within the Amazon Web Services space, there are other alternative Cloud service providers where we do have a presence, such as Google Cloud Platform, or GCP.
For those of you who are unfamiliar with GCP, it's considered the third-largest Cloud service provider in the marketplace, and it's priced very competitively with its peers. It has a lot of similarities to AWS in the products and services that it offers, but it tends to be the go-to place for newer businesses or startups. We officially started supporting GCP a little over a year ago with our first entry into the GCP marketplace, a solution that deployed a fully functional and ready-to-use Enterprise mode cluster. We followed up on that with the release and support of Google storage buckets, and now I'm extremely pleased to announce that with the launch of Vertica 10, we're officially supporting the Eon mode architecture in GCP as well. But that's not all, as we're adding additional offerings to the GCP marketplace. With the launch of version 10 we'll be introducing a second listing in the marketplace that allows for the deployment of an Eon mode cluster, all driven by our own Management Console. This will allow customers to quickly spin up Eon-based clusters within the GCP space. And if that wasn't enough, I'm also pleased to tell you that very soon after the launch we're going to be offering Vertica by the hour in GCP as well. And while we've done a lot to automate the solutions coming out of the marketplace, we recognize the simple fact that for a lot of you, building your cluster manually is really the only option. So with that in mind, let's talk about the things you need to understand in GCP to get that done. You might think this slide looks familiar. It's not an erroneous duplicate slide from Sumeet's AWS section; it's merely an acknowledgement of all the things you need to consider for running Vertica in the Cloud. In Vertica, the choice of operational mode will dictate some of the choices you'll need to make in the infrastructure, particularly around storage. Just like with on-prem solutions, you'll need to understand the disk and networking capacities to get the most out of your cluster. And one of the most attractive things in GCP is the pricing, as it tends to run a little less than the others, but that does translate into fewer choices and options within the environment. If nothing else, I want you to take one thing away from this slide, and Sumeet said this about AWS earlier: VMs running in the GCP space run on top of hardware that has hyperthreading enabled, and a vCPU doesn't equate to a core, but rather to a processing thread. This becomes particularly important if you're moving from an on-prem environment into the Cloud, because a physical Vertica node with 32 cores is not the same thing as a VM with 32 vCPUs; in fact, with 32 vCPUs, you're only getting about 16 cores' worth of performance. GCP does offer a handful of VM types, which they categorize by letter, but for us, most of these don't make great choices for Vertica nodes. The M series, however, does offer a good core-to-memory ratio, especially when you're looking at the high-mem variants. Also keep in mind that I/O performance, such as network and disk, is partially dependent on the VM size, so customers in the GCP space should be focusing on 16 vCPU VMs and above for their Vertica nodes. Disk options in GCP can be broken down into two basic types: persistent disks and local disks, which are ephemeral. Persistent disks come in two forms, standard or SSD.
For Vertica in Eon mode, we recommend that customers use persistent SSD disks for the catalog, and either local SSD disks or persistent SSD disks for the depot and the temp space. A couple of things to think about here, though. Persistent disks are provisioned as a single device with a settable size. Local disks are provisioned as multiple disk devices with a fixed size, requiring you to use some kind of software RAID to create a single storage device. So while local SSD disks provide much more throughput, you're using CPU resources to maintain that RAID set, so it's a little bit of a trade-off. Persistent disks offer redundancy, either within the zone where they exist or within the region, and if you're selecting regional redundancy, the disks are replicated across multiple zones in the region. This does have an effect on the performance of the VM, so we don't recommend it. What we do recommend is zonal redundancy when you're using persistent disks, as it gives you that redundancy level without actually affecting performance. Remember also that in the Cloud space all I/O is network I/O, as disks are basically block storage devices; this means that disk actions can and will slow down network traffic. And finally, storage bucket access in GCP is based on GCP's interoperability mode, which means that it's basically compliant with the AWS S3 API. In interoperability mode, access to the bucket is granted by a key pair that GCP refers to as HMAC keys. HMAC keys can be generated for individual users or for service accounts. We recommend that when you're creating HMAC keys, you choose a service account to ensure that the keys are not tied to a single employee. When thinking about storage for Enterprise mode, things change a little bit. We still recommend persistent SSD disks over standard ones; however, the use of local SSD disks for anything other than temp space is highly discouraged. I said it before: local SSD disks are ephemeral, meaning that the data is lost if the machine is turned off or goes down, so not really a place you want to store your data. In GCP, multiple persistent disks placed into a software RAID set do not create more throughput like you can find in other Clouds; the I/O saturation usually hits the VM limit long before it hits the disk limit. In fact, the performance of a persistent disk is determined not just by the size of the disk but also by the size of the VM. So a good rule of thumb in GCP to maximize your I/O throughput for persistent disks: the useful size tends to max out at two terabytes for SSDs and 10 terabytes for standard disks. Network performance in GCP can be thought of in two distinct ways: there's node-to-node traffic, and then there's egress traffic. Node-to-node performance in GCP is really good within the zone, with typical traffic between nodes falling in the 10-15 gigabits per second range. This might vary a little from zone to zone and region to region, but usually it's only limited by the existing traffic where the VMs exist, so kind of a noisy neighbor effect. Egress traffic from a VM, however, is subject to throughput caps, and these are based on the size of the VM. The speed is set by the number of vCPUs in the VM at two gigabits per second per vCPU, and tops out at 32 gigabits per second. So the larger the VM, the more vCPUs you get, and the larger the cap.
So, some things to consider in the networking space for your Vertica cluster: pick a region that's physically close to you, even if you're connecting to the GCP network from a corporate LAN as opposed to the internet. The further the packets have to travel, the longer it's going to take. Also, GCP, like most Clouds, doesn't support UDP broadcast traffic on its virtual network, so you do have to use the point-to-point flag for spread when you're creating your cluster. And since the network cap on VMs is set at 32 gigabits per second per VM, maximize your network egress throughput and don't use VMs smaller than 16 vCPUs for your Vertica nodes. And that gets us to the one question I get asked most often: how do I get my data into and out of the Cloud? Well, GCP offers many different methods, at different speeds and different price points, for data ingress and egress. There's the obvious one, across the internet, either directly to the VMs or into the storage bucket, and you can light up a VPN tunnel to encrypt all that traffic. Additionally, GCP offers direct network interconnects from your corporate network; these are provided either by Google or by a partner, and they vary in speed. They also offer things called direct or carrier peering, which connect the edges of the networks between your network and GCP, and you can use a CDN interconnect, which creates an on-demand connection from your network to the GCP network, provided by a large host of CDN service providers. So GCP offers a lot of ways to move your data into and out of the GCP Cloud; it's really a matter of what price point works for you and what technology your corporation is looking to use. So we've talked about AWS, we've talked about GCP, which really only leaves one more Cloud. Last, and by far not the least, there's the Microsoft Azure environment. Holding on strong to the number two place among the major Cloud providers, Azure offers a very robust Cloud offering that's attractive to customers who already consume services from Microsoft. But what you need to keep in mind is that the underlying foundation of their Cloud is based on the Microsoft Windows products, and this makes their Cloud offering a little bit different in the services and offerings that they have. The good news here, though, is that Microsoft has done a very good job of getting their virtualization drivers baked into the modern kernels of most Linux operating systems, making running Linux-based VMs in Azure fairly seamless. So here's the slide again, but now you're going to notice some slight differences. First off, in Azure we only support Enterprise mode. This is because the Azure storage product is very different from Google Cloud Storage and S3 on AWS. While we're working on getting this supported, and we're starting to focus on it, we're just not there yet. This means that since we're only supporting Enterprise mode in Azure, getting the local disk performance right is one of the keys to success for running Vertica here, with the other major key being making sure that you're getting the appropriate networking speeds. Overall, Azure is a really good platform for Vertica, and its performance and pricing are very much on par with AWS. But keep in mind that the newer versions of Linux operating systems like RHEL and CentOS run much better here than the older versions.
Okay, so first things first again: just like GCP, Azure VMs run on top of hardware that has hyperthreading enabled. And because of the way Hyper-V, Azure's virtualization engine, works, you can actually see this: if you look down into the CPU information of the VM, you'll see how it groups the vCPUs by core and by thread. Azure offers a lot of VM types and is adding new ones all the time, but for us, three VM types make the most sense for Vertica. For customers looking to run production workloads in Azure, the Es_v3 and the Ls_v2 series are the two main recommendations. While they differ slightly in the CPU-to-memory ratio and the I/O throughput, the Es_v3 series is probably the best recommendation for a generalized Vertica node, with the Ls_v2 series being recommended for workloads with higher I/O requirements. If you're just looking to deploy a sandbox environment, the Ds_v3 series is a very suitable choice that can really reduce your overall Cloud spend. VM storage in Azure is provided by a grouping of four different types of disks, all offering different levels of performance. Introduced at the end of last year, the Ultra Disk option is the highest-performing disk type for VMs in Azure. It was designed for database workloads where high throughput and low latency are very desirable. However, the Ultra Disk option is not available in all regions yet, although that's been changing slowly since its launch. The Premium SSD option, which has been around for a while and is widely available, can also offer really nice performance, especially at higher capacities. And just like other Cloud providers, the I/O throughput you get on VMs is dictated not only by the size of the disk, but also by the size of the VM and its type. A good rule of thumb here: VM types with an S will have a much better throughput rate than ones that don't, and the larger VMs will have higher I/O throughput than the smaller ones. You can expand VM disk throughput by using multiple disks in Azure and using a software RAID. This overcomes the limitations of single-disk performance, but keep in mind that you're now using CPU cycles to maintain that RAID, so it is a bit of a trade-off. The other nice thing in Azure is that all their managed disks are encrypted by default on the server side, so there's really nothing you need to do to enable that. And of course, as I mentioned earlier, there is no native access to Azure storage yet, but it is something we're working on. We have seen folks using third-party applications like MinIO to access Azure storage as an S3 bucket, so it might be something you want to keep in mind and maybe even test out for yourself. Networking in Azure comes in two different flavors, standard and accelerated. In standard networking, the entire network stack is abstracted and virtualized. This works really well; however, there are performance limitations, and standard networking tends to top out around four gigabits per second. Accelerated networking in Azure is based on single root I/O virtualization of the Mellanox adapter. This is basically the VM talking directly to the physical network card in the host hardware, and it can produce network speeds up to 20 gigabits per second, so much, much faster. Keep in mind, though, that not all VM types and operating systems actually support accelerated networking, and just like disk throughput, network throughput is based on VM type and size. So what do you need to think about for networking in the Azure space? Again, stay close to home. Pick regions that are geographically close to your location. Yes, the backbones between the regions are very, very fast, but the more hops your packets have to make, the longer it takes. Azure offers two types of groupings of their VMs: availability sets and availability zones. Availability zones offer good redundancy across multiple zones, but this actually increases the node-to-node latency, so we recommend you avoid them. Availability sets, on the other hand, keep all your VMs grouped together within a single zone, but make sure that no two VMs are running on the same host hardware, for redundancy. And just like the other Clouds, UDP broadcast is not supported, so you have to use the point-to-point flag when you're creating your database to ensure that spread works properly. Spread time out, okay, this is a good one. Recently, Microsoft has started monthly rolling updates of their environment. What this looks like is that VMs running on top of hardware that's receiving an update can be paused, and this becomes problematic when the pausing of the VM exceeds eight seconds, as the unpaused members of the cluster now think the paused VM is down. So consider adjusting the spread time out for your clusters in Azure to 30 seconds; this will help avoid a little of that.
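A sketch of what that spread time-out adjustment could look like from SQL, assuming the SET_SPREAD_OPTION meta-function and its TokenTimeout setting in milliseconds; the exact option name and units are worth confirming against your version's documentation.

    -- Raise the spread token timeout to 30 seconds so short VM pauses
    -- are not treated as node failures
    SELECT SET_SPREAD_OPTION('TokenTimeout', '30000');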
If you're deploying a large cluster in Azure, more than 20 nodes, use large cluster mode, as point-to-point spread doesn't really scale well with a lot of Vertica nodes. And finally, pick VM types and operating systems that support accelerated networking; the difference in node-to-node speeds can be very dramatic. So how do we move data around in Azure? Microsoft views data egress a little differently than other Clouds, as it classifies any data transmitted by a VM as egress; however, it only bills for data egress that actually leaves the Azure environment. Egress speed limits in Azure are based entirely on the VM type and size, and then they're limited by your connection to them. While not offering as many pathways to access their Cloud as GCP, Azure does offer a direct network-to-network connection called ExpressRoute. Offered through a large group of third-party partners, ExpressRoute provides multiple tiers of performance, based on a flat charge for inbound data and a metered charge for outbound data. And of course you can still access Azure via the internet, and securely through a VPN gateway. So on behalf of Jeff, Sumeet, and myself, I'd like to thank you for listening to our presentation today, and we're now ready for Q&A.
Putting Complex Data Types to Work
>> Jeff: Hello everybody, thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Putting Complex Data Types to Work." I'm Jeff Healey, I lead Vertica marketing, and I'll be your host for this breakout session. Joining me is Deepak Majeti, technical lead from Vertica engineering. But before we begin, I encourage you to submit questions and comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums at forum.vertica.com to post your questions there after the session; the engineering team is planning to join the forums to keep the conversation going. Also, as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Deepak. >> Deepak: Thanks, Jeff. I'm glad to talk about the complex data type work we've been doing in Vertica R&D. Without further delay, let's see why and how we should put complex data types to work in your data analytics. This is the outline, or overview, of my talk today. First, I'm going to talk about what complex data types are and some use cases. I will then quickly cover some file formats that support these complex data types. I will then dive into the current support for complex data types in Vertica. Finally, I'll conclude with some usage considerations, what is coming in our 10.0 release, and our future roadmap and direction for this project. So, what are complex data types? Complex data types are nested data structures composed of primitive types. Primitive types are nothing but your int, float, string, varbinary, and so on, the basic types. Some examples of complex data types include struct (also called row), array, list, set, map, and union. Composite types can also be built by composing other complex types. Complex types are very useful for handling sparse data; we'll see some examples of that use case in this presentation. They also help simplify analysis. So let's look at some examples of complex data types. In the first example, on the left, you can see a simple customer type, which is a struct with two fields: a field name of type string and a field ID of type integer. Structs are nothing but a group of fields, and each field has a type of its own; that type can be primitive or another complex type. On the right we have some example data for this simple customer complex type, basically two fields of type string and integer. In this case you have two rows, where the first row is a customer named Alex and the second row is a customer named Mary, each with its own ID. The second complex type on the left is phone_numbers, an array whose element type is string. An array is nothing but a collection of elements; the elements could again be a primitive type or another complex type. In this example, the elements are of type string, which is a primitive type. And on the right you have some example data for this array type called phone_numbers: basically, each row has a collection of phone numbers. In the first row we have two phone numbers, and in the second you have a single phone number in that array.
And the third type on the slide is the map data type. A map is nothing but a collection of key-value pairs: each element is a key and a value, and you have a collection of such elements. The key is usually a primitive type; the value, however, can be a primitive or a complex type. In this example, both the key and the value are of type string. If you look on the right side of the slide, you have some sample data. Here we have HTTP requests, where the key is the header name and the value is the header value. For instance, on the first row we have a key pragma with value no-cache and a key host with some hostname as its value, and on the second row you have a key called accept with a text/html value. Because arrays and maps are essentially collections of elements, they are commonly referred to as collections in many documents. So we saw examples of one-level complex types; on this slide we have nested complex types. On the right we have the root complex type called web_events, of type struct. It has four fields: a session ID of type integer, a session duration of type timestamp, and then the third and fourth fields, customer and HTTP requests, which are further complex types themselves. Customer is again a complex type, a struct with three fields, where the first two fields, name and ID, are primitive types, but the third field is another complex type, phone_numbers, which we just saw on the previous slide. Similarly, HTTP requests is the same map type that we just saw. In this example, each complex type is independent, and you can reuse a complex type inside other complex types; for example, you can build another type called orders and simply reuse the customer type. However, in a practical implementation you have to deal with complexities involving security, ownership, and lifecycle dependencies. So keeping complex types independent has the advantage of reuse, but the complication is that you have to deal with security, ownership, and lifecycle dependencies. On this slide we have another style of declaring a nested complex type, called an inlined complex data type. We have the same web_events struct type, but here the complex types are embedded into the parent type definition: the customer and HTTP request definitions are inlined into the parent struct. The advantage is that you won't have to deal with the security and other lifecycle dependency issues, with the downside being that you can't reuse them. So it's a trade-off between the two. So let's now see some use cases of these complex types. The first use case, or benefit, of using complex data types is that you can express analysis more naturally. Complex types simplify the expression of analysis logic, thereby simplifying your data pipelines. In SQL, it feels as if you have tables inside tables. Let's look at an example: say you want to list all the customers with more than one thousand website events. If you have complex types, you can simply create a table called web_events with one column of type web_events, the complex type we just saw, with its four fields: session ID, session duration, customer, and HTTP requests. So you can basically have the entire schema in one type. If you don't have complex types, you'll have to create four tables, essentially one for each complex type, and then you have to establish primary key and foreign key dependencies across these tables.
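To make that concrete, here is a minimal sketch of what such a strongly typed table and a dot-notation query over it could look like, written as an external table over Parquet data, where Vertica's ROW and ARRAY support is most mature. The field names, the file path, and the omission of the HTTP-request map (whose exact syntax varies by version) are illustrative choices, not the exact schema from the talk.

    -- One table with strongly typed nested columns instead of four joined tables
    CREATE EXTERNAL TABLE web_events (
        session_id       INT,
        session_duration TIMESTAMP,
        customer         ROW(name VARCHAR, id INT),   -- nested struct
        phone_numbers    ARRAY[VARCHAR]               -- collection column
    ) AS COPY FROM 's3://mybucket/web_events/*.parquet' PARQUET;

    -- Dot notation reaches into the struct; the array comes back whole
    SELECT w.customer.name, w.customer.id, w.phone_numbers
    FROM web_events w;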
So complex types simplify the query-writing part, and the execution itself is also simplified: you don't need joins. With complex types you simply have a load step to load the map type and then apply the function on top of it directly, whereas with separate tables you have to join all the data, apply the filter step, and then do another join to get your results.

The other advantage of complex types is that you can process semi-structured data very efficiently. For example, if you have data from clickstreams or page views, the data is often sparse, and maps are very well suited for such data. Maps are semi-structured by nature, and with this support you can represent semi-structured data alongside structured columns in the database. Maps are a nice way to encapsulate sparse data. As an example, the common fields of clickstream or page-view data are pragma, host, and accept. Without map types you end up creating a column for each of these header fields; with a map you can simply embed all of them as key-value pairs. On the left of the slide you can see an example with a separate column for each field: you end up with a lot of NULLs because the data is sparse. If you embed the fields into a map instead, they go into a single column, which gives you better efficiency and a better representation of sparse data. Imagine a clickstream or page view with thousands of fields: without a map type you would need thousands of columns to represent that data (a small sketch of this contrast follows at the end of this passage).

So given that these are the most commonly used complex types, let's see which file formats actually support them. Most of the popular file formats support complex data types, but with different variations. JSON supports arrays and objects, which are complex data types; however, JSON data is schema-less, row-oriented, and text-based, and because it is schema-less it has to store the keys in every row. The second file format is Avro; Avro has records, enums, arrays, maps, unions, and a fixed type. Avro has a schema, it is row-oriented, and it is binary and compressed. The third category is the Parquet and ORC style of file formats, which are columnar: Parquet and ORC support arrays, maps, and structs, they have a schema, they are column-oriented unlike Avro, which is row-oriented, and they are also binary and compressed, with support for some very nice compression and encoding types. The main difference between Parquet and ORC is in how they represent complex types: Parquet encodes the complex type hierarchy using repetition and definition levels, while ORC uses a separate column at every parent of the complex type to track presence and nulls.
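Here is the sparse-versus-map sketch referred to above. The wide table illustrates the "one column per header" approach; the flex table shows the map-style alternative in which only the headers actually present in each row are stored as key/value pairs. Table and column names are illustrative.

```sql
-- One column per header (sketch): rows are mostly NULL when the data is sparse.
CREATE TABLE page_views_wide (
    view_id INT,
    pragma  VARCHAR,
    host    VARCHAR,
    accept  VARCHAR
    -- ...potentially thousands more header columns, mostly NULL on any given row
);

-- Map-style alternative (sketch): headers land as key/value pairs in a single column.
CREATE FLEX TABLE page_views();
```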
Apart from that difference in how they represent complex types, Parquet and ORC have similar capabilities in terms of optimizations and compression techniques. So, to summarize: JSON has no schema, no binary format, and is not columnar; Avro has a schema and a binary format but is not columnar; Parquet and ORC have a schema, have a binary format, and are columnar.

So let's see how we can query these different kinds of complex types, and the different file formats they can arrive in, in Vertica. In Vertica we have a feature called flex tables with which you can load complex data types and analyze them. Flex tables use a binary format called VMap to store data as key-value pairs. Flex tables are schema-less, they are weakly typed, and they trade flexibility for performance. What I mean by schema-less is that the keys provide the field names, and each row can potentially have different keys. They are weakly typed because there is no type information at the column level; we will see some examples of this weak typing in the following slides, but basically there is no type information, and the data is stored in text format. Because of the weak typing and schema-less nature of flex tables, you can trivially implement needs like schema evolution, or keep the complex types fluid if that is your use case; flex tables give you that flexibility. The downside of the weak typing, however, is that you don't get the best possible performance. If your use case is to get the best possible performance, you can use the new strongly typed complex types that we have started to introduce in Vertica. These complex types have a schema, and they give you the best possible performance because the optimizer now has enough information from the schema and the types to apply column selection and all the other nice techniques Vertica employs, even for complex types. We'll see examples of both approaches in the following slides.

Now let's use a simple data set about restaurants, running throughout the rest of this talk, to see all the different variations of flex and complex types. On this slide you have some sample data with four fields and essentially two rows. The four fields are name, cuisine, locations, and menu: name and cuisine are of type varchar, locations is an array, and menu is an array of rows with two fields, item and price. If the data is in JSON, there is no schema and no type information, so how do we process that in Vertica? In Vertica you can simply create a flex table called restaurants, copy the restaurants.json file into Vertica, and start analyzing the data. If you do a SELECT * FROM restaurants, you will see that all the data is actually in one column called __raw__, plus another column called __identity__ that gives you a unique row ID. The __raw__ column encapsulates all the data that came in from the restaurants JSON file, and it is nothing but the VMap format.
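Those steps look roughly like the following sketch. CREATE FLEX TABLE, the fjsonparser, and MAPTOSTRING are standard flex-table features; the file path is a placeholder.

```sql
-- Sketch: load schema-less JSON into a flex table and inspect the raw VMap column.
CREATE FLEX TABLE restaurants();

COPY restaurants
FROM '/data/restaurants.json'     -- placeholder path
PARSER fjsonparser();

-- All fields land in __raw__ (VMap format); __identity__ provides a unique row id.
SELECT __identity__, MAPTOSTRING(__raw__) FROM restaurants;
```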
The VMap format is a binary format that encodes the data as key-value pairs, and the __raw__ column is backed by the LONG VARBINARY column type in Vertica. Each key gives you the field name and each value the field value; however, the values are kept in a text representation.

Now, say you want to get better performance out of this JSON data. Flex tables have some nice functions to analyze your data and try to extract schema and type information from it (a short sketch of these steps appears at the end of this passage). If you execute COMPUTE_FLEXTABLE_KEYS on the restaurants table, you will see a new table called public.restaurants_keys that gives you information about your JSON data. It was able to automatically infer that your data has four fields, namely name, cuisine, locations, and menu, and that name and cuisine are varchar. Since locations and menu are complex types themselves, one an array and one an array of rows, it keeps using the same VMap format to process them. So it ends up with four columns: two primitive columns of type varchar and two VMap columns. You can then materialize these columns by altering the table definition and adding columns of the types it inferred, and you get better performance from those materialized columns: the data is no longer in a single column, you have four columns for your restaurant data, and you can get column selection and the other optimizations Vertica provides on that data.

Alright, so flex tables are helpful if you don't have a schema and don't have any type information. However, we saw earlier that some file formats like Parquet and Avro do have a schema and type information. In those cases you don't have to do the first step of inferring the types: you can directly create an external table definition, point it at the Parquet file, and load it as an external table in Vertica. If you convert the same restaurants.json to Parquet format, you get the primitive fields directly; however, locations and menu are still in the VMap format.

The VMap format also lets you explode the data, and it has some nice functions to extract the fields from the VMap format, such as MAPITEMS. With the same restaurant data, if you want to explode and apply predicates on the fields of the arrays or the array of rows, you can use MAPITEMS to explode your data and then apply predicates on a particular field of the complex type. This slide shows how you can explode the entire data set, the menu items as well as the locations, and get the elements of each of these complex types.

As I mentioned, if you go back to the previous slide, the locations and menu items are still in LONG VARBINARY, the VMap format. So the question is: what if you want better performance on the VMap data? For primitive types you can materialize them into the primitive type, but for an array or an array of rows you need first-class complex type constructs, and that is what is being added to Vertica now. Vertica has started to introduce strongly typed complex types.
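A hedged sketch of those key-discovery, materialization, and explode steps. COMPUTE_FLEXTABLE_KEYS, MATERIALIZE_FLEXTABLE_COLUMNS, and MAPITEMS are existing flex-table helpers; the exact columns inferred for this particular sample data are assumptions.

```sql
-- Infer keys and candidate types from the loaded JSON, then inspect them.
SELECT COMPUTE_FLEXTABLE_KEYS('restaurants');
SELECT * FROM public.restaurants_keys;

-- Promote the inferred keys to real columns for better performance.
SELECT MATERIALIZE_FLEXTABLE_COLUMNS('restaurants');

-- MAPITEMS explodes a VMap into its key/value pairs for further filtering.
SELECT MAPITEMS(__raw__) OVER (PARTITION BEST) FROM restaurants;
```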
On this slide you have an example of a ROW complex type: we create an external table called customers with a ROW type of two fields, name and ID. The complex type is inlined into the column definition. In the second example you can see CREATE EXTERNAL TABLE items, which uses a nested ROW type: it has an item of type ROW with two fields, name, and properties, where properties is another nested ROW type with two fields, quantity and label. These are strongly typed complex types, and the optimizer can now give you better performance compared to the VMap by using the strong type information in your queries. We have support for pure rows and nested rows in external tables for Parquet, and we have support for arrays and nested arrays in external tables for Parquet as well: you can declare an external table called contacts with a field phone_numbers that is an array of integers, and similarly you can declare a column that is a nested array of integer items with that strongly typed complex type.

The other complex type support we are adding in this release is optimized one-dimensional arrays and sets, for both ROS (internal) tables and Parquet external tables. So you can create an internal table called phone_numbers with a one-dimensional array, here phone_numbers of type array of int. You can have sets as well, which are also one-dimensional collections, but sets are optimized for fast lookups: they have unique elements and they are ordered. So if fast lookups are your use case, sets will give you very quick lookups for elements. We also implemented some functions to support arrays and sets, for example APPLY_MIN and APPLY_MAX, which are scalar functions you can apply on top of an array column to get the minimum or maximum element, and so on; there is support for additional functions as well.

The other feature coming in 10.0 is the explode-arrays functionality. We have implemented a UDx that, similar to the MAPITEMS example you saw earlier, lets you extract elements from these arrays and apply different predicates or analysis on the elements. For example, if you have a restaurants table with a column name of type varchar, locations as an array of varchar, and menu again as an array of varchar, you can insert values using the array constructor into these columns. Here we insert one row for a restaurant called Lily's Pizza with locations Cambridge and Pittsburgh and menu items cheese and pepperoni, and another row for a restaurant called Bob's Tacos with location Houston and menu items tortilla, salsa, and patty. Now you can explode both arrays and extract the elements from them: you can explode the locations array and extract the location elements, which are Houston, Cambridge, Pittsburgh, and New Jersey, and you can also explode the menu items and extract the individual elements, and then apply other predicates on the exploded data.
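A hedged sketch of those declarations and helpers follows. The ROW, ARRAY, and SET syntax, the APPLY_MIN/APPLY_MAX functions, and the EXPLODE transform reflect the features described in the talk, but the table names, file paths, and sample values are illustrative, and exact signatures may differ by release.

```sql
-- ROW (struct) columns in Parquet external tables, including a nested ROW.
CREATE EXTERNAL TABLE customers (
    cust ROW(name VARCHAR, id INT)
) AS COPY FROM '/data/customers/*.parquet' PARQUET;

CREATE EXTERNAL TABLE items (
    item ROW(name VARCHAR, properties ROW(quantity INT, label VARCHAR))
) AS COPY FROM '/data/items/*.parquet' PARQUET;

-- One-dimensional array and set columns in an internal (ROS) table.
CREATE TABLE phone_numbers (user_id INT, numbers ARRAY[INT]);
CREATE TABLE favorite_tags (user_id INT, tags SET[VARCHAR]);

-- Collection helpers over an array column.
SELECT user_id, APPLY_MIN(numbers), APPLY_MAX(numbers) FROM phone_numbers;

-- Array constructors on insert, then EXPLODE to flatten the elements.
CREATE TABLE restaurants_arr (name VARCHAR, locations ARRAY[VARCHAR], menu ARRAY[VARCHAR]);
INSERT INTO restaurants_arr
    SELECT 'Lilys Pizza', ARRAY['Cambridge','Pittsburgh'], ARRAY['cheese','pepperoni'];
SELECT EXPLODE(locations) OVER (PARTITION BEST) FROM restaurants_arr;
```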
Alright, so let's look at some usage considerations for these complex data types. As we saw earlier, complex data types are nice if you have sparse data: if your data is clickstream or page-view data, maps are a very good way to represent it, and they represent it efficiently, space-wise. So for sparse data, use map types. And as we saw with the web-request count query, they help simplify the analysis as well: you don't need joins, which simplifies your queries. If your use case is fast lookups, you can use the set type: arrays are nice and they preserve ordering, but if your primary use case is just to look up certain elements, the set type is the better fit. You can also use the VMap or flex functionality in Vertica if you want flexibility in your complex data type schema. As I mentioned earlier, you can trivially implement needs like schema evolution, or even keep the complex types fluid. If you go through multiple iterations of your analysis and the fields change in each iteration because you are still exploring the data, VMap and flex make it easy to change the fields within the complex type or across files, and you can easily load fluid complex types, that is, different fields in different rows, into VMap and flex tables. However, once you have iterated over your data and figured out which fields and complex types you really need, you can switch to the strongly typed complex data types we have started to introduce in Vertica: you can use the array type, the struct type, and the map type for your data analysis (a small sketch of this progression appears at the end of this passage). So at a high level, the choice depends a lot on where you are in your data analysis: early on, your data is usually still fluid, and you might want to use VMap and flex to explore it; once you finalize your schema, you can use the strongly typed complex data types to get the best possible performance.

Alright, so what's coming in the following releases of Vertica? In 10.0, which is the next release of Vertica and is coming soon, we are adding support for loading Parquet complex data types into the VMap format. Parquet is a strongly typed file format: it has a schema and the type information for each complex type. But if you are still exploring your data, you might have different Parquet files with different schemas, so you can load them into the VMap format first, analyze your data, and then switch to the strongly typed complex types. We are also adding one-dimensional optimized arrays and sets in ROS and for Parquet, so complex types are not limited to Parquet; you can also store them in ROS, though right now we only support one-dimensional arrays and sets in ROS. We are also adding the EXPLODE UDx for one-dimensional arrays in this release, so, as you saw in the previous example, you can explode array data and apply predicates on the individual elements; this applies to sets as well, which you can cast to arrays and explode.

So what are the plans past that release? We are going to continue work on strongly typed complex types. In the 10.0 release we won't have support for all combinations of complex types; we only have support for nested arrays or nested pure rows, and some of it is limited to the Parquet file format. We will continue to add more support for subqueries and nested complex types in the following releases. We are also planning to add a VMap data type.
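As a small sketch of the explore-first, strongly-type-later progression described above, again using the restaurant example; the path is a placeholder and the array-of-rows declaration is assumed syntax.

```sql
-- Exploration phase: fluid, schema-less fields go into a flex table (VMap).
CREATE FLEX TABLE restaurants_explore();
COPY restaurants_explore FROM '/data/raw/*.json' PARSER fjsonparser();  -- placeholder path

-- Once the fields are settled: a strongly typed table for the best performance.
CREATE TABLE restaurants_final (
    name      VARCHAR,
    cuisine   VARCHAR,
    locations ARRAY[VARCHAR],
    menu      ARRAY[ROW(item VARCHAR, price FLOAT)]   -- assumed syntax for an array of rows
);
```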
As you saw in the examples, the VMap format is currently backed by the LONG VARBINARY column type, and because of this the optimizer cannot really distinguish which data is a plain LONG VARBINARY and which is data in the VMap format. The idea is to add a type called VMap so that the optimizer can implement optimizations, or even syntax such as dot notation, for it. And if your data is columnar, such as Parquet, you can then implement optimizations like key push-down, where you push down the keys you are actually querying in your analysis so that only those keys are loaded from Parquet and built into the VMap format. That way you get the column selection optimization for complex types as well, and that is something you can achieve once there is a distinct type for the VMap format, so that is on the roadmap too.

Unnest join is another nice-to-have feature. Right now, if you want to explode and join the array elements, you have to explode in a subquery and then join the data in the outer query; with unnest join you will be able to explode and join the data in the same query, on the fly, doing both at once.

And finally, we are also adding support for a new feature called UDx vector; that is on the plan too. Our work for complex types is essentially changing the fundamental way Vertica executes functions and expressions. Right now, all expressions in Vertica can return only a single column, with exceptions in some cases like UDx transforms and so on; a UDx scalar, for instance, can return only one column. However, there are use cases where you want to perform multiple computations on the same input data. Say your input is two integer columns and you want to compute both the addition and the multiplication of those two columns; this is a toy example, but many machine learning use cases have similar patterns. If you want to do both computations on the data at the same time, then in the current approach you have to have one function for addition and one function for multiplication, and both of them have to load the data, so you load the data twice to get the two results. With the UDx vector support you can perform both computations in the same function and return two columns, essentially saving you from loading those columns twice: you load them once and get both results out. That is what we are trying to enable with the changes we are making to support complex data types in Vertica. And you won't have to use an OVER clause as you do with a UDx transform; just like UDx scalars today, you will have your vector function and can return multiple columns from your computations.

So that concludes my talk. Thank you for listening to my presentation; we are now ready for Q&A.