UNLIST TILL 4/2 - Sizing and Configuring Vertica in Eon Mode for Different Use Cases

>> Jeff: Hello everybody, and thank you for joining us today, in the virtual Vertica BDC 2020. Today's Breakout session is entitled, "Sizing and Configuring Vertica in Eon Mode for Different Use Cases". I'm Jeff Healey, and I lead Vertica Marketing. I'll be your host for this Breakout session. Joining me are Sumeet Keswani, and Shirang Kamat, Vertica Product Technology Engineers, and key leads on the Vertica customer success needs. But before we begin, I encourage you to submit questions or comments during the virtual session, you don't have to wait, just type your question or comment in the question box below the slides, and click submit. There will be a Q&A session at the end of the presentation, we will answer as many questions as we're able to during that time, any questions we don't address, we'll do our best to answer them off-line. Alternatively, visit Vertica Forums, at forum.vertica.com, post your question there after the session. Our Engineering Team is planning to join the forums to keep the conversation going. Also as reminder, that you can maximize your screen by clicking the double arrow button in the lower-right corner of the slides, and yes, this virtual session is being recorded, and will be available to view on-demand this week. We'll send you a notification as soon as it's ready. Now let's get started! Over to you, Shirang. >> Shirang: Thanks Jeff. So, for today's presentation, we have picked Eon Mode concepts, we are going to go over sizing guidelines for Eon Mode, some of the use cases that you can benefit from using Eon Mode. And at last, we are going to talk about, some tips and tricks that can help you configure and manage your cluster. Okay. So, as you know, Vertica has two modes of operation, Eon Mode and Enterprise Mode. So the question that you may have is, which mode should I implement? So let's look at what's there in the Enterprise Mode. Enterprise Mode, you have a cluster, with general purpose compute nodes, that have locally at their storage. Because of this tight integration of compute and storage, you get fast and reliable performance all the time. Now, amount of data that you can store in Enterprise Mode cluster, depends on the total disk capacity of the cluster. Again, Enterprise Mode is more suitable for on premise and cloud deployments. Now, let's look at Eon Mode. To take advantage of cloud economics, Vertica implemented Eon Mode, which is getting very popular among our customers. In Eon Mode, we have compute and storage, that are separated by introducing S3 Bucket, or, S3 compliant storage. Now because of this separation of compute and storage, you can take advantages like mapping all dynamic scale-out and scale-in. Isolation of your workload, as well as you can load data in your cluster, without having to worry about the total disk capacity of your local nodes. Obviously, you know, it's obvious from what they accept, Eon Mode is suitable for cloud deployment. Some of our customers who take advantage of the features of Eon Mode, are also deploying it on premise, by introducing S3 compliant slash web storage. Okay? So, let's look at some of the terminologies used in Eon Mode. The four things that I want to talk about are, communal storage. It's a shared storage, or S3 compliant shared storage, a bucket that is accessible from all the nodes in your cluster. Shard, is a segment of data, stored on the communal storage. Subscription, is the binding with nodes and shards. And last, depot. Depot is a local copy or, a local cache, that can help query in group performance. So, shard is a segment of data stored in communal storage. When you create a Eon Mode cluster, you have to specify the shard count. Shard count decide the maximum number of nodes that will participate in your query. So, Vertica also will introduce a shard, called replica shard, that will hold the data for replicated projections. Subscriptions, as I said before, is a binding between nodes and shards. Each node subscribes to one or more shards, and a shard has at least two nodes that subscribe to it for case 50. Subscribing nodes are responsible for writing and reading from shard data. Also subscriber node holds up-to-date metadata for a catalog of files that are present in the shard. So, when you connect to Vertica node, Vertica will automatically assign you set of nodes and subscriptions that will process your query. There are two important system tables. There are node subscriptions, and session subscriptions, that can help you understand this a little bit more. So let's look at what's on the local disk of your Eon Mode cluster. So, on local disk, you have depot. Depot is a local file system cache, that can hold subset of the data, or copy of the data, in communal storage. Other things that are there, are temp storage, temp storage is used for storing data belonging to temporary tables, and, the data that spills through this, when you are processing queries. And last, is catalog. Catalog is a persistent copy of Vertica, catalog that is written to this. The writes happen at every commit. You only need the persistent copy at node startup. There is also a copy of Vertica catalog, stored in communal storage, called durability. The local copy is synced to the copy in communal storage via service, at the interval of five minutes. So, let's look at depot. Now, as I said before, depot is your file system cache. It's help to reduce network traffic, and slow performance of your queries. So, we make assumption, that when we load data in Vertica, that's the data that you may most frequently query. So, every data that is loaded in Vertica is first entering the depot, and then as a part of same transaction, also synced to communal storage for durability. So, when you query, when you run a query against Vertica, your queries are also going to find the files in the depot first, to be used, and if the files are not found, the queries will access files from communal storage. Now, the behavior of... you know, the new files, should first enter the depot or skip depot can be changed by configuration parameters that can help you skip depot when writing. When the files are not found in depot, we make assumption that you may need those files for future runs of your query. Which means we will fetch them asynchronously into the depot, so that you have those files for future runs. If that's not the behavior that you intend, you can change configuration around return, to tell Vertica to not fetch them when you run your query, and this configuration parameter can be set at database level, session level, query level, and we are also introducing a user level parameter, where you can change this behavior. Because the depot is going to be limited in size, compared to amount of data that you may store in your Eon cluster, at some point in time, your depot will be full, or hit the capacity. To make space for new data that is coming in, Vertica will evict some of the files that are least frequently used. Hence, depot is going to be your query performance enhancer. You want to shape the extent of your depot. And, so what you want to do is, to decide what shall be in your depot. Now Vertica provides some of the policies, called pinning policies, that can help you pin of statistics table or addition of a table, into a depot, at subcluster level, or at the database level. And Sumeet will talk about this a bit more in his future slides. Now look at some of the system tables that can help you understand about the size of the depot, what's in your depot, what files were evicted, what files were recently fetched into the depot. One of the important system tables that I have listed here is DC_FILE_READS. DC_FILE_READS can be used to figure out if your transaction or query fetched with data from depot, from communal storage, or component. One of the important features of Eon Mode is a subcluster. Vertica lets you divide your cluster into smaller execution groups. Now, each of the execution groups has a set of nodes together subscribed to all the shards, and can process your query independently. So when you connect one node in the subcluster, that node, along with other nodes in the subcluster, will only process your query. And because of that, we can achieve isolation as well as, you know, fetches, scale-out and scale-in without impacting what's happening on the cluster. The good thing about subclusters, is all the subclusters have access to the communal storage. And because of this, if you load data in one subcluster, it's accessible to the queries that are running in other subclusters. When we introduced subclusters, we knew that our customers would really love these features, and, some of the things that we were considering is, we knew that our customers would dynamically scale out and in, lots of-- they would add and remove lots of subclusters on demand, and we had to provide that ab-- we had to give this feature, or provide ability to add and remove subclusters in a fast and reliable way. We knew that during off-peak hours, our customers would shut down many of their subclusters, that means, more than half of the nodes could be down. And we had to make adjustment to our quorum policy which requires at least half of the nodes to be up for database to stay up. We also were aware that customers would add hundreds of nodes in the cluster, which means we had to make adjustments to the catalog and commit policy. To take care of all these three requirements we introduced two types of subclusters, primary subclusters, and secondary subclusters. Primary subclusters is the one that you get by default when you create your first Eon cluster. The nodes in the primary subclusters are always up, that means they stay up and participate in the quorum. The nodes in the primary subcluster are responsible for processing commits, and also maintain a persistent copy, of catalog on disk. This is a subcluster that you would use to process all your ETL jobs, because the topper more also runs on the node, in the primary subcluster. If you want now at this point, have another subcluster, where you would like to run queries, and also, build this cluster up and down depending on the demand or the, depending on the workload, you would create a new subcluster. And this subcluster will be off-site secondary in nature. Now secondary subclusters have nodes that don't participate in quorums, so if these nodes are down, Vertica has no impact. These nodes are also not responsible for processing commit, though they maintain up-to-date copies of the catalog in memory. They don't store catalog on disk. And these are subclusters that you can add and remove very quickly, without impacting what is running on the other subclusters. We have customers running hundreds of nodes, subclusters with hundreds of nodes, and subclusters of size like 64 node, and they can bring this subcluster up and down, or add and remove, within few minutes. So before I go into the sizing of Eon Mode, I just want to say one more thing here. We are working very closely with some of our customers who are running Eon Mode and getting better feedback from that on a regular basis. And based on the feedback, we are making lots of improvements and fixes in every hot-fix that we put out. So if you are running Eon Mode, and want to be part of this group, I suggest that, you keep your cluster current with latest hot-fixes and work with us to give us feedback, and get the improvements that you need to be successful. So let's look at what there-- What we need, to size Eon clusters. Sizing Eon clusters is very different from sizing Enterprise Mode cluster. When you are running Enterprise Mode cluster or when you're sizing Vertica cluster running Enterprise Mode, you need to take into account the amount of data that you want to store, and the configuration of your node. Depending on which you decide, how many nodes you will need, and then start the cluster. In Eon Mode, to size a cluster, you need few things like, what should be your shard count. Now, shard count decides the maximum number of nodes that will participate in your query. And we'll talk about this little bit more in the next slide. You will decide on number of nodes that you will need within a subcluster, the instance type you will pick for running statistic subcluster, and how many subclusters you will need, and how many of them should be running all the time, and how many should be running in a dynamic mode. When it comes to shard count, you have to pick shard count up front, and you can't change it once your database is up and running. So, we... So, you need to pick shard count depending the number of nodes, are the same number of nodes that you will need to process a query. Now one thing that we want to remember here, is this is not amount of data that you have in database, but this is amount of data your queries will process. So, you may have data for six years, but if your queries process last month of data, on most of the occasions, or if your dashboards are processing up to six weeks, or ten minutes, based on whatever your needs are, you will decide or pick the number of shards, shard count and nodes, based on how much data your queries process. Looking at most of our customers, we think that 12 is a good number that should work for most of our customers. And, that means, the maximum number of nodes in a subcluster that will process queries is going to be 12. If you feel that, you need more than 12 nodes to process your query, you can pick other numbers like 24 or 48. If you pick a higher number, like 48, and you go with three nodes in your subcluster, that means node subscribes to 16 primary and 16 secondary shard subscription, which totals to 32 subscriptions per node. That will leave your catalog in a broken state. So, pick shard count appropriately, don't pick prime numbers, we suggest 12 should work for most of our customers, if you think you process more than, you know, the regular, the regular number that, or you think that your customers, you think your queries process terabytes of data, then pick a number like 24. Don't pick a prime number. Okay? We are also coming up with features in Vertica like current scaling, that will help you run more-- run queries on more than, more nodes than the number of shards that you pick. And that feature will be coming out soon. So if you have picked a smaller shard count, it's not the end of the story. Now, the next thing is, you need to pick how many nodes you need within your subclusters, to process your query. Ideal number would be node number equal to shard count, or, if you want to pick a number that is less, pick node count which is such that each of the nodes has a balanced distribution of subscriptions. When... So over here, you can have, option where you can have 12 nodes and 12 shards, or you can have two subclusters with 6 nodes and 12 shards. Depending on your workload, you can pick either of the two options. The first option, where you have 12 nodes and 12 shards, is more suitable for, more suitable for batch applications, whereas two subclusters with, with six nodes each, is more suitable for desktop type applications. Picking subclusters is, it depends on your workload, you can add remove nodes relative to isolation, or Elastic Throughput Scaling. Your subclusters can have nodes of different sizes, and you need to make sure that the nodes within the subcluster have to be homogenous. So this is my last slide before I hand over to Sumeet. And this I think is very important slide that I want you to pay attention to. When you pick instance, you are going to pick instance based on workload and query budget. I want to make it clear here that we want you to pay attention to the local disk, because you have depot on your local disk, which is going to be your query performance enhancer for all kinds of deployment, in cloud, as well as on premise. So you'd expect of what you read, or what you heard, depots still play a very important role in every Eon deployment, and they act like performance enhancers. Most of our customers choose Vertica because they love the performance we offer, and we don't want you to compromise on the performance. So pick nodes with some amount of local disk, at least two terabytes is what we suggest. i3 instances in Amazon have, you know, come up with a good local disk that is very helpful, and some of our customers are benefiting from. With that said, I want to pass it over to Sumeet. >> Sumeet: So, hi everyone, my name is Sumeet Keswani, and I'm a Product Technology Engineer at Vertica. I will be discussing the various use cases that customers deploy in Eon Mode. After that, I will go into some technical details of SQL, and then I'll blend that into the best practices, in Eon Mode. And finally, we'll go through some tips and tricks. So let's get started with the use cases. So a very basic use case that users will encounter, when they start Eon Mode the first time, is they will have two subclusters. The first subcluster will be the primary subcluster, used for ETL, like Shirang mentioned. And this subcluster will be mostly on, or always on. And there will be another subcluster used for, purely for queries. And this subcluster is the secondary subcluster and it will be on sometimes. Depending on the use case. Maybe from nine to five, or Monday to Friday, depending on what application is running on it, or what users are doing on it. So this is the most basic use case, something users get started with to get their feet wet. Now as the use of the deployment of Eon Mode with subcluster increases, the users will graduate into the second use case. And this is the next level of deployment. In this situation, they still have the primary subcluster which is used for ETL, typically a larger subcluster where there is more heavier ETL running, pretty much non-stop. Then they have the usual query subcluster which will use for queries, but they may add another one, another secondary subcluster for ad-hoc workloads. The motivation for this subcluster is to isolate the unpredictable workload from the predictable workload, so as not to impact certain isolates. So you may have ad-hoc queries, or users that are running larger queries or bad workloads that occur once in a while, from running on a secondary subcluster, on a different secondary subcluster, so as to not impact the more predictable workload running on the first subcluster. Now there is no reason why these two subclusters need to have the same instances, they can have different number of nodes, different instance types, different depot configurations. And everything can be different. Another benefit is, they can be metered differently, they can be costed differently, so that the appropriate user or tenant can be billed the cost of compute. Now as the use increases even further, this is what we see as the final state of a very advanced Eon Mode deployment here. As you see, there is the primary subcluster of course, used for ETL, very heavy ETL, and that's always on. There are numerous secondary subclusters, some for predictable applications that have a very fine-tuned workload that needs a definite performance. There are other subclusters that have different usages, some for ad-hoc queries, others for demanding tenants, there could be still more subclusters for different departments, like Finance, that need it maybe at the end of the quarter. So very, very different applications, and this is the full and final promise of Eon, where there is workload isolation, there is different metering, and each app runs in its own compute space. Okay, so let's talk about a very interesting feature in Eon Mode, which we call Hibernate and Revive. So what is Hibernate? Hibernating a Vertica database is the act of dissociating all the computers on the database, and shutting it down. At this point, you shut down all compute. You still pay for storage, because your data is in the S3 bucket, but all the compute has been shut down, and you do not pay for compute anymore. If you have reserved instances, or any other instances you can use them for different applications, and your Vertica database is shut down. So this is very similar to stop database, in Eon Mode, you're stopping all compute. The benefit of course being that you pay nothing anymore for compute. So what is Revive, then? The Revive is the opposite of Hibernate, where you now associate compute with your S3 bucket or your storage, and start up the database. There is one limitation here that you should be aware of, is that the size of the database that you have during Hibernate, you must revive it the same size. So if you have a 12-node primary subcluster when hibernating, you need to provision 12 nodes in order to revive. So one best practice comes down to this, is that you must shrink your database to the smallest size possible before you hibernate, so that you can revive it in the same size, and you don't have to spin up a ton of compute in order to revive. So basically, what this means is, when you have decided to hibernate, we ask you to remove all your secondary subclusters and shrink your primary subcluster down to the bare minimum before you hibernate it. And the benefit being, is when you do revive, you will have, you will be able to do so with the mimimum number of nodes. And of course, before you hibernate, you must cleanly shut down the database, so that all the data can be synced to S3. Finally, let's talk about backups and replication. Backups and replications are still supported in Eon Mode, we sometimes get the question, "We're in S3, and S3 has nine nines of reliability, we need a backup." Yes, we highly recommend backups, you can back-up by using the VBR script, you can back-up your database to another bucket, you can also copy the bucket and revive to a different, revive a different instance of your database. This is very useful because many times people want staging or development databases, and they need some of the data from production, and this is a nice way to get that. And it also makes sure that if you accidentally delete something you will be able to get back your data. Okay, so let's go into best practices now. I will start, let's talk about the depot first, which is the biggest performance enhancer that we see for queries. So, I want to state very clearly that reading from S3, or a remote object store like S3 is very slow, because data has to go over the network, and it's very expensive. You will pay for access cost. This is where S3 is not very cheap, is that every time you access the data, there is an ATI and access cost levied. Now the depot is a performance enhancing feature that will improve the performance of queries by keeping a local cache of the data that is most frequently used. It will also reduce the cost of accessing the data because you no longer have to go to the remote object store to get the data, since it's available on a local and permanent volume. Hence depot shaping is a very important aspect of performance tuning in an Eon database. What we ask you to do is, if you are going to use a specific table or partition frequency, you can choose to pin it, in the depot, so that if your depot is under pressure or is highly utilized, these objects that are most frequently used are kept in the depot. So therefore, depot, depot shaping is the act of setting eviction policies, instead you prevent the eviction of files that you believe you need to keep, so for example, you may keep the most recent year's data or the most recent, recent partition in the depot, and thereby all queries running on those partitions will be faster. At this time, we allow you to pin any table or partition in the depot, but it is not subcluster-based. Future versions of Vertica will allow you fine-tuning the depot based on each subcluster. So, let's now go and understand a little bit of internals of how a SQL query works in Eon Mode. And, once I explain this, we will blend into best practice and it will become much more clearer why we recommend certain things. So, since S3 is our layer of durability, where data is persistent in an Eon database. When you run an insert query, like, insert into table value one, or something similar. Data is synchronously written into S3. So, it will control returns back to the client, the copy of the data is first stored in the local depot, and then uploaded to S3. And only then do we hand the control back to the client. This ensures that if something bad were to happen, the data will be persistent. The second, the second types of SQL transactions are what we call DTLs, which are catalog operations. So for example, you create a table, or you added a column. These operations are actually working with metadata. Now, as you may know, S3 does not offer mutable storage, the storage in S3 is immutable. You can never append to a file in S3. And, the way transaction logs work is, they are append operation. So when you modify the metadata, you are actually appending to a transaction log. So this poses an interesting challenge which we resolve by appending to the transaction log locally in the catalog, and then there is a service that syncs the catalog to S3 every five minutes. So this poses an interesting challenge, right. If you were to destroy or delete an instance abruptly, you could lose the commits that happened in the last five minutes. And I'll speak to this more in the subsequent slides. Now, finally let's look at, drops or truncates in Eon. Now a drop or a truncate is really a combination of the first two things that we spoke about, when you drop a table, you are making, a drop operation, you are making a metadata change. You are telling Vertica that this table no longer exists, so we go into the transaction log, and append into the transaction log, that this table has been removed. This log of course, will be synced every five minutes to S3, like we spoke. There is also the secondary operation of deleting all the files that were associated with data in this table. Now these files are on S3. And we can go about deleting them synchronously, but that would take a lot of time. And we do not want to hold up the client for this duration. So at this point, we do not synchronously delete the files, we put the files that need to be removed in a reaper queue. And return the control back to the client. And this has the performance benefit as to the drops appear to occur really fast. This also has a cost benefit, batching deletes, in big batches, is more performant, and less costly. For example, on Amazon, you could delete 1,000 files at a time in a single cost. So if you batched your deletes, you could delete them very quickly. The disadvantage of this is if you were to terminate a Vertica customer abruptly, you could leak files in S3, because the reaper queue would not have had the chance to delete these files. Okay, so let's, let's go into best practices after speaking, after understanding some technical details. So, as I said, reading and writing to S3 is slow and costly. So, the first thing you can do is, avoid as many round trips to S3 as possible. The bigger the batches of data you load, the better. The better performance you get, per commit. The fact thing is, don't read and write from S3 if you can avoid it. A lot of our customers have intermediate data processing which they think temporarily they will transform the data before finally committing it. There is no reason to use regular tables for this kind of intermediate data. We recommend using local temporary tables, and local temporary tables have the benefit of not having to upload data to S3. Finally, there is another optimization you can make. Vertica has the concept of active partitions and inactive partitions. Active partitions are the ones where you have recently loaded data, and Vertica is lazy about merging these partitions into a single ROS container. Inactive partitions are historical partitions, like, consider last year's data, or the year before that data. Those partitions are aggressively merging into a single container. And how do we know how many partitions are active and inactive? Well that's based on the configuration parameter. If you load into an inactive partition, Vertica is very aggressive about merging these containers, so we download the entire partition, merge the records that you loaded into it, and upload it back again. This creates a lot of network traffic, and I said, accessing data is, from S3, slow and costly. So we recommend you not load into inactive partitions. You should load into the most recent or active partitions, and if you happen to load into inactive partitions, set your active partition count correctly. Okay, let's talk about the reaper queue. Depending on the velocity of your ETL, you can pile up a lot of files that need to be deleted asynchronously. If you were were to terminate a Vertica customer without allowing enough time for these files to get deleted, you could leak files in S3. Now, of course if you use local temporary tables this problem does not occur because the files were never created in S3, but if you are using regular tables, you must allow Vertica enough time to delete these files, and you can change the interval at which we delete, and how much time we allow to delete and shut down, by exiting some configuration parameters that I have mentioned here. And, yeah. Okay, so let's talk a little bit about a catalog at this point. So, the catalog is synced every five minutes onto S3 for persistence. And, the catalog truncation version is the minimum, minimal viable version of the catalog to which we can revive. So, for instance, if somebody destroyed a Vertica cluster, the entire Vertica cluster, the catalog truncation version is the mimimum viable version that you will be able to revive to. Now, in order to make sure that the catalog truncation version is up to date, you must always shut down your Vertica cluster cleanly. This allows the catalog to be synced to S3. Now here are some SQL commands that you can use to see what the catalog truncation version is on S3. For the most part, you don't have to worry about this if you're shutting down cleanly, so, this is only in cases of disaster or some event where all nodes were terminated, without... without the user's permission. And... And finally let's talk about backups, so one more time, we highly recommend you take backups, you know, S3 is designed for 99.9% availability, so there could be a, maybe an occasional down-time, making sure you have backups will help you if you accidentally drop a table. S3 will not protect you against data that was deleted by accident, so, having a backup helps you there. And why not backup, right, storage is cheap. You can replicate the entire bucket and have that as a backup, or have DR plus, you're running in a different region, which also sources a backup. So, we highly recommend that you make backups. So, so with this I would like to, end my presentation, and we're ready for any questions if you have it. Thank you very much. Thank you very much.

Published Date : Mar 30 2020

SUMMARY :

Also as reminder, that you can maximize your screen and get the improvements that you need to be successful. So, the first thing you can do is,

ENTITIES

Entity	Category	Confidence
Jeff	PERSON	0.99+
Sumeet	PERSON	0.99+
Sumeet Keswani	PERSON	0.99+
Shirang Kamat	PERSON	0.99+
Jeff Healey	PERSON	0.99+
6 nodes	QUANTITY	0.99+
Vertica	ORGANIZATION	0.99+
five minutes	QUANTITY	0.99+
six years	QUANTITY	0.99+
ten minutes	QUANTITY	0.99+
12 nodes	QUANTITY	0.99+
Shirang	PERSON	0.99+
1,000 files	QUANTITY	0.99+
one	QUANTITY	0.99+
12 shards	QUANTITY	0.99+
forum.vertica.com	OTHER	0.99+
99.9%	QUANTITY	0.99+
two modes	QUANTITY	0.99+
S3	TITLE	0.99+
Amazon	ORGANIZATION	0.99+
first subcluster	QUANTITY	0.99+
first time	QUANTITY	0.99+
two options	QUANTITY	0.99+
first	QUANTITY	0.99+
first option	QUANTITY	0.99+
each	QUANTITY	0.99+
two subclusters	QUANTITY	0.99+
Each node	QUANTITY	0.99+
hundreds of nodes	QUANTITY	0.99+
Today	DATE	0.99+
each app	QUANTITY	0.99+
today	DATE	0.99+
last year	DATE	0.99+
second	QUANTITY	0.99+
One	QUANTITY	0.98+
three nodes	QUANTITY	0.98+
SQL	TITLE	0.98+
Eon Mode	TITLE	0.98+
single container	QUANTITY	0.97+
this week	DATE	0.97+
16 secondary shard subscription	QUANTITY	0.97+
two types	QUANTITY	0.97+
Sizing and Configuring Vertica in Eon Mode for Different Use Cases	TITLE	0.97+
Vertica	TITLE	0.97+
one limitation	QUANTITY	0.97+

UNLIST TILL 4/2 - Vertica in Eon Mode: Past, Present, and Future

>> Paige: Hello everybody and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled Vertica in Eon Mode past, present and future. I'm Paige Roberts, open source relations manager at Vertica and I'll be your host for this session. Joining me is Vertica engineer, Yuanzhe Bei and Vertica Product Manager, David Sprogis. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait till the end. Just type your question or comment as you think of it in the question box, below the slides and click Submit. Q&A session at the end of the presentation. We'll answer as many of your questions as we're able to during that time, and any questions that we don't address, we'll do our best to answer offline. If you wish after the presentation, you can visit the Vertica forums to post your questions there and our engineering team is planning to join the forums to keep the conversation going, just like a Dev Lounge at a normal in person, BDC. So, as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides, if you want to see them bigger. And yes, before you ask, this virtual session is being recorded and will be available to view on demand this week. We are supposed to send you a notification as soon as it's ready. All right, let's get started. Over to you, Dave. >> David: Thanks, Paige. Hey, everybody. Let's start with a timeline of the life of Eon Mode. About two years ago, a little bit less than two years ago, we introduced Eon Mode on AWS. Pretty specifically for the purpose of rapid scaling to meet the cloud economics promise. It wasn't long after that we realized that workload isolation, a byproduct of the architecture was very important to our users and going to the third tick, you can see that the importance of that workload isolation was manifest in Eon Mode being made available on-premise using Pure Storage FlashBlade. Moving to the fourth tick mark, we took steps to improve workload isolation, with a new type of subcluster which Yuanzhe will go through and to the fifth tick mark, the introduction of secondary subclusters for faster scaling and other improvements which we will cover in the slides to come. Getting started with, why we created Eon Mode in the first place. Let's imagine that your database is this pie, the pecan pie and we're loading pecan data in through the ETL cutting board in the upper left hand corner. We have a couple of free floating pecans, which we might imagine to be data supporting external tables. As you know, the Vertica has a query engine capability as well which we call external tables. And so if we imagine this pie, we want to serve it with a number of servers. Well, let's say we wanted to serve it with three servers, three nodes, we would need to slice that pie into three segments and we would serve each one of those segments from one of our nodes. Now because the data is important to us and we don't want to lose it, we're going to be saving that data on some kind of raid storage or redundant storage. In case one of the drives goes bad, the data remains available because of the durability of raid. Imagine also, that we care about the availability of the overall database. Imagine that a node goes down, perhaps the second node goes down, we still want to be able to query our data and through nodes one and three, we still have all three shards covered and we can do this because of buddy projections. Each neighbor, each nodes neighbor contains a copy of the data from the node next to it. And so in this case, node one is sharing its segment with node two. So node two can cover node one, node three can cover node two and node one back to node three. Adding a little bit more complexity, we might store the data in different copies, each copy sorted for a different kind of query. We call this projections in Vertica and for each projection, we have another copy of the data sorted differently. Now it gets complex. What happens when we want to add a node? Well, if we wanted to add a fourth node here, what we would have to do, is figure out how to re-slice all of the data in all of the copies that we have. In effect, what we want to do is take our three slices and slice it into four, which means taking a portion of each of our existing thirds and re-segmenting into quarters. Now that looks simple in the graphic here, but when it comes to moving data around, it becomes quite complex because for each copy of each segment we need to replace it and move that data on to the new node. What's more, the fourth node can't have a copy of itself that would be problematic in case it went down. Instead, what we need is we need that buddy to be sitting on another node, a neighboring node. So we need to re-orient the buddies as well. All of this takes a lot of time, it can take 12, 24 or even 36 hours in a period when you do not want your database under high demand. In fact, you may want to stop loading data altogether in order to speed it up. This is a planned event and your applications should probably be down during this period, which makes it difficult. With the advent of cloud computing, we saw that services were coming up and down faster and we determined to re-architect Vertica in a way to accommodate that rapid scaling. Let's see how we did it. So let's start with four nodes now and we've got our four nodes database. Let's add communal storage and move each of the segments of data into communal storage. Now that's the separation that we're talking about. What happens if we run queries against it? Well, it turns out that the communal storage is not necessarily performing and so the IO would be slow, which would make the overall queries slow. In order to compensate for the low performance of communal storage, we need to add back local storage, now it doesn't have to be raid because this is just an ephemeral copy but with the data files, local to the node, the queries will run much faster. In AWS, communal storage really does mean an S3 bucket and here's a simplified version of the diagram. Now, do we need to store all of the data from the segment in the depot? The answer is no and the graphics inside the bucket has changed to reflect that. It looks more like a bullseye, showing just a segment of the data being copied to the cache or to the depot, as we call it on each one of the nodes. How much data do you store on the node? Well, it would be the active data set, the last 30 days, the last 30 minutes or the last. Whatever period of time you're working with. The active working set is the hot data and that's how large you want to size your depot. By architecting this way, when you scale up, you're not re-segmenting the database. What you're doing, is you're adding more compute and more subscriptions to the existing shards of the existing database. So in this case, we've added a complete set of four nodes. So we've doubled our capacity and we've doubled our subscriptions, which means that now, the two nodes can serve the yellow shard, two nodes can serve the red shard and so on. In this way, we're able to run twice as many queries in the same amount of time. So you're doubling the concurrency. How high can you scale? Well, can you scale to 3X, 5X? We tested this in the graphics on the right, which shows concurrent users in the X axis by the number of queries executed in a minute along the Y axis. We've grouped execution in runs of 10 users, 30 users, 50, 70 up to 150 users. Now focusing on any one of these groups, particularly up around 150. You can see through the three bars, starting with the bright purple bar, three nodes and three segments. That as you add nodes to the middle purple bar, six nodes and three segments, you've almost doubled your throughput up to the dark purple bar which is nine nodes and three segments and our tests show that you can go to 5X with pretty linear performance increase. Beyond that, you do continue to get an increase in performance but your incremental performance begins to fall off. Eon architecture does something else for us and that is it provides high availability because each of the nodes can be thought of as ephemeral and in fact, each node has a buddy subscription in a way similar to the prior architecture. So if we lose node four, we're losing the node responsible for the red shard and now node one has to pick up responsibility for the red shard while that node is down. When a query comes in, and let's say it comes into one and one is the initiator then one will look for participants, it'll find a blue shard and a green shard but when it's looking for the red, it finds itself and so the node number one will be doing double duty. This means that your performance will be cut in half approximately, for the query. This is acceptable until you are able to restore the node. Once you restore it and once the depot becomes rehydrated, then your performance goes back to normal. So this is a much simpler way to recover nodes in the event of node failure. By comparison, Enterprise Mode the older architecture. When we lose the fourth node, node one takes over responsibility for the first shard and the yellow shard and the red shard. But it also is responsible for rehydrating the entire data segment of the red shard to node four, this can be very time consuming and imposes even more stress on the first node. So performance will go down even further. Eon Mode has another feature and that is you can scale down completely to zero. We call this hibernation, you shut down your database and your database will maintain full consistency in a rest state in your S3 bucket and then when you need access to your database again, you simply recreate your cluster and revive your database and you can access your database once again. That concludes the rapid scaling portion of, why we created Eon Mode. To take us through workload isolation is Yuanzhe Bei, Yuanzhe. >> Yuanzhe: Thanks Dave, for presenting how Eon works in general. In the next section, I will show you another important capability of Vertica Eon Mode, the workload isolation. Dave used a pecan pie as an example of database. Now let's say it's time for the main course. Does anyone still have a problem with food touching on their plates. Parents know that it's a common problem for kids. Well, we have a similar problem in database as well. So there could be multiple different workloads accessing your database at the same time. Say you have ETL jobs running regularly. While at the same time, there are dashboards running short queries against your data. You may also have the end of month report running and their can be ad hoc data scientists, connect to the database and do whatever the data analysis they want to do and so on. How to make these mixed workload requests not interfere with each other is a real challenge for many DBAs. Vertica Eon Mode provides you the solution. I'm very excited here to introduce to you to the important concept in Eon Mode called subclusters. In Eon Mode, nodes they belong to the predefined subclusters rather than the whole cluster. DBAs can define different subcluster for different kinds of workloads and it redirects those workloads to the specific subclusters. For example, you can have an ETL subcluster, dashboard subcluster, report subcluster and the analytic machine learning subcluster. Vertica Eon subcluster is designed to achieve the three main goals. First of all, strong workload isolation. That means any operation in one subcluster should not affect or be affected by other subclusters. For example, say the subcluster running the report is quite overloaded and already there can be, the data scienctists running crazy analytic jobs, machine learning jobs on the analytics subcluster and making it very slow, even stuck or crash or whatever. In such scenario, your ETL and dashboards subcluster should not be or at least very minimum be impacted by this crisis and which means your ETL job which should not lag behind and dashboard should respond timely. We have done a lot of improvements as of 10.0 release and will continue to deliver improvements in this category. Secondly, fully customized subcluster settings. That means any subcluster can be set up and tuned for very different workloads without affecting other subclusters. Users should be able to tune up, tune down, certain parameters based on the actual needs of the individual subcluster workload requirements. As of today, Vertica already supports few settings that can be done at the subcluster level for example, the depot pinning policy and then we will continue extending more that is like resource pools (mumbles) in the near future. Lastly, Vertica subclusters should be easy to operate and cost efficient. What it means is that the subcluster should be able to turn on, turn off, add or remove or should be available for use according to rapid changing workloads. Let's say in this case, you want to spin up more dashboard subclusters because we need higher scores report, we can do that. You might need to run several report subclusters because you might want to run multiple reports at the same time. While on the other hand, you can shut down your analytic machine learning subcluster because no data scientists need to use it at this moment. So we made automate a lot of change, the improvements in this category, which I'll explain in detail later and one of the ultimate goal is to support auto scaling To sum up, what we really want to deliver for subcluster is very simple. You just need to remember that accessing subclusters should be just like accessing individual clusters. Well, these subclusters do share the same catalog. So you don't have to work out the stale data and don't need to worry about data synchronization. That'd be a nice goal, Vertica upcoming 10.0 release is certainly a milestone towards that goal, which will deliver a large part of the capability in this direction and then we will continue to improve it after 10.0 release. In the next couple of slides, I will highlight some issues about workload isolation in the initial Eon release and show you how we resolve these issues. First issue when we initially released our first or so called subcluster mode, it was implemented using fault groups. Well, fault groups and the subcluster have something in common. Yes, they are both defined as a set of nodes. However, they are very different in all the other ways. So, that was very confusing in the first place, when we implement this. As of 9.3.0 version, we decided to detach subcluster definition from the fault groups, which enabled us to further extend the capability of subclusters. Fault groups in the pre 9.3.0 versions will be converted into subclusters during the upgrade and this was a very important step that enabled us to provide all the amazing, following improvements on subclusters. The second issue in the past was that it's hard to control the execution groups for different types of workloads. There are two types of problems here and I will use some example to explain. The first issue is about control group size. There you allocate six nodes for your dashboard subcluster and what you really want is on the left, the three pairs of nodes as three execution groups, and each pair of nodes will need to subscribe to all the four shards. However, that's not really what you get. What you really get is there on the right side that the first four nodes subscribed to one shard each and the rest two nodes subscribed to two dangling shards. So you won't really get three execusion groups but instead only get one and two extra nodes have no value at all. The solution is to use subclusters. So instead of having a subcluster with six nodes, you can split it up into three smaller ones. Each subcluster will guarantee to subscribe to all the shards and you can further handle this three subcluster using load balancer across them. In this way you achieve the three real exclusion groups. The second issue is that the session participation is non-deterministic. Any session will just pick four random nodes from the subcluster as long as this covers one shard each. In other words, you don't really know which set of nodes will make up your execution group. What's the problem? So in this case, the fourth node will be doubled booked by two concurrent sessions. And you can imagine that the resource usage will be imbalanced and both queries performance will suffer. What is even worse is that these queries of the two concurrent sessions target different table They will cause the issue, that depot efficiency will be reduced, because both session will try to fetch the files on to two tables into the same depot and if your depot is not large enough, they will evict each other, which will be very bad. To solve this the same way, you can solve this by declaring subclusters, in this case, two subclusters and a load balancer group across them. The reason it solved the problem is because the session participation would not go across the boundary. So there won't be a case that any node is double booked and in terms of the depot and if you use the subcluster and avoid using a load balancer group, and carefully send the first workload to the first subcluster and the second to the second subcluster and then the result is that depot isolation is achieved. The first subcluster will maintain the data files for the first query and you don't need to worry about the file being evicted by the second kind of session. Here comes the next issue, it's the scaling down. In the old way of defining subclusters, you may have several execution groups in the subcluster. You want to shut it down, one or two execution groups to save cost. Well, here comes the pain, because you don't know which nodes may be used by which session at any point, it is hard to find the right timing to hit the shutdown button of any of the instances. And if you do and get unlucky, say in this case, you pull the first four nodes, one of the session will fail because it's participating in the node two and node four at that point. User of that session will notice because their query fails and we know that for many business this is critical problem and not acceptable. Again, with subclusters this problem is resolved. Same reason, session cannot go across the subcluster boundary. So all you need to do is just first prevent query sent to the first subcluster and then you can shut down the instances in that subcluster. You are guaranteed to not break any running sessions. Now, you're happy and you want to shut down more subclusters then you hit the issue four, the whole cluster will go down, why? Because the cluster loses quorum. As a distributed system, you need to have at least more than half of a node to be up in order to commit and keep the cluster up. This is to prevent the catalog diversion from happening, which is important. But do you still want to shut down those nodes? Because what's the point of keeping those nodes up and if you are not using them and let them cost you money right. So Vertica has a solution, you can define a subcluster as secondary to allow them to shut down without worrying about quorum. In this case, you can define the first three subclusters as secondary and the fourth one as primary. By doing so, this secondary subclusters will not be counted towards the quorum because we changed the rule. Now instead of requiring more than half of node to be up, it only require more than half of the primary node to be up. Now you can shut down your second subcluster and even shut down your third subcluster as well and keep the remaining primary subcluster to be still running healthily. There are actually more benefits by defining secondary subcluster in addition to the quorum concern, because the secondary subclusters no longer have the voting power, they don't need to persist catalog anymore. This means those nodes are faster to deploy, and can be dropped and re-added. Without the worry about the catalog persistency. For the most the subcluster that only need to read only query, it's the best practice to define them as secondary. The commit will be faster on this secondary subcluster as well, so running this query on the secondary subcluster will have less spikes. Primary subcluster as usual handle everything is responsible for consistency, the background tasks will be running. So DBAs should make sure that the primary subcluster is stable and assume is running all the time. Of course, you need to at least one primary subcluster in your database. Now with the secondary subcluster, user can start and stop as they need, which is very convenient and this further brings up another issue is that if there's an ETL transaction running and in the middle, a subcluster starting and it become up. In older versions, there is no catalog resync mechanism to keep the new subcluster up to date. So Vertica rolls back to ETL session to keep the data consistency. This is actually quite disruptive because real world ETL workloads can sometimes take hours and rolling back at the end means, a large waste of resources. We resolved this issue in 9.3.1 version by introducing a catalog resync mechanism when such situation happens. ETL transactions will not roll back anymore, but instead will take some time to resync the catalog and commit and the problem is resolved. And last issue I would like to talk about is the subscription. Especially for large subcluster when you start it, the startup time is quite long, because the subscription commit used to be serialized. In one of the in our internal testing with large catalogs committing a subscription, you can imagine it takes five minutes. Secondary subcluster is better, because it doesn't need to persist the catalog during the commit but still take about two seconds to commit. So what's the problem here? Let's do the math and look at this chart. The X axis is the time in the minutes and the Y axis is the number of nodes to be subscribed. The dark blues represents your primary subcluster and light blue represents the secondary subcluster. Let's say the subcluster have 16 nodes in total and if you start a secondary subcluster, it will spend about 30 seconds in total, because the 2 seconds times 16 is 32. It's not actually that long time. but if you imagine that starting secondary subcluster, you expect it to be super fast to react to the fast changing workload and 30 seconds is no longer trivial anymore and what is even worse is on the primary subcluster side. Because the commit is much longer than five minutes let's assume, then at the point, you are committing to six nodes subscription all other nodes already waited for 30 minutes for GCLX or we know the global catalog lock, and the Vertica will crash the nodes, if any node cannot get the GCLX for 30 minutes. So the end result is that your whole database crashed. That's a serious problem and we know that and that's why we are already planning for the fix, for the 10.0, so that all the subscription will be batched up and all the nodes will commit at the same time concurrently. And by doing that, you can imagine the primary subcluster can finish commiting in five minutes instead of crashing and the secondary subcluster can be finished even in seconds. That summarizes the highlights for the improvements we have done as of 10.0, and I hope you already get excited about Emerging Eon Deployment Pattern that's shown here. A primary subcluster that handles data loading, ETL jobs and tuple mover jobs is the backbone of the database and you keep it running all the time. At the same time defining different secondary subcluster for different workloads and provision them when the workload requirement arrives and then de-provision them when the workload is done to save the operational cost. So can't wait to play with the subcluster. Here as are some Admin Tools command you can start using. And for more details, check out our Eon subcluster documentation for more details. And thanks everyone for listening and I'll head back to Dave to talk about the Eon on-prem. >> David: Thanks Yuanzhe. At the same time that Yuanzhe and the rest of the dev team were working on the improvements that Yuanzhe described in and other improvements. This guy, John Yovanovich, stood on stage and told us about his deployment at at&t where he was running Eon Mode on-prem. Now this was only six months after we had launched Eon Mode on AWS. So when he told us that he was putting it into production on-prem, we nearly fell out of our chairs. How is this possible? We took a look back at Eon and determined that the workload isolation and the improvement to the operations for restoring nodes and other things had sufficient value that John wanted to run it on-prem. And he was running it on the Pure Storage FlashBlade. Taking a second look at the FlashBlade we thought alright well, does it have the performance? Yes, it does. The FlashBlade is a collection of individual blades, each one of them with NVMe storage on it, which is not only performance but it's scalable and so, we then asked is it durable? The answer is yes. The data safety is implemented with the N+2 redundancy which means that up to two blades can fail and the data remains available. And so with this we realized DBAs can sleep well at night, knowing that their data is safe, after all Eon Mode outsources the durability to the communal storage data store. Does FlashBlade have the capacity for growth? Well, yes it does. You can start as low as 120 terabytes and grow as high as about eight petabytes. So it certainly covers the range for most enterprise usages. And operationally, it couldn't be easier to use. When you want to grow your database. You can simply pop new blades into the FlashBlade unit, and you can do that hot. If one goes bad, you can pull it out and replace it hot. So you don't have to take your data store down and therefore you don't have to take Vertica down. Knowing all of these things we got behind Pure Storage and partnered with them to implement the first version of Eon on-premise. That changed our roadmap a little bit. We were imagining it would start with Amazon and then go to Google and then to Azure and at some point to Alibaba cloud, but as you can see from the left column, we started with Amazon and went to Pure Storage. And then from Pure Storage, we went to Minio and we launched Eon Mode on Minio at the end of last year. Minio is a little bit different than Pure Storage. It's software only, so you can run it on pretty much any x86 servers and you can cluster them with storage to serve up an S3 bucket. It's a great solution for up to about 120 terabytes Beyond that, we're not sure about performance implications cause we haven't tested it but for your dev environments or small production environments, we think it's great. With Vertica 10, we're introducing Eon Mode on Google Cloud. This means not only running Eon Mode in the cloud, but also being able to launch it from the marketplace. We're also offering Eon Mode on HDFS with version 10. If you have a Hadoop environment, and you want to breathe new fresh life into it with the high performance of Vertica, you can do that starting with version 10. Looking forward we'll be moving Eon mode to Microsoft Azure. We expect to have something breathing in the fall and offering it to select customers for beta testing and then we expect to release it sometime in 2021 Following that, further on horizon is Alibaba cloud. Now, to be clear we will be putting, Vertica in Enterprise Mode on Alibaba cloud in 2020 but Eon Mode is going to trail behind whether it lands in 2021 or not, we're not quite sure at this point. Our goal is to deliver Eon Mode anywhere you want to run it, on-prem or in the cloud, or both because that is one of the great value propositions of Vertica is the hybrid capability, the ability to run in both your on prem environment and in the cloud. What's next, I've got three priority and roadmap slides. This is the first of the three. We're going to start with improvements to the core of Vertica. Starting with query crunching, which allows you to run long running queries faster by getting nodes to collaborate, you'll see that coming very soon. We'll be making improvements to large clusters and specifically large cluster mode. The management of large clusters over 60 nodes can be tedious. We intend to improve that. In part, by creating a third network channel to offload some of the communication that we're now loading onto our spread or agreement protocol. We'll be improving depot efficiency. We'll be pushing down more controls to the subcluster level, allowing you to control your resource pools at the subcluster level and we'll be pairing tuple moving with data loading. From an operational flexibility perspective, we want to make it very easy to shut down and revive primaries and secondaries on-prem and in the cloud. Right now, it's a little bit tedious, very doable. We want to make it as easy as a walk in the park. We also want to allow you to be able to revive into a different size subcluster and last but not least, in fact, probably the most important, the ability to change shard count. This has been a sticking point for a lot of people and it puts a lot of pressure on the early decision of how many shards should my database be? Whether it's in 2020 or 2021. We know it's important to you so it's important to us. Ease of use is also important to us and we're making big investments in the management console, to improve managing subclusters, as well as to help you manage your load balancer groups. We also intend to grow and extend Eon Mode to new environments. Now we'll take questions and answers

Published Date : Mar 30 2020

SUMMARY :

and our engineering team is planning to join the forums and going to the third tick, you can see that and the second to the second subcluster and the improvement to the

ENTITIES

Entity	Category	Confidence
David Sprogis	PERSON	0.99+
David	PERSON	0.99+
one	QUANTITY	0.99+
Dave	PERSON	0.99+
John Yovanovich	PERSON	0.99+
10 users	QUANTITY	0.99+
Paige Roberts	PERSON	0.99+
Vertica	ORGANIZATION	0.99+
Yuanzhe Bei	PERSON	0.99+
John	PERSON	0.99+
five minutes	QUANTITY	0.99+
2020	DATE	0.99+
Amazon	ORGANIZATION	0.99+
30 seconds	QUANTITY	0.99+
50	QUANTITY	0.99+
second issue	QUANTITY	0.99+
12	QUANTITY	0.99+
Yuanzhe	PERSON	0.99+
120 terabytes	QUANTITY	0.99+
30 users	QUANTITY	0.99+
two types	QUANTITY	0.99+
2021	DATE	0.99+
Paige	PERSON	0.99+
30 minutes	QUANTITY	0.99+
three pairs	QUANTITY	0.99+
second	QUANTITY	0.99+
first	QUANTITY	0.99+
nine nodes	QUANTITY	0.99+
first subcluster	QUANTITY	0.99+
two tables	QUANTITY	0.99+
two nodes	QUANTITY	0.99+
first issue	QUANTITY	0.99+
each copy	QUANTITY	0.99+
2 seconds	QUANTITY	0.99+
36 hours	QUANTITY	0.99+
second subcluster	QUANTITY	0.99+
fourth node	QUANTITY	0.99+
each	QUANTITY	0.99+
six nodes	QUANTITY	0.99+
third subcluster	QUANTITY	0.99+
both	QUANTITY	0.99+
twice	QUANTITY	0.99+
First issue	QUANTITY	0.99+
three segments	QUANTITY	0.99+
today	DATE	0.99+
three bars	QUANTITY	0.99+
24	QUANTITY	0.99+
5X	QUANTITY	0.99+
Today	DATE	0.99+
16 nodes	QUANTITY	0.99+
Alibaba	ORGANIZATION	0.99+
each segment	QUANTITY	0.99+
first node	QUANTITY	0.99+
three slices	QUANTITY	0.99+
Each subcluster	QUANTITY	0.99+
each nodes	QUANTITY	0.99+
three nodes	QUANTITY	0.99+
AWS	ORGANIZATION	0.99+
two subclusters	QUANTITY	0.98+
three servers	QUANTITY	0.98+
four shards	QUANTITY	0.98+
3X	QUANTITY	0.98+
three	QUANTITY	0.98+
two concurrent sessions	QUANTITY	0.98+

Dan Woicke, Cerner Corporation | Virtual Vertica BDC 2020

(gentle electronic music) >> Hello, everybody, welcome back to the Virtual Vertica Big Data Conference. My name is Dave Vellante and you're watching theCUBE, the leader in digital coverage. This is the Virtual BDC, as I said, theCUBE has covered every Big Data Conference from the inception, and we're pleased to be a part of this, even though it's challenging times. I'm here with Dan Woicke, the senior director of CernerWorks Engineering. Dan, good to see ya, how are things where you are in the middle of the country? >> Good morning, challenging times, as usual. We're trying to adapt to having the kids at home, out of school, trying to figure out how they're supposed to get on their laptop and do virtual learning. We all have to adapt to it and figure out how to get by. >> Well, it sure would've been my pleasure to meet you face to face in Boston at the Encore Casino, hopefully next year we'll be able to make that happen. But let's talk about Cerner and CernerWorks Engineering, what is that all about? >> So, CernerWorks Engineering, we used to be part of what's called IP, or Intellectual Property, which is basically the organization at Cerner that does all of our software development. But what we did was we made a decision about five years ago to organize my team with CernerWorks which is the hosting side of Cerner. So, about 80% of our clients choose to have their domains hosted within one of the two Kansas City data centers. We have one in Lee's Summit, in south Kansas City, and then we have one on our main campus that's a brand new one in downtown, north Kansas City. About 80, so we have about 27,000 environments that we manage in the Kansas City data centers. So, what my team does is we develop software in order to make it easier for us to monitor, manage, and keep those clients healthy within our data centers. >> Got it. I mean, I think of Cerner as a real advanced health tech company. It's the combination of healthcare and technology, the collision of those two. But maybe describe a little bit more about Cerner's business. >> So we have, like I said, 27,000 facilities across the world. Growing each day, thank goodness. And, our goal is to ensure that we reduce errors and we digitize the entire medical records for all of our clients. And we do that by having a consulting practice, we do that by having engineering, and then we do that with my team, which manages those particular clients. And that's how we got introduced to the Vertica side as well, when we introduced them about seven years ago. We were actually able to take a tremendous leap forward in how we manage our clients. And I'd be more than happy to talk deeper about how we do that. >> Yeah, and as we get into it, I want to understand, healthcare is all about outcomes, about patient outcomes and you work back from there. IT, for years, has obviously been a contributor but removed, and somewhat indirect from those outcomes. But, in this day and age, especially in an organization like yours, it really starts with the outcomes. I wonder if you could ratify that and talk about what that means for Cerner. >> Sorry, are you talking about medical outcomes? >> Yeah, outcomes of your business. >> So, there's two different sides to Cerner, right? There's the medical side, the clinical side, which is obviously our main practice, and then there's the side that I manage, which is more of the operational side. Both are very important, but they go hand in hand together. On the operational side, the goal is to ensure that our clinicians are on the system, and they don't know they're on the system, right? Things are progressing, doctors don't want to be on the system, trust me. My job is to ensure they're having the most seamless experience possible while they're on the EMR and have it just be one of their side jobs as opposed to taking their attention away from the patients. That make sense? >> Yeah it does, I mean, EMR and meaningful use, around the Affordable Care Act, really dramatically changed the unit. I mean, people had to demonstrate in order to get paid, and so that became sort of an unfunded mandate for folks and you really had to respond to that, didn't you? >> We did, we did that about three to four years ago. And we had to help our clients get through what's called meaningful use, there was different stages of meaningful use. And what we did, is we have the website called the Lights On Network which is free to all of our clients. Once you get onto the website the Lights On Network, you can actually show how you're measured and whether or not you're actually completing the different necessary tasks in order to get those payments for meaningful use. And it also allows you to see what your performance is on your domain, how the clinicians are doing on the system, how many hours they're spending on the system, how many orders they're executing. All of that is completely free and visible to our clients on the Lights On Network. And that's actually backed by some of the Vertica software that we've invested in. >> Yeah, so before we get into that, it sounds like your mission, really, is just great user experiences for the people that are on the network. Full stop. >> We do. So, one of the things that we invented about 10 years ago is called RTMS Timers. They're called Response Time Measurement System. And it started off as a way of us proving that clients are actually using the system, and now it's turned into more of a user outcomes. What we do is we collect 2.5 billion timers per day across all of our clients across the world. And every single one of those records goes to the Vertica platform. And then we've also developed a system on that which allows us in real time to go and see whether or not they're deviating from their normal. So we do baselines every hour of the week and then if they're deviating from those baselines, we can immediately call a service center and have them engage the client before they call in. >> So, Dan, I wonder if you could paint a picture. By the way, that's awesome. I wonder if you could paint a picture of your analytics environment. What does it look like? Maybe give us a sense of the scale. >> Okay. So, I've been describing how we operate, our remote hosted clients in the two Kansas City data centers, but all the software that we write, we also help our client hosted agents as well. Not only do we take care of what's going on at the Kansas City data center, but we do write software to ensure that all of clients are treated the same and we provide the same level of care and performance management across all those clients. So what we do is we have 90,000 agents that we have split across all these clients across the world. And every single hour, we're committing a billion rows to Vertica of operational data. So I talked a little bit about the RTMS timers, but we do things just like everyone else does for CPU, memory, Java Heap Stack. We can tell you how many concurrent users are on the system, I can tell you if there's an application that goes down unexpected, like a crash. I can tell you the response time from the network as most of us use Citrix at Cerner. And so what we do is we measure the amount of time it takes from the client side to PCs, it's sitting in the virtual data centers, sorry, in the hospitals, and then round trip to the Citrix servers that are sitting in the Kansas City data center. That's called the RTT, our round trip transactions. And what we've done is, over the last couple of years, what we've done is we've switched from just summarizing CPU and memory and all that high-level stuff, in order to go down to a user level. So, what are you doing, Dr. Smith, today? How many hours are you using the EMR? Have you experienced any slowness? Have you experienced any hourglass holding within your application? Have you experienced, unfortunately, maybe a crash? Have you experienced any slowness compared to your normal use case? And that's the step we've taken over the last few years, to go from summarization of high-level CPU memory, over to outcome metrics, which are what is really happening with a particular user. >> So, really granular views of how the system is being used and deep analytics on that. I wonder, go ahead, please. >> And, we weren't able to do that by summarizing things in traditional databases. You have to actually have the individual rows and you can't summarize information, you have to have individual metrics that point to exactly what's going on with a particular clinician. >> So, okay, the MPP architecture, the columnar store, the scalability of Vertica, that's what's key. That was my next question, let me take us back to the days of traditional RDBMS and then you brought in Vertica. Maybe you could give us a sense as to why, what that did for you, the before and after. >> Right. So, I'd been painting a picture going forward here about how traditionally, eight years ago, all we could do was summarize information. If CPU was going to go and jump up 8%, I could alarm the data center and say, hey, listen, CPU looks like it's higher, maybe an application's hanging more than it has been in the past. Things are a little slower, but I wouldn't be able to tell you who's affected. And that's where the whole thing has changed, when we brought Vertica in six years ago is that, we're able to take those 90,000 agents and commit a billion rows per hour operational data, and I can tell you exactly what's going on with each of our clinicians. Because you know, it's important for an entire domain to be healthy. But what about the 10 doctors that are experiencing frustration right now? If you're going to summarize that information and roll it up, you'll never know what those 10 doctors are experiencing and then guess what happens? They call the data center and complain, right? The squeaky wheels? We don't want that, we want to be able to show exactly who's experiencing a bad performance right now and be able to reach out to them before they call the help desk. >> So you're able to be proactive there, so you've gone from, Houston, we have a problem, we really can't tell you what it is, go figure it out, to, we see that there's an issue with these docs, or these users, and go figure that out and focus narrowly on where the problem is as opposed to trying to whack-a-mole. >> Exactly. And the other big thing that we've been able to do is corelation. So, we operate two gigantic data centers. And there's things that are shared, switches, network, shared storage, those things are shared. So if there is an issue that goes on with one of those pieces of equipment, it could affect multiple clients. Now that we have every row in Vertica, we have a new program in place called performance abnormality flags. And what we're able to do is provide a website in real time that goes through the entire stack from Citrix to network to database to back-end tier, all the way to the end-user desktop. And so if something was going to be related because we have a network switch going out of the data center or something's backing up slow, you can actually see which clients are on that switch, and, what we did five years ago before this, is we would deploy out five different teams to troubleshoot, right? Because five clients would call in, and they would all have the same problem. So, here you are having to spare teams trying to investigate why the same problem is happening. And now that we have all of the data within Vertica, we're able to show that in a real time fashion, through a very transparent dashboard. >> And so operational metrics throughout the stack, right? A game changer. >> It's very compact, right? I just label five different things, the stack from your end-user device all the way through the back-end to your database and all the way back. All that has to work properly, right? Including the network. >> How big is this, what are we talking about? However you measure it, terabytes, clusters. What can you share there? >> Sorry, you mean, the amount of data that we process within our data centers? >> Give us a fun fact. >> Absolute petabytes, yeah, for sure. And in Vertica right now we have two petabytes of data, and I purge it out every year, one year's worth of data within two different clusters. So we have to two different data centers I've been describing, what we've done is we've set Vertica up to be in both data centers, to be highly redundant, and then one of those is configured to do real-time analysis and corelation research, and then the other one is to provide service towards what I described earlier as our Lights On Network, so it's a very dedicated hardened cluster in one of our data centers to allow the Lights On Network to provide the transparency directly to our clients. So we want that one to be pristine, fast, and nobody touch it. As opposed to the other one, where, people are doing real-time, ad hoc queries, which sometimes aren't the best thing in the world. No matter what kind of database or how fast it is, people do bad things in databases and we just don't want that to affect what we show our clients in a transparent fashion. >> Yeah, I mean, for our audience, Vertica has always been aimed at these big, hairy, analytic problems, it's not for a tiny little data mart in a department, it's really the big scale problems. I wonder if I could ask you, so you guys, obviously, healthcare, with HIPAA and privacy, are you doing anything in the cloud, or is it all on-prem today? >> So, in the operational space that I manage, it's all on-premises, and that is changing. As I was describing earlier, we have an initiative to go to AWS and provide levels of service to countries like Sweden which does not want any operational data to leave that country's walls, whether it be operational data or whether it be PHI. And so, we have to be able to adapt into Vertia Eon Mode in order to provide the same services within Sweden. So obviously, Cerner's not going to go up and build a data center in every single country that requires us, so we're going to leverage our partnership with AWS to make this happen. >> Okay, so, I was going to ask you, so you're not running Eon Mode today, it's something that you're obviously interested in. AWS will allow you to keep the data locally in that region. In talking to a lot of practitioners, they're intrigued by this notion of being able to scale independently, storage from compute. They've said they wished that's a much more efficient way, I don't have to buy in chunks, if I'm out of storage, I don't have to buy compute, and vice-versa. So, maybe you could share with us what you're thinking, I know it's early days, but what's the logic behind the business case there? >> I think you're 100% correct in your assessment of taking compute away from storage. And, we do exactly what you say, we buy a server. And it has so much compute on it, and so much storage. And obviously, it's not scaled properly, right? Either storage runs out first or compute runs out first, but you're still paying big bucks for the entire server itself. So that's exactly why we're doing the POC right now for Eon Mode. And I sit on Vertica's TAB, the advisory board, and they've been doing a really good job of taking our requirements and listening to us, as to what we need. And that was probably number one or two on everybody's lists, was to separate storage from compute. And that's exactly what we're trying to do right now. >> Yeah, it's interesting, I've talked to some other customers that are on the customer advisory board. And Vertica is one of these companies that're pretty transparent about what goes on there. And I think that for the early adopters of Eon Mode there were some challenges with getting data into the new system, I know Vertica has been working on that very hard but you guys push Vertica pretty hard and from what I can tell, they listen. Your thoughts. >> They do listen, they do a great job. And even though the Big Data Conference is canceled, they're committed to having us go virtually to the CAD meeting on Monday, so I'm looking forward to that. They do listen to our requirements and they've been very very responsive. >> Nice. So, I wonder if you could give us some final thoughts as to where you want to take this thing. If you look down the road a year or two, what does success look like, Dan? >> That's a good question. Success means that we're a little bit more nimble as far as the different regions across the world that we can provide our services to. I want to do more corelation. I want to gather more information about what users are actually experiencing. I want to be able to have our phone never ring in our data center, I know that's a grand thought there. But I want to be able to look forward to measuring the data internally and reaching out to our clients when they have issues and then doing the proper corelation so that I can understand how things are intertwining if multiple clients are having an issue. That's the goal going forward. >> Well, in these trying times, during this crisis, it's critical that your operations are running smoothly. The last thing that organizations need right now, especially in healthcare, is disruption. So thank you for all the hard work that you and your teams are doing. I wish you and your family all the best. Stay safe, stay healthy, and thanks so much for coming on theCUBE. >> I really appreciate it, thanks for the opportunity. >> You're very welcome, and thank you, everybody, for watching, keep it right there, we'll be back with our next guest. This is Dave Vellante for theCUBE. Covering Virtual Vertica Big Data Conference. We'll be right back. (upbeat electronic music)

Published Date : Mar 31 2020

SUMMARY :

in the middle of the country? and figure out how to get by. been my pleasure to meet you and then we have one on our main campus and technology, the and then we do that with my team, Yeah, and as we get into it, the goal is to ensure that our clinicians in order to get paid, and so that became in order to get those for the people that are on the network. So, one of the things that we invented I wonder if you could paint a picture from the client side to PCs, of how the system is being used that point to exactly what's going on and then you brought in Vertica. and be able to reach out to them we really can't tell you what it is, And now that we have all And so operational metrics and all the way back. are we talking about? And in Vertica right now we in the cloud, or is it all on-prem today? So, in the operational I don't have to buy in chunks, and listening to us, as to what we need. that are on the customer advisory board. so I'm looking forward to that. as to where you want to take this thing. and reaching out to our that you and your teams are doing. thanks for the opportunity. and thank you, everybody,

ENTITIES

Entity	Category	Confidence
Dan Woicke	PERSON	0.99+
Dave Vellante	PERSON	0.99+
AWS	ORGANIZATION	0.99+
Cerner	ORGANIZATION	0.99+
Affordable Care Act	TITLE	0.99+
Boston	LOCATION	0.99+
100%	QUANTITY	0.99+
Dan	PERSON	0.99+
10 doctors	QUANTITY	0.99+
Sweden	LOCATION	0.99+
90,000 agents	QUANTITY	0.99+
five clients	QUANTITY	0.99+
CernerWorks	ORGANIZATION	0.99+
8%	QUANTITY	0.99+
two	QUANTITY	0.99+
Kansas City	LOCATION	0.99+
Smith	PERSON	0.99+
Vertica	ORGANIZATION	0.99+
Cerner Corporation	ORGANIZATION	0.99+
next year	DATE	0.99+
Monday	DATE	0.99+
Both	QUANTITY	0.99+
today	DATE	0.99+
one year	QUANTITY	0.99+
a year	QUANTITY	0.99+
27,000 facilities	QUANTITY	0.99+
Houston	LOCATION	0.99+
one	QUANTITY	0.99+
two petabytes	QUANTITY	0.99+
five years ago	DATE	0.99+
CernerWorks Engineering	ORGANIZATION	0.98+
south Kansas City	LOCATION	0.98+
eight years ago	DATE	0.98+
about 80%	QUANTITY	0.98+
Virtual Vertica Big Data Conference	EVENT	0.98+
Citrix	ORGANIZATION	0.98+
two different data centers	QUANTITY	0.97+
each day	QUANTITY	0.97+
four years ago	DATE	0.97+
two different clusters	QUANTITY	0.97+
six years ago	DATE	0.97+
each	QUANTITY	0.97+
north Kansas City	LOCATION	0.97+
HIPAA	TITLE	0.97+
five different teams	QUANTITY	0.97+
first	QUANTITY	0.96+
five different things	QUANTITY	0.95+
two different sides	QUANTITY	0.95+
about 27,000 environments	QUANTITY	0.95+
both data centers	QUANTITY	0.95+
About 80	QUANTITY	0.95+
Response Time Measurement System	OTHER	0.95+
two gigantic data centers	QUANTITY	0.93+
Java Heap	TITLE	0.92+

UNLIST TILL 4/2 - A Deep Dive into the Vertica Management Console Enhancements and Roadmap

>> Jeff: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "A Deep Dive "into the Vertica Mangement Console Enhancements and Roadmap." I'm Jeff Healey of Vertica Marketing. I'll be your host for this breakout session. Joining me are Bhavik Gandhi and Natalia Stavisky from Vertica engineering. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer them offline. Alternatively visit Vertica Forums at forum.vertica.com. Post your question there after the session. Our engineering team is planning to join the forums to keep the conversation going well after the event. Also, a reminder that you can maximize the screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to you on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Bhavik. >> Bhavik: All right. So hello, and welcome, everybody doing this presentation of "Deep Dive into the Vertica Management Console Enhancements and Roadmap." Myself, Bhavik, and my team member, Natalia Stavisky, will go over a few useful announcements on Vertica Management Console, discussing a few real scenarios. All right. So today we will go forward with the brief introduction about the Management Console, then we will discuss the benefits of using Management Console by going over a couple of user scenarios for the query taking too long to run and receiving email alerts from Management Console. Then we will go over a few MC features for what we call Eon Mode databases, like provisioning and reviving the Eon Mode databases from MC, managing the subcluster and understanding the Depot. Then we will go over some of the future announcements on MC that we are planning. All right, so let's get started. All right. So, do you want to know about how to provision a new Vertica cluster from MC? How to analyze and understand a database workload by monitoring the queries on the database? How do you balance the resource pools and use alerts and thresholds on MC? So, the Management Console is basically our answer and we'll talk about its capabilities and new announcements in this presentation. So just to give a brief overview of the Management Console, who uses Management Console, it's generally used by IT administrators and DB admins. Management Console can be used to monitor both Eon Mode and Enterprise Mode databases. Why to use Management Console? You can use Management Console for provisioning Vertica databases and cluster. You can manage the already existing Vertica databases and cluster you have, and you can use various tools on Management Console like query execution, Database Designer, Workload Analyzer, and set up alerts and thresholds to get notified by some of your activities on the MC. So let's go over a few benefits of using Management Console. Okay. So using Management Console, you can view and optimize resource pool usage. Management Console helps you to identify some critical conditions on your Vertica cluster. Additionally, you can set up various thresholds thresholds in MC and get other data if those thresholds are triggered on the database. So now let's dig into the couple of scenarios. So for the first scenario, we will discuss about queries taking too long and using workload analyzer to possibly help to solve the problem. In the second scenario, we will go over alert email that you received from your Management Console and analyzing the problem and taking required actions to solve the problem. So let's go over the scenario where queries are taking too long to run. So in this example, we have this one query that we are running using the query execution on MC. And for some reason we notice that it's taking about 14.8 seconds seconds to execute this query, which is higher than the expected run time of the query. The query that we are running happens to be the query used by MC during the extended monitoring. Notice that the table name and the schema name which is ds_requests_issued, and, is the schema used for extended monitoring. Now in 10.0 MC we have redesigned the Workload Analyzer and Recommendations feature to show the recommendations and allow you to execute those recommendations. In our example, we have taken the table name and figured the tuning descriptions to see if there are any tuning recommendations related to this table. As we see over here, there are three tuning recommendations available for that table. So now in 10.0 MC, you can select those recommendations and then run them. So let's run the recommendations. All right. So once recommendations are run successfully, you can go and see all the processed recommendations that you have run previously. Over here we see that there are three recommendations that we had selected earlier have successfully processed. Now we take the same query and run it on the query execution on MC and hey, it's running really faster and we see that it takes only 0.3 seconds to run the query and, which is about like 98% decrease in original runtime of the query. So in this example we saw that using a Workload Analyzer tool on MC you can possibly triage and solve issue for your queries which are taking to long to execute. All right. So now let's go over another user scenario where DB admin's received some alert email messages from MC and would like to understand and analyze the problem. So to know more about what's going on on the database and proactively react to the problems, DB admins using the Management Console can create set of thresholds and get alerted about the conditions on the database if the threshold values is reached and then respond to the problem thereafter. Now as a DB admin, I see some email message notifications from MC and upon checking the emails, I see that there are a couple of email alerts received from MC on my email. So one of the messages that I received was for Query Resource Rejections greater than 5, pool, midpool7. And then around the same time, I received another email from the MC for the Failed Queries greater than 5, and in this case I see there are 80 failed queries. So now let's go on the MC and investigate the problem. So before going into the deep investigation about failures, let's review the threshold settings on MC. So as we see, we have set up the thresholds under the database settings page for failed queries in the last 10 minutes greater than 5 and MC should send an email to the individual if the threshold is triggered. And also we have a threshold set up for queries and resource rejections in the last five minutes for midpool7 set to greater than 5. There are various other thresholds on this page that you can set if you desire to. Now let's go and triage those email alerts about the failed queries and resource rejections that we had received. To analyze the failed queries, let's take a look at the query statistics page on the database Overview page on MC. Let's take a look at the Resource Pools graph and especially for the failed queries for each resource pools. And over to the right under the failed query section, I see about like, in the last 24 hours, there are about 6,000 failed queries for midpool7. And now I switch to view to see the statistics for each user and on this page I see for User MaryLee on the right hand side there are a high number of failed queries in last 24 hours. And to know more about the failed queries for this user, I can click on the graph for this user and get the reasons behind it. So let's click on the graph and see what's going on. And so clicking on this graph, it takes me to the failed queries view on the Query Monitoring page for database, on Database activities tab. And over here, I see there are a high number of failed queries for this user, MaryLee, with the reasons stated as, exceeding high limit. To drill down more and to know more reasons behind it, I can click on the plus icon on the left hand side for each failed queries to get the failure reason for each node on the database. So let's do that. And clicking the plus icon, I see for the two nodes that are listed, over here it says there are insufficient resources like memory and file handles for midpool7. Now let's go and analyze the midpool7 configurations and activities on it. So to do so, I will go over to the Resource Pool Monitoring view and select midpool7. I see the resource allocations for this resource pool is very low. For example, the max memory is just 1MB and the max concurrency is set to 0. Hmm, that's very odd configuration for this resource pool. Also in the bottom right graph for the resource rejections for midpool7, the graph shows very high values for resource rejection. All right. So since we saw some odd configurations and odd resource allocations for midpool7, I would like to see when this resource, when the settings were changed on the resource pools. So to do this, I can preview the audit logs on, are available on the Management Console. So I can go onto the Vertica Audit Logs and see the logs for the resource pool. So I just (mumbles) for the logs and figuring the logs for midpool7. I see on February 17th, the memory and other attributes for midpool7 were modified. So now let's analyze the resource activity for midpool7 around the time when the configurations were changed. So in our case we are using extended monitoring on MC for this database, so we can go back in time and see the statistics over the larger time range for midpool7. So viewing the activities for midpool7 around February 17th, around the time when these configurations were changed, we see a decrease in resource pool usage. Also, on the bottom right, we see the resource rejections for this midpool7 have an increase, linear increase, after the configurations were changed. I can select a point on the graph to get the more details about the resource rejections. Now to analyze the effects of the modifications on midpool7. Let's go over to the Query Monitoring page. All right, I will adjust the time range around the time when the configurations were changed for midpool7 and completed activities queries for user MaryLee. And I see there are no completed queries for this user. Now I'm taking a look at the Failed Queries tab and adjusting the time range around the time when the configurations were changed. I can do so because we are using extended monitoring. So again, adjusting the time, I can see there are high number of failed queries for this user. There about about like 10,000 failed queries for this user after the configurations were changed on this resource pool. So now let's go and modify the settings since we know after the configurations were changed, this user was not able to run the queries. So you can change the resource pool settings of using Management Console's database settings page and under the Resource Pools tab. So selecting the midpool7, I see the same odd configurations for this resource pool that we saw earlier. So now let's go and modify it, the settings. So I will increase the max memory and modify the settings for midpool7 so that it has adequate resources to run the queries for the user. Hit apply on the right hand top to see the settings. Now let's do the validation after we change the resource pool attributes. So let's go over to the same query monitoring page and see if MaryLee user is able to run the queries for midpool7. We see that now, after the configuration, after the change, after we changed the configuration for midpool7, the user can run the queries successfully and the count for Completed Queries has increased after we modified the settings for this midpool7 resource pool. And also viewing the resource pool monitoring page, we can validate that after the new configurations for midpool7 has been applied and also the resource pool usage after the configuration change has increased. And also on the bottom right graph, we can see that the resource rejections for midpool7 has decreased over the time after we modified the settings. And since we are using extended monitoring for this database, I can see that the trend in data for these resource pools, the before and after effects of modifying the settings. So initially when the settings were changed, there were high resource rejections and after we again modified the settings, the resource rejections went down. Right. So now let's go work with the provisioning and reviving the Eon Mode Vertica database cluster using the Management Console on different platform. So Management Console supports provisioning and reviving of Eon Mode databases on various cloud environments like AWS, the Google Cloud Platform, and Pure Storage. So for Google, for provisioning the Vertica Management Console on Google Cloud Platform you can use launch a template. Or on AWS environment you can use the cloud formation templates available for different OS's. Once you have provisioned Vertica Management Console, you can provision the Vertica cluster and databases from MC itself. So you can provision a Vertica cluster, you can select the Create new database button available on the homepage. This will open up the wizard to create a new database and cluster. In this example, we are using we are using the Google Cloud Platform. So the wizard will ask me for varius authentication parameters for the Google Cloud Platform. And if you're on AWS, it'll ask you for the authentication parameters for the AWS environment. And going forward on the Wizard, it'll ask me to select the instance Type. I will select for the new Vertica cluster. And also provide the communal location url for my Eon Mode database and all the other preferences related to the new cluster. Once I have selected all the preferences for my new cluster I can preview the settings and I can hit, if I am, I can hit Create if all looks okay. So if I hit Create, this will create a new, MC will create a new GCP instances because we are on the GCP environment in this example. It will create a cluster on this instance, it'll create a Vertica Eon Mode Database on this cluster. And it will, additionally, you can load the test data on it if you like to. Now let's go over and revive the existing Eon Mode database from the communal location. So you can do it the same using the Management Console by selecting the Revive Eon Mode database button on the homepage. This will again open up the wizard for reviving the Eon Mode database. Again, in this example, since we are using GCP Platform, it will ask me for the Google Cloud storage authentication attributes. And for reviving, it will ask me for the communal location so I can enter the Google Storage bucket and my folder and it will discover all the Eon Mode databases located under this folder. And I can select one of the databases that I would like to revive. And it will ask me for other Vertica preferences and for this video, for this database reviving. And once I enter all the preferences and review all the preferences I can hit Revive the database button on the Wizard. So after I hit Revive database it will create the GCP instances. The number of GCP instances that I created would be seen as the number of hosts on the original Vertica cluster. It will install the Vertica cluster on this data, on this instances and it will revive the database and it will start the database. And after starting the database, it will be imported on the MC so you can start monitoring on it. So in this example, we saw you can provision and revive the Vertica database on the GCP Platform. Additionally, you can use AWS environment to provision and revive. So now since we have the Eon Mode database on MC, Natalia will go over some Eon Mode features on MC like managing subcluster and Depot activity monitoring. Over to you, Natalia. >> Natalia: Okay, thank you. Hello, my name is Natalia Stavisky. I am also a member of Vertica Management Console Team. And I will talk today about the work I did to allow users to manage subclusters using the Management Console, and also the work I did to help users understand what's going on in their Depot in the Vertica Eon Mode database. So let's look at the picture of the subclusters. On the Manage page of Vertica Management Console, you can see here is a page that has blue tabs, and the tab that's active is Subclusters. You can see that there are two subclusters are available in this database. And for each of the subclusters, you can see subcluster properties, whether this is the primary subcluster or secondary. In this case, primary is the default subcluster. It's indicated by a star. You can see what nodes belong to each subcluster. You can see the node state and node statistics. You can also easily add a new subcluster. And we're quickly going to do this. So once you click on the button, you'll launch the wizard that'll take you through the steps. You'll enter the name of the subcluster, indicate whether this is secondary or primary subcluster. I should mention that Vertica recommends having only one primary subcluster. But we have both options here available. You will enter the number of nodes for your subcluster. And once the subcluster has been created, you can manage the subcluster. What other options for managing subcluster we have here? You can scale up an existing subcluster and that's a similar approach, you launch the wizard and (mumbles) nodes. You want to add to your existing subcluster. You can scale down a subcluster. And MC validates requirements for maintaining minimal number of nodes to prevent database shutdown. So if you can not remove any nodes from a subcluster, this option will not be available. You can stop a subcluster. And depending on whether this is a primary subcluster or secondary subcluster, this option may be available or not available. Like in this picture, we can see that for the default subcluster this option is not available. And this is because shutting down the default subcluster will cause the database to shut down as well. You can terminate a subcluster. And again, the MC warns you not to terminate the primary subcluster and validates requirements for maintaining minimal number of nodes to prevent database shutdown. So now we are going to talk a little more about how the MC helps you to understand what's going on in your Depot. So Depot is one of the core of Eon Mode database. And what are the frequently asked questions about the Depot? Is the Depot size sufficient? Are a subset of users putting a high load on the database? What tables are fetched and evicted repeatedly, we call it "re-fetched," in Depot? So here in the Depot Activity Monitoring page, we now have four tabs that allow you to answer those questions. And we'll go a little more in detail through each of them, but I'll just mention what they are for now. At a Glance shows you basic Depot configuration and also shows you query executing. Depot Efficiency, we'll talk more about that and other tabs. Depot Content, that shows you what tables are currently in your Depot. And Depot Pinning allows you to see what pinning policies have been created and to create new pinning policies. Now let's go through a scenario. Monitoring performance of workloads on one subcluster. As you know, Eon Mode database allows you to have multiple subclusters and we'll explore how this feature is useful and how we can use the Management Console to make decisions regarding whether you would like to have multiple subclusters. So here we have, in my setup, a single subcluster called default_subcluster. It has two users that are running queries that are accessing tables, mostly in schema public. So the query started executing and we can see that after fetching tables from Communal, which is the red line, the rest of the time the queries are executing in Depot. The green line is indicating queries running in Depot. The all nodes Depot is about 88% full, a steady flow, and the depot size seems to be sufficient for query executions from Depot only. That's the good case scenario. Now at around 17 :15, user Sherry got an urgent request to generate a report. And at, she started running her queries. We can see that picture is quite different now. The tables Sherry is querying are in a different schema and are much larger. Now we can see multiple lines in different colors. We can see a bunch of fetches and evictions which are indicated by blue and purple bars, and a lot of queries are now spilling into Communal. This is the red and orange lines. Orange line is an indicator of a query running partially in Depot and partially getting fetched from Communal. And the red line is data fetched from Communal storage. Let's click on the, one of the lines. Each data point, each point on the line, it'll take you to the Query Details page where you can see more about what's going on. So this is the page that shows us what queries have been run in this particular time interval which is on top of this page in orange color. So that's about one minute time interval and now we can see user Sherry among the users that are running queries. Sherry's queries involve large tables and are running against a different schema. We can see the clickstream schema in the name of the, in part of the query request. So what is happening, there is not enough Depot space for both the schema that's already in use and the one Sherry needs. As a result, evictions and fetches have started occurring. What other questions we can ask ourself to help us understand what's going on? So how about, what tables are most frequently re-fetched? So for that, we will go to the Depot Efficiency page and look at the middle, the middle chart here. We can see the larger version of this chart if we expand it. So now we have 10 tables listed that are most frequently being re-fetched. We can see that there is a clickstream schema and there are other schemas so all of those tables are being used in the queries, fetched, and then there is not enough space in the Depot, they getting evicted and they get re-fetched again. So what can be done to enable all queries to run in Depot? Option one can be increase the Depot size. So we can do this by running the following queries, which (mumbles) which nodes and storage location and the new Depot size. And I should mention that we can run this query from the Management Console from the query execution page. So this would have helped us to increase the Depot size. What other options do we have, for example, when increasing Depot size is not an option? We can also provision a second subcluster to isolate workloads like Sherry's. So we are going to do this now and we will provision a second subcluster using the Manage page. Here we're creating subcluster for Sherry or for workloads like hers. And we're going to create a (mumbles). So Sherry's subcluster has been created. We can see it here, added to the list of the subclusters. It's a secondary subcluster. Sherry has been instructed to use the new SherrySubcluster for her work. Now let's see what happened. We'll go again at Depot Activity page and we'll look at the At a Glance tab. We can see that around >> 18: 07, Sherry switched to running her queries on SherrySubcluster. On top of this page, you can see subcluster selected. So we currently have two subclusters and I'm looking, what happened to SherrySubcluster once it has been provisioned? So Sherry started using it and the lines after initial fetching from Depot, which was from Communal, which was the red line, after that, all Sherry's queries fit in Depot, which is indicated by green line. Also the Depot is pretty full on those nodes, about 90% full. But the queries are processed efficiently, there is no spilling into Communal. So that's a good case scenario. Let's now go back and take a look at the original subcluster, default subcluster. So on the left portion of the chart we can see multiple lines, that was activity before Sherry switched to her own designated subcluster. At around 18:07, after Sherry switched from the subcluster to using her designated subcluster, there is no, she is no longer using the subcluster, she is not putting a load in it. So the lines after that are turning a green color, which means the queries that are still running in default subcluster are all running in Depot. We can also see that Depot fetches and evictions bars, those purple and blue bars, are no longer showing significant numbers. Also we can check the second chart that shows Communal Storage Access. And we can see that the bars have also dropped, so there is no significant access for Communal Storage. So this problem has been solved. Each of the subclusters are serving queries from Depot and that's our most efficient scenario. Let's also look at the other tabs that we have for Depot monitoring. Let's look at Depot Efficiency tab. It has six charts and I'll go through each one of them quickly. Files Reads by Location gives an indicator of where the majority of query execution took place in Depot or in Communal. Top 10 Re-Fetches into Depot, and imagine the charts earlier in our user case, it shows tables that are most frequently fetched and evicted and then fetched again. These are good candidates to get pinned if increasing Depot size is not an option. Note that both of these charts have an option to select time interval using calendar widget. So you can get the information about the activity that happened during that time interval. Depot Pinning shows what portion of your Depot is pinned, both by byte count and by table count. And the three tables at the bottom show Depot structure. How long tables stay in Depot, we would like tables to be fetched in Depot and stay there for a long time, how often they are accessed, again, the tables in Depot, we would like to see them accessed frequently, and what the size range of tables in Depot. Depot Content. This tab allows us to search for tables that are currently in Depot and also to see stats like table size in Depot. How often tables are accessed and when were they last accessed. And the same information that's available for tables in Depot is also available on projections and partition levels for those tables. Depot Pinning. This tab allows users to see what policies are currently existing and so you can do this by clicking on the first little button and click search. This'll show you all existing policies that are already created. The second option allows you to search for a table and create a policy. You can also use the action column to modify existing policies or delete them. And the third option provides details about most frequently re-fetched tables, including fetch count, total access count, and number of re-fetched bytes. So all this information can help to make decisions regarding pinning specific tables. So that's about it about the Depot. And I should mention that the server team also has a very good presentation on the, webinar, on the Eon Mode database Depot management and subcluster management. that strongly recommend it to attend or download the slide presentation. Let's talk quickly about the Management Console Roadmap, what we are planning to do in the future. So we are going to continue focusing on subcluster management, there is still a lot of things we can do here. Promoting/demoting subclusters. Load balancing across subclusters, scheduling subcluster actions, support for large cluster mode. We'll continue working on Workload Analyzer enhancement recommendation, on backup and restore from the MC. Building custom thresholds, and Eon on HDFS support. Okay, so we are ready now to take any questions you may have now. Thank you.

Published Date : Mar 30 2020

SUMMARY :

for the virtual Vertica BDC 2020. and all the other preferences related to the new cluster. and the depot size seems to be sufficient So on the left portion of the chart

ENTITIES

Entity	Category	Confidence
Natalia Stavisky	PERSON	0.99+
Sherry	PERSON	0.99+
MaryLee	PERSON	0.99+
Jeff Healey	PERSON	0.99+
Natalia	PERSON	0.99+
Jeff	PERSON	0.99+
February 17th	DATE	0.99+
second scenario	QUANTITY	0.99+
10 tables	QUANTITY	0.99+
forum.vertica.com	OTHER	0.99+
AWS	ORGANIZATION	0.99+
1MB	QUANTITY	0.99+
two users	QUANTITY	0.99+
first scenario	QUANTITY	0.99+
second option	QUANTITY	0.99+
Vertica	ORGANIZATION	0.99+
Bhavik	PERSON	0.99+
80 failed queries	QUANTITY	0.99+
today	DATE	0.99+
Depot	ORGANIZATION	0.99+
third	QUANTITY	0.99+
Each	QUANTITY	0.99+
six charts	QUANTITY	0.99+
both	QUANTITY	0.99+
each point	QUANTITY	0.99+
three recommendations	QUANTITY	0.99+
Today	DATE	0.99+
each	QUANTITY	0.99+
Google	ORGANIZATION	0.99+
Bhavik Gandhi	PERSON	0.99+
midpool7	TITLE	0.99+
two nodes	QUANTITY	0.99+
second chart	QUANTITY	0.99+
two subclusters	QUANTITY	0.98+
second subcluster	QUANTITY	0.98+
Each data point	QUANTITY	0.98+
each user	QUANTITY	0.98+
both options	QUANTITY	0.98+
4/2	DATE	0.98+
Eon	ORGANIZATION	0.97+
this week	DATE	0.97+
each subcluster	QUANTITY	0.97+
about 90%	QUANTITY	0.97+
three tables	QUANTITY	0.96+
0	QUANTITY	0.96+
about 14.8 seconds seconds	QUANTITY	0.96+
one subcluster	QUANTITY	0.95+

UNLIST TILL 4/2 - A Technical Overview of Vertica Architecture

>> Paige: Hello, everybody and thank you for joining us today on the Virtual Vertica BDC 2020. Today's breakout session is entitled A Technical Overview of the Vertica Architecture. I'm Paige Roberts, Open Source Relations Manager at Vertica and I'll be your host for this webinar. Now joining me is Ryan Role-kuh? Did I say that right? (laughs) He's a Vertica Senior Software Engineer. >> Ryan: So it's Roelke. (laughs) >> Paige: Roelke, okay, I got it, all right. Ryan Roelke. And before we begin, I want to be sure and encourage you guys to submit your questions or your comments during the virtual session while Ryan is talking as you think of them as you go along. You don't have to wait to the end, just type in your question or your comment in the question box below the slides and click submit. There'll be a Q and A at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to get back to you offline. Now, alternatively, you can visit the Vertica forums to post your question there after the session as well. Our engineering team is planning to join the forums to keep the conversation going, so you can have a chat afterwards with the engineer, just like any other conference. Now also, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides and before you ask, yes, this virtual session is being recorded and it will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now, let's get started. Over to you, Ryan. >> Ryan: Thanks, Paige. Good afternoon, everybody. My name is Ryan and I'm a Senior Software Engineer on Vertica's Development Team. I primarily work on improving Vertica's query execution engine, so usually in the space of making things faster. Today, I'm here to talk about something that's more general than that, so we're going to go through a technical overview of the Vertica architecture. So the intent of this talk, essentially, is to just explain some of the basic aspects of how Vertica works and what makes it such a great database software and to explain what makes a query execute so fast in Vertica, we'll provide some background to explain why other databases don't keep up. And we'll use that as a starting point to discuss an academic database that paved the way for Vertica. And then we'll explain how Vertica design builds upon that academic database to be the great software that it is today. I want to start by sharing somebody's approximation of an internet minute at some point in 2019. All of the data on this slide is generated by thousands or even millions of users and that's a huge amount of activity. Most of the applications depicted here are backed by one or more databases. Most of this activity will eventually result in changes to those databases. For the most part, we can categorize the way these databases are used into one of two paradigms. First up, we have online transaction processing or OLTP. OLTP workloads usually operate on single entries in a database, so an update to a retail inventory or a change in a bank account balance are both great examples of OLTP operations. Updates to these data sets must be visible immediately and there could be many transactions occurring concurrently from many different users. OLTP queries are usually key value queries. The key uniquely identifies the single entry in a database for reading or writing. Early databases and applications were probably designed for OLTP workloads. This example on the slide is typical of an OLTP workload. We have a table, accounts, such as for a bank, which tracks information for each of the bank's clients. An update query, like the one depicted here, might be run whenever a user deposits $10 into their bank account. Our second category is online analytical processing or OLAP which is more about using your data for decision making. If you have a hardware device which periodically records how it's doing, you could analyze trends of all your devices over time to observe what data patterns are likely to lead to failure or if you're Google, you might log user search activity to identify which links helped your users find the answer. Analytical processing has always been around but with the advent of the internet, it happened at scales that were unimaginable, even just 20 years ago. This SQL example is something you might see in an OLAP workload. We have a table, searches, logging user activity. We will eventually see one row in this table for each query submitted by users. If we want to find out what time of day our users are most active, then we could write a query like this one on the slide which counts the number of unique users running searches for each hour of the day. So now let's rewind to 2005. We don't have a picture of an internet minute in 2005, we don't have the data for that. We also don't have the data for a lot of other things. The term Big Data is not quite yet on anyone's radar and The Cloud is also not quite there or it's just starting to be. So if you have a database serving your application, it's probably optimized for OLTP workloads. OLAP workloads just aren't mainstream yet and database engineers probably don't have them in mind. So let's innovate. It's still 2005 and we want to try something new with our database. Let's take a look at what happens when we do run an analytic workload in 2005. Let's use as a motivating example a table of stock prices over time. In our table, the symbol column identifies the stock that was traded, the price column identifies the new price and the timestamp column indicates when the price changed. We have several other columns which, we should know that they're there, but we're not going to use them in any example queries. This table is designed for analytic queries. We're probably not going to make any updates or look at individual rows since we're logging historical data and want to analyze changes in stock price over time. Our database system is built to serve OLTP use cases, so it's probably going to store the table on disk in a single file like this one. Notice that each row contains all of the columns of our data in row major order. There's probably an index somewhere in the memory of the system which will help us to point lookups. Maybe our system expects that we will use the stock symbol and the trade time as lookup keys. So an index will provide quick lookups for those columns to the position of the whole row in the file. If we did have an update to a single row, then this representation would work great. We would seek to the row that we're interested in, finding it would probably be very fast using the in-memory index. And then we would update the file in place with our new value. On the other hand, if we ran an analytic query like we want to, the data access pattern is very different. The index is not helpful because we're looking up a whole range of rows, not just a single row. As a result, the only way to find the rows that we actually need for this query is to scan the entire file. We're going to end up scanning a lot of data that we don't need and that won't just be the rows that we don't need, there's many other columns in this table. Many information about who made the transaction, and we'll also be scanning through those columns for every single row in this table. That could be a very serious problem once we consider the scale of this file. Stocks change a lot, we probably have thousands or millions or maybe even billions of rows that are going to be stored in this file and we're going to scan all of these extra columns for every single row. If we tried out our stocks use case behind the desk for the Fortune 500 company, then we're probably going to be pretty disappointed. Our queries will eventually finish, but it might take so long that we don't even care about the answer anymore by the time that they do. Our database is not built for the task we want to use it for. Around the same time, a team of researchers in the North East have become aware of this problem and they decided to dedicate their time and research to it. These researchers weren't just anybody. The fruits of their labor, which we now like to call the C-Store Paper, was published by eventual Turing Award winner, Mike Stonebraker, along with several other researchers from elite universities. This paper presents the design of a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized. That sounds exactly like what we want for our stocks use case. Reasoning about what makes our queries executions so slow brought our researchers to the Memory Hierarchy, which essentially is a visualization of the relative speeds of different parts of a computer. At the top of the hierarchy, we have the fastest data units, which are, of course, also the most expensive to produce. As we move down the hierarchy, components get slower but also much cheaper and thus you can have more of them. Our OLTP databases data is stored in a file on the hard disk. We scanned the entirety of this file, even though we didn't need most of the data and now it turns out, that is just about the slowest thing that our query could possibly be doing by over two orders of magnitude. It should be clear, based on that, that the best thing we can do to optimize our query's execution is to avoid reading unnecessary data from the disk and that's what the C-Store researchers decided to look at. The key innovation of the C-Store paper does exactly that. Instead of storing data in a row major order, in a large file on disk, they transposed the data and stored each column in its own file. Now, if we run the same select query, we read only the relevant columns. The unnamed columns don't factor into the table scan at all since we don't even open the files. Zooming out to an internet scale sized data set, we can appreciate the savings here a lot more. But we still have to read a lot of data that we don't need to answer this particular query. Remember, we had two predicates, one on the symbol column and one on the timestamp column. Our query is only interested in AAPL stock, but we're still reading rows for all of the other stocks. So what can we do to optimize our disk read even more? Let's first partition our data set into different files based on the timestamp date. This means that we will keep separate files for each date. When we query the stocks table, the database knows all of the files we have to open. If we have a simple predicate on the timestamp column, as our sample query does, then the database can use it to figure out which files we don't have to look at at all. So now all of our disk reads that we have to do to answer our query will produce rows that pass the timestamp predicate. This eliminates a lot of wasteful disk reads. But not all of them. We do have another predicate on the symbol column where symbol equals AAPL. We'd like to avoid disk reads of rows that don't satisfy that predicate either. And we can avoid those disk reads by clustering all the rows that match the symbol predicate together. If all of the AAPL rows are adjacent, then as soon as we see something different, we can stop reading the file. We won't see any more rows that can pass the predicate. Then we can use the positions of the rows we did find to identify which pieces of the other columns we need to read. One technique that we can use to cluster the rows is sorting. So we'll use the symbol column as a sort key for all of the columns. And that way we can reconstruct a whole row by seeking to the same row position in each file. It turns out, having sorted all of the rows, we can do a bit more. We don't have any more wasted disk reads but we can still be more efficient with how we're using the disk. We've clustered all of the rows with the same symbol together so we don't really need to bother repeating the symbol so many times in the same file. Let's just write the value once and say how many rows we have. This one length encoding technique can compress large numbers of rows into a small amount of space. In this example, we do de-duplicate just a few rows but you can imagine de-duplicating many thousands of rows instead. This encoding is great for reducing the amounts of disk we need to read at query time, but it also has the additional benefit of reducing the total size of our stored data. Now our query requires substantially fewer disk reads than it did when we started. Let's recap what the C-Store paper did to achieve that. First, we transposed our data to store each column in its own file. Now, queries only have to read the columns used in the query. Second, we partitioned the data into multiple file sets so that all rows in a file have the same value for the partition column. Now, a predicate on the partition column can skip non-matching file sets entirely. Third, we selected a column of our data to use as a sort key. Now rows with the same value for that column are clustered together, which allows our query to stop reading data once it finds non-matching rows. Finally, sorting the data this way enables high compression ratios, using one length encoding which minimizes the size of the data stored on the disk. The C-Store system combined each of these innovative ideas to produce an academically significant result. And if you used it behind the desk of a Fortune 500 company in 2005, you probably would've been pretty pleased. But it's not 2005 anymore and the requirements of a modern database system are much stricter. So let's take a look at how C-Store fairs in 2020. First of all, we have designed the storage layer of our database to optimize a single query in a single application. Our design optimizes the heck out of that query and probably some similar ones but if we want to do anything else with our data, we might be in a bit of trouble. What if we just decide we want to ask a different question? For example, in our stock example, what if we want to plot all the trade made by a single user over a large window of time? How do our optimizations for the previous query measure up here? Well, our data's partitioned on the trade date, that could still be useful, depending on our new query. If we want to look at a trader's activity over a long period of time, we would have to open a lot of files. But if we're still interested in just a day's worth of data, then this optimization is still an optimization. Within each file, our data is ordered on the stock symbol. That's probably not too useful anymore, the rows for a single trader aren't going to be clustered together so we will have to scan all of the rows in order to figure out which ones match. You could imagine a worse design but as it becomes crucial to optimize this new type of query, then we might have to go as far as reconfiguring the whole database. The next problem of one of scale. One server is probably not good enough to serve a database in 2020. C-Store, as described, runs on a single server and stores lots of files. What if the data overwhelms this small system? We could imagine exhausting the file system's inodes limit with lots of small files due to our partitioning scheme. Or we could imagine something simpler, just filling up the disk with huge volumes of data. But there's an even simpler problem than that. What if something goes wrong and C-Store crashes? Then our data is no longer available to us until the single server is brought back up. A third concern, another one of scalability, is that one deployment does not really suit all possible things and use cases we could imagine. We haven't really said anything about being flexible. A contemporary database system has to integrate with many other applications, which might themselves have pretty restricted deployment options. Or the demands imposed by our workloads have changed and the setup you had before doesn't suit what you need now. C-Store doesn't do anything to address these concerns. What the C-Store paper did do was lead very quickly to the founding of Vertica. Vertica's architecture and design are essentially all about bringing the C-Store designs into an enterprise software system. The C-Store paper was just an academic exercise so it didn't really need to address any of the hard problems that we just talked about. But Vertica, the first commercial database built upon the ideas of the C-Store paper would definitely have to. This brings us back to the present to look at how an analytic query runs in 2020 on the Vertica Analytic Database. Vertica takes the key idea from the paper, can we significantly improve query performance by changing the way our data is stored and give its users the tools to customize their storage layer in order to heavily optimize really important or commonly wrong queries. On top of that, Vertica is a distributed system which allows it to scale up to internet-sized data sets, as well as have better reliability and uptime. We'll now take a brief look at what Vertica does to address the three inadequacies of the C-Store system that we mentioned. To avoid locking into a single database design, Vertica provides tools for the database user to customize the way their data is stored. To address the shortcomings of a single node system, Vertica coordinates processing among multiple nodes. To acknowledge the large variety of desirable deployments, Vertica does not require any specialized hardware and has many features which smoothly integrate it with a Cloud computing environment. First, we'll look at the database design problem. We're a SQL database, so our users are writing SQL and describing their data in SQL way, the Create Table statement. Create Table is a logical description of what your data looks like but it doesn't specify the way that it has to be stored, For a single Create Table, we could imagine a lot of different storage layouts. Vertica adds some extensions to SQL so that users can go even further than Create Table and describe the way that they want the data to be stored. Using terminology from the C-Store paper, we provide the Create Projection statement. Create Projection specifies how table data should be laid out, including column encoding and sort order. A table can have multiple projections, each of which could be ordered on different columns. When you query a table, Vertica will answer the query using the projection which it determines to be the best match. Referring back to our stock example, here's a sample Create Table and Create Projection statement. Let's focus on our heavily optimized example query, which had predicates on the stock symbol and date. We specify that the table data is to be partitioned by date. The Create Projection Statement here is excellent for this query. We specify using the order by clause that the data should be ordered according to our predicates. We'll use the timestamp as a secondary sort key. Each projection stores a copy of the table data. If you don't expect to need a particular column in a projection, then you can leave it out. Our average price query didn't care about who did the trading, so maybe our projection design for this query can leave the trader column out entirely. If the question we want to ask ever does change, maybe we already have a suitable projection, but if we don't, then we can create another one. This example shows another projection which would be much better at identifying trends of traders, rather than identifying trends for a particular stock. Next, let's take a look at our second problem, that one, or excuse me, so how should you decide what design is best for your queries? Well, you could spend a lot of time figuring it out on your own, or you could use Vertica's Database Designer tool which will help you by automatically analyzing your queries and spitting out a design which it thinks is going to work really well. If you want to learn more about the Database Designer Tool, then you should attend the session Vertica Database Designer- Today and Tomorrow which will tell you a lot about what the Database Designer does and some recent improvements that we have made. Okay, now we'll move to our next problem. (laughs) The challenge that one server does not fit all. In 2020, we have several orders of magnitude more data than we had in 2005. And you need a lot more hardware to crunch it. It's not tractable to keep multiple petabytes of data in a system with a single server. So Vertica doesn't try. Vertica is a distributed system so will deploy multiple severs which work together to maintain such a high data volume. In a traditional Vertica deployment, each node keeps some of the data in its own locally-attached storage. Data is replicated so that there is a redundant copy somewhere else in the system. If any one node goes down, then the data that it served is still available on a different node. We'll also have it so that in the system, there's no special node with extra duties. All nodes are created equal. This ensures that there is no single point of failure. Rather than replicate all of your data, Vertica divvies it up amongst all of the nodes in your system. We call this segmentation. The way data is segmented is another parameter of storage customization and it can definitely have an impact upon query performance. A common way to segment data is by using a hash expression, which essentially randomizes the node that a row of data belongs to. But with a guarantee that the same data will always end up in the same place. Describing the way data is segmented is another part of the Create Projection Statement, as seen in this example. Here we segment on the hash of the symbol column so all rows with the same symbol will end up on the same node. For each row that we load into the system, we'll apply our segmentation expression. The result determines which segment the row belongs to and then we'll send the row to each node which holds the copy of that segment. In this example, our projection is marked KSAFE 1, so we will keep one redundant copy of each segment. When we load a row, we might find that its segment had copied on Node One and Node Three, so we'll send a copy of the row to each of those nodes. If Node One is temporarily disconnected from the network, then Node Three can serve the other copy of the segment so that the whole system remains available. The last challenge we brought up from the C-Store design was that one deployment does not fit all. Vertica's cluster design neatly addressed many of our concerns here. Our use of segmentation to distribute data means that a Vertica system can scale to any size of deployment. And since we lack any special hardware or nodes with special purposes, Vertica servers can run anywhere, on premise or in the Cloud. But let's suppose you need to scale out your cluster to rise to the demands of a higher workload. Suppose you want to add another node. This changes the division of the segmentation space. We'll have to re-segment every row in the database to find its new home and then we'll have to move around any data that belongs to a different segment. This is a very expensive operation, not something you want to be doing all that often. Traditional Vertica doesn't solve that problem especially well, but Vertica Eon Mode definitely does. Vertica's Eon Mode is a large set of features which are designed with a Cloud computing environment in mind. One feature of this design is elastic throughput scaling, which is the idea that you can smoothly change your cluster size without having to pay the expenses of shuffling your entire database. Vertica Eon Mode had an entire session dedicated to it this morning. I won't say any more about it here, but maybe you already attended that session or if you haven't, then I definitely encourage you to listen to the recording. If you'd like to learn more about the Vertica architecture, then you'll find on this slide links to several of the academic conference publications. These four papers here, as well as Vertica Seven Years Later paper which describes some of the Vertica designs seven years after the founding and also a paper about the innovations of Eon Mode and of course, the Vertica documentation is an excellent resource for learning more about what's going on in a Vertica system. I hope you enjoyed learning about the Vertica architecture. I would be very happy to take all of your questions now. Thank you for attending this session.

Published Date : Mar 30 2020

SUMMARY :

A Technical Overview of the Vertica Architecture. Ryan: So it's Roelke. in the question box below the slides and click submit. that the best thing we can do

ENTITIES

Entity	Category	Confidence
Ryan	PERSON	0.99+
Mike Stonebraker	PERSON	0.99+
Ryan Roelke	PERSON	0.99+
2005	DATE	0.99+
2020	DATE	0.99+
thousands	QUANTITY	0.99+
2019	DATE	0.99+
$10	QUANTITY	0.99+
Paige Roberts	PERSON	0.99+
Vertica	ORGANIZATION	0.99+
Paige	PERSON	0.99+
Node Three	TITLE	0.99+
Today	DATE	0.99+
First	QUANTITY	0.99+
each file	QUANTITY	0.99+
Roelke	PERSON	0.99+
each row	QUANTITY	0.99+
Node One	TITLE	0.99+
millions	QUANTITY	0.99+
each hour	QUANTITY	0.99+
each	QUANTITY	0.99+
Second	QUANTITY	0.99+
second category	QUANTITY	0.99+
each column	QUANTITY	0.99+
One technique	QUANTITY	0.99+
one	QUANTITY	0.99+
two predicates	QUANTITY	0.99+
each node	QUANTITY	0.99+
One server	QUANTITY	0.99+
SQL	TITLE	0.99+
C-Store	TITLE	0.99+
second problem	QUANTITY	0.99+
Ryan Role	PERSON	0.99+
Third	QUANTITY	0.99+
North East	LOCATION	0.99+
each segment	QUANTITY	0.99+
today	DATE	0.98+
single entry	QUANTITY	0.98+
each date	QUANTITY	0.98+
Google	ORGANIZATION	0.98+
one row	QUANTITY	0.98+
one server	QUANTITY	0.98+
single server	QUANTITY	0.98+
single entries	QUANTITY	0.98+
both	QUANTITY	0.98+
20 years ago	DATE	0.98+
two paradigms	QUANTITY	0.97+
a day	QUANTITY	0.97+
this week	DATE	0.97+
billions of rows	QUANTITY	0.97+
Vertica	TITLE	0.97+
4/2	DATE	0.97+
single application	QUANTITY	0.97+
each query	QUANTITY	0.97+
Each projection	QUANTITY	0.97+

UNLIST TILL 4/2 - Vertica Big Data Conference Keynote

>> Joy: Welcome to the Virtual Big Data Conference. Vertica is so excited to host this event. I'm Joy King, and I'll be your host for today's Big Data Conference Keynote Session. It's my honor and my genuine pleasure to lead Vertica's product and go-to-market strategy. And I'm so lucky to have a passionate and committed team who turned our Vertica BDC event, into a virtual event in a very short amount of time. I want to thank the thousands of people, and yes, that's our true number who have registered to attend this virtual event. We were determined to balance your health, safety and your peace of mind with the excitement of the Vertica BDC. This is a very unique event. Because as I hope you all know, we focus on engineering and architecture, best practice sharing and customer stories that will educate and inspire everyone. I also want to thank our top sponsors for the virtual BDC, Arrow, and Pure Storage. Our partnerships are so important to us and to everyone in the audience. Because together, we get things done faster and better. Now for today's keynote, you'll hear from three very important and energizing speakers. First, Colin Mahony, our SVP and General Manager for Vertica, will talk about the market trends that Vertica is betting on to win for our customers. And he'll share the exciting news about our Vertica 10 announcement and how this will benefit our customers. Then you'll hear from Amy Fowler, VP of strategy and solutions for FlashBlade at Pure Storage. Our partnership with Pure Storage is truly unique in the industry, because together modern infrastructure from Pure powers modern analytics from Vertica. And then you'll hear from John Yovanovich, Director of IT at AT&T, who will tell you about the Pure Vertica Symphony that plays live every day at AT&T. Here we go, Colin, over to you. >> Colin: Well, thanks a lot joy. And, I want to echo Joy's thanks to our sponsors, and so many of you who have helped make this happen. This is not an easy time for anyone. We were certainly looking forward to getting together in person in Boston during the Vertica Big Data Conference and Winning with Data. But I think all of you and our team have done a great job, scrambling and putting together a terrific virtual event. So really appreciate your time. I also want to remind people that we will make both the slides and the full recording available after this. So for any of those who weren't able to join live, that is still going to be available. Well, things have been pretty exciting here. And in the analytic space in general, certainly for Vertica, there's a lot happening. There are a lot of problems to solve, a lot of opportunities to make things better, and a lot of data that can really make every business stronger, more efficient, and frankly, more differentiated. For Vertica, though, we know that focusing on the challenges that we can directly address with our platform, and our people, and where we can actually make the biggest difference is where we ought to be putting our energy and our resources. I think one of the things that has made Vertica so strong over the years is our ability to focus on those areas where we can make a great difference. So for us as we look at the market, and we look at where we play, there are really three recent and some not so recent, but certainly picking up a lot of the market trends that have become critical for every industry that wants to Win Big With Data. We've heard this loud and clear from our customers and from the analysts that cover the market. If I were to summarize these three areas, this really is the core focus for us right now. We know that there's massive data growth. And if we can unify the data silos so that people can really take advantage of that data, we can make a huge difference. We know that public clouds offer tremendous advantages, but we also know that balance and flexibility is critical. And we all need the benefit that machine learning for all the types up to the end data science. We all need the benefits that they can bring to every single use case, but only if it can really be operationalized at scale, accurate and in real time. And the power of Vertica is, of course, how we're able to bring so many of these things together. Let me talk a little bit more about some of these trends. So one of the first industry trends that we've all been following probably now for over the last decade, is Hadoop and specifically HDFS. So many companies have invested, time, money, more importantly, people in leveraging the opportunity that HDFS brought to the market. HDFS is really part of a much broader storage disruption that we'll talk a little bit more about, more broadly than HDFS. But HDFS itself was really designed for petabytes of data, leveraging low cost commodity hardware and the ability to capture a wide variety of data formats, from a wide variety of data sources and applications. And I think what people really wanted, was to store that data before having to define exactly what structures they should go into. So over the last decade or so, the focus for most organizations is figuring out how to capture, store and frankly manage that data. And as a platform to do that, I think, Hadoop was pretty good. It certainly changed the way that a lot of enterprises think about their data and where it's locked up. In parallel with Hadoop, particularly over the last five years, Cloud Object Storage has also given every organization another option for collecting, storing and managing even more data. That has led to a huge growth in data storage, obviously, up on public clouds like Amazon and their S3, Google Cloud Storage and Azure Blob Storage just to name a few. And then when you consider regional and local object storage offered by cloud vendors all over the world, the explosion of that data, in leveraging this type of object storage is very real. And I think, as I mentioned, it's just part of this broader storage disruption that's been going on. But with all this growth in the data, in all these new places to put this data, every organization we talk to is facing even more challenges now around the data silo. Sure the data silos certainly getting bigger. And hopefully they're getting cheaper per bit. But as I said, the focus has really been on collecting, storing and managing the data. But between the new data lakes and many different cloud object storage combined with all sorts of data types from the complexity of managing all this, getting that business value has been very limited. This actually takes me to big bet number one for Team Vertica, which is to unify the data. Our goal, and some of the announcements we have made today plus roadmap announcements I'll share with you throughout this presentation. Our goal is to ensure that all the time, money and effort that has gone into storing that data, all the data turns into business value. So how are we going to do that? With a unified analytics platform that analyzes the data wherever it is HDFS, Cloud Object Storage, External tables in an any format ORC, Parquet, JSON, and of course, our own Native Roth Vertica format. Analyze the data in the right place in the right format, using a single unified tool. This is something that Vertica has always been committed to, and you'll see in some of our announcements today, we're just doubling down on that commitment. Let's talk a little bit more about the public cloud. This is certainly the second trend. It's the second wave maybe of data disruption with object storage. And there's a lot of advantages when it comes to public cloud. There's no question that the public clouds give rapid access to compute storage with the added benefit of eliminating data center maintenance that so many companies, want to get out of themselves. But maybe the biggest advantage that I see is the architectural innovation. The public clouds have introduced so many methodologies around how to provision quickly, separating compute and storage and really dialing-in the exact needs on demand, as you change workloads. When public clouds began, it made a lot of sense for the cloud providers and their customers to charge and pay for compute and storage in the ratio that each use case demanded. And I think you're seeing that trend, proliferate all over the place, not just up in public cloud. That architecture itself is really becoming the next generation architecture for on-premise data centers, as well. But there are a lot of concerns. I think we're all aware of them. They're out there many times for different workloads, there are higher costs. Especially if some of the workloads that are being run through analytics, which tend to run all the time. Just like some of the silo challenges that companies are facing with HDFS, data lakes and cloud storage, the public clouds have similar types of siloed challenges as well. Initially, there was a belief that they were cheaper than data centers, and when you added in all the costs, it looked that way. And again, for certain elastic workloads, that is the case. I don't think that's true across the board overall. Even to the point where a lot of the cloud vendors aren't just charging lower costs anymore. We hear from a lot of customers that they don't really want to tether themselves to any one cloud because of some of those uncertainties. Of course, security and privacy are a concern. We hear a lot of concerns with regards to cloud and even some SaaS vendors around shared data catalogs, across all the customers and not enough separation. But security concerns are out there, you can read about them. I'm not going to jump into that bandwagon. But we hear about them. And then, of course, I think one of the things we hear the most from our customers, is that each cloud stack is starting to feel even a lot more locked in than the traditional data warehouse appliance. And as everybody knows, the industry has been running away from appliances as fast as it can. And so they're not eager to get locked into another, quote, unquote, virtual appliance, if you will, up in the cloud. They really want to make sure they have flexibility in which clouds, they're going to today, tomorrow and in the future. And frankly, we hear from a lot of our customers that they're very interested in eventually mixing and matching, compute from one cloud with, say storage from another cloud, which I think is something that we'll hear a lot more about. And so for us, that's why we've got our big bet number two. we love the cloud. We love the public cloud. We love the private clouds on-premise, and other hosting providers. But our passion and commitment is for Vertica to be able to run in any of the clouds that our customers choose, and make it portable across those clouds. We have supported on-premises and all public clouds for years. And today, we have announced even more support for Vertica in Eon Mode, the deployment option that leverages the separation of compute from storage, with even more deployment choices, which I'm going to also touch more on as we go. So super excited about our big bet number two. And finally as I mentioned, for all the hype that there is around machine learning, I actually think that most importantly, this third trend that team Vertica is determined to address is the need to bring business critical, analytics, machine learning, data science projects into production. For so many years, there just wasn't enough data available to justify the investment in machine learning. Also, processing power was expensive, and storage was prohibitively expensive. But to train and score and evaluate all the different models to unlock the full power of predictive analytics was tough. Today you have those massive data volumes. You have the relatively cheap processing power and storage to make that dream a reality. And if you think about this, I mean with all the data that's available to every company, the real need is to operationalize the speed and the scale of machine learning so that these organizations can actually take advantage of it where they need to. I mean, we've seen this for years with Vertica, going back to some of the most advanced gaming companies in the early days, they were incorporating this with live data directly into their gaming experiences. Well, every organization wants to do that now. And the accuracy for clickability and real time actions are all key to separating the leaders from the rest of the pack in every industry when it comes to machine learning. But if you look at a lot of these projects, the reality is that there's a ton of buzz, there's a ton of hype spanning every acronym that you can imagine. But most companies are struggling, do the separate teams, different tools, silos and the limitation that many platforms are facing, driving, down sampling to get a small subset of the data, to try to create a model that then doesn't apply, or compromising accuracy and making it virtually impossible to replicate models, and understand decisions. And if there's one thing that we've learned when it comes to data, prescriptive data at the atomic level, being able to show end of one as we refer to it, meaning individually tailored data. No matter what it is healthcare, entertainment experiences, like gaming or other, being able to get at the granular data and make these decisions, make that scoring applies to machine learning just as much as it applies to giving somebody a next-best-offer. But the opportunity has never been greater. The need to integrate this end-to-end workflow and support the right tools without compromising on that accuracy. Think about it as no downsampling, using all the data, it really is key to machine learning success. Which should be no surprise then why the third big bet from Vertica is one that we've actually been working on for years. And we're so proud to be where we are today, helping the data disruptors across the world operationalize machine learning. This big bet has the potential to truly unlock, really the potential of machine learning. And today, we're announcing some very important new capabilities specifically focused on unifying the work being done by the data science community, with their preferred tools and platforms, and the volume of data and performance at scale, available in Vertica. Our strategy has been very consistent over the last several years. As I said in the beginning, we haven't deviated from our strategy. Of course, there's always things that we add. Most of the time, it's customer driven, it's based on what our customers are asking us to do. But I think we've also done a great job, not trying to be all things to all people. Especially as these hype cycles flare up around us, we absolutely love participating in these different areas without getting completely distracted. I mean, there's a variety of query tools and data warehouses and analytics platforms in the market. We all know that. There are tools and platforms that are offered by the public cloud vendors, by other vendors that support one or two specific clouds. There are appliance vendors, who I was referring to earlier who can deliver package data warehouse offerings for private data centers. And there's a ton of popular machine learning tools, languages and other kits. But Vertica is the only advanced analytic platform that can do all this, that can bring it together. We can analyze the data wherever it is, in HDFS, S3 Object Storage, or Vertica itself. Natively we support multiple clouds on-premise deployments, And maybe most importantly, we offer that choice of deployment modes to allow our customers to choose the architecture that works for them right now. It still also gives them the option to change move, evolve over time. And Vertica is the only analytics database with end-to-end machine learning that can truly operationalize ML at scale. And I know it's a mouthful. But it is not easy to do all these things. It is one of the things that highly differentiates Vertica from the rest of the pack. It is also why our customers, all of you continue to bet on us and see the value that we are delivering and we will continue to deliver. Here's a couple of examples of some of our customers who are powered by Vertica. It's the scale of data. It's the millisecond response times. Performance and scale have always been a huge part of what we have been about, not the only thing. I think the functionality all the capabilities that we add to the platform, the ease of use, the flexibility, obviously with the deployment. But if you look at some of the numbers they are under these customers on this slide. And I've shared a lot of different stories about these customers. Which, by the way, it still amaze me every time I talk to one and I get the updates, you can see the power and the difference that Vertica is making. Equally important, if you look at a lot of these customers, they are the epitome of being able to deploy Vertica in a lot of different environments. Many of the customers on this slide are not using Vertica just on-premise or just in the cloud. They're using it in a hybrid way. They're using it in multiple different clouds. And again, we've been with them on that journey throughout, which is what has made this product and frankly, our roadmap and our vision exactly what it is. It's been quite a journey. And that journey continues now with the Vertica 10 release. The Vertica 10 release is obviously a massive release for us. But if you look back, you can see that building on that native columnar architecture that started a long time ago, obviously, with the C-Store paper. We built it to leverage that commodity hardware, because it was an architecture that was never tightly integrated with any specific underlying infrastructure. I still remember hearing the initial pitch from Mike Stonebreaker, about the vision of Vertica as a software only solution and the importance of separating the company from hardware innovation. And at the time, Mike basically said to me, "there's so much R&D in innovation that's going to happen in hardware, we shouldn't bake hardware into our solution. We should do it in software, and we'll be able to take advantage of that hardware." And that is exactly what has happened. But one of the most recent innovations that we embraced with hardware is certainly that separation of compute and storage. As I said previously, the public cloud providers offered this next generation architecture, really to ensure that they can provide the customers exactly what they needed, more compute or more storage and charge for each, respectively. The separation of compute and storage, compute from storage is a major milestone in data center architectures. If you think about it, it's really not only a public cloud innovation, though. It fundamentally redefines the next generation data architecture for on-premise and for pretty much every way people are thinking about computing today. And that goes for software too. Object storage is an example of the cost effective means for storing data. And even more importantly, separating compute from storage for analytic workloads has a lot of advantages. Including the opportunity to manage much more dynamic, flexible workloads. And more importantly, truly isolate those workloads from others. And by the way, once you start having something that can truly isolate workloads, then you can have the conversations around autonomic computing, around setting up some nodes, some compute resources on the data that won't affect any of the other data to do some things on their own, maybe some self analytics, by the system, etc. A lot of things that many of you know we've already been exploring in terms of our own system data in the product. But it was May 2018, believe it or not, it seems like a long time ago where we first announced Eon Mode and I want to make something very clear, actually about Eon mode. It's a mode, it's a deployment option for Vertica customers. And I think this is another huge benefit that we don't talk about enough. But unlike a lot of vendors in the market who will dig you and charge you for every single add-on like hit-buy, you name it. You get this with the Vertica product. If you continue to pay support and maintenance, this comes with the upgrade. This comes as part of the new release. So any customer who owns or buys Vertica has the ability to set up either an Enterprise Mode or Eon Mode, which is a question I know that comes up sometimes. Our first announcement of Eon was obviously AWS customers, including the trade desk, AT&T. Most of whom will be speaking here later at the Virtual Big Data Conference. They saw a huge opportunity. Eon Mode, not only allowed Vertica to scale elastically with that specific compute and storage that was needed, but it really dramatically simplified database operations including things like workload balancing, node recovery, compute provisioning, etc. So one of the most popular functions is that ability to isolate the workloads and really allocate those resources without negatively affecting others. And even though traditional data warehouses, including Vertica Enterprise Mode have been able to do lots of different workload isolation, it's never been as strong as Eon Mode. Well, it certainly didn't take long for our customers to see that value across the board with Eon Mode. Not just up in the cloud, in partnership with one of our most valued partners and a platinum sponsor here. Joy mentioned at the beginning. We announced Vertica Eon Mode for Pure Storage FlashBlade in September 2019. And again, just to be clear, this is not a new product, it's one Vertica with yet more deployment options. With Pure Storage, Vertica in Eon mode is not limited in any way by variable cloud, network latency. The performance is actually amazing when you take the benefits of separate and compute from storage and you run it with a Pure environment on-premise. Vertica in Eon Mode has a super smart cache layer that we call the depot. It's a big part of our secret sauce around Eon mode. And combined with the power and performance of Pure's FlashBlade, Vertica became the industry's first advanced analytics platform that actually separates compute and storage for on-premises data centers. Something that a lot of our customers are already benefiting from, and we're super excited about it. But as I said, this is a journey. We don't stop, we're not going to stop. Our customers need the flexibility of multiple public clouds. So today with Vertica 10, we're super proud and excited to announce support for Vertica in Eon Mode on Google Cloud. This gives our customers the ability to use their Vertica licenses on Amazon AWS, on-premise with Pure Storage and on Google Cloud. Now, we were talking about HDFS and a lot of our customers who have invested quite a bit in HDFS as a place, especially to store data have been pushing us to support Eon Mode with HDFS. So as part of Vertica 10, we are also announcing support for Vertica in Eon Mode using HDFS as the communal storage. Vertica's own Roth format data can be stored in HDFS, and actually the full functionality of Vertica is complete analytics, geospatial pattern matching, time series, machine learning, everything that we have in there can be applied to this data. And on the same HDFS nodes, Vertica can actually also analyze data in ORC or Parquet format, using External tables. We can also execute joins between the Roth data the External table holds, which powers a much more comprehensive view. So again, it's that flexibility to be able to support our customers, wherever they need us to support them on whatever platform, they have. Vertica 10 gives us a lot more ways that we can deploy Eon Mode in various environments for our customers. It allows them to take advantage of Vertica in Eon Mode and the power that it brings with that separation, with that workload isolation, to whichever platform they are most comfortable with. Now, there's a lot that has come in Vertica 10. I'm definitely not going to be able to cover everything. But we also introduced complex types as an example. And complex data types fit very well into Eon as well in this separation. They significantly reduce the data pipeline, the cost of moving data between those, a much better support for unstructured data, which a lot of our customers have mixed with structured data, of course, and they leverage a lot of columnar execution that Vertica provides. So you get complex data types in Vertica now, a lot more data, stronger performance. It goes great with the announcement that we made with the broader Eon Mode. Let's talk a little bit more about machine learning. We've been actually doing work in and around machine learning with various extra regressions and a whole bunch of other algorithms for several years. We saw the huge advantage that MPP offered, not just as a sequel engine as a database, but for ML as well. Didn't take as long to realize that there's a lot more to operationalizing machine learning than just those algorithms. It's data preparation, it's that model trade training. It's the scoring, the shaping, the evaluation. That is so much of what machine learning and frankly, data science is about. You do know, everybody always wants to jump to the sexy algorithm and we handle those tasks very, very well. It makes Vertica a terrific platform to do that. A lot of work in data science and machine learning is done in other tools. I had mentioned that there's just so many tools out there. We want people to be able to take advantage of all that. We never believed we were going to be the best algorithm company or come up with the best models for people to use. So with Vertica 10, we support PMML. We can import now and export PMML models. It's a huge step for us around that operationalizing machine learning projects for our customers. Allowing the models to get built outside of Vertica yet be imported in and then applying to that full scale of data with all the performance that you would expect from Vertica. We also are more tightly integrating with Python. As many of you know, we've been doing a lot of open source projects with the community driven by many of our customers, like Uber. And so now with Python we've integrated with TensorFlow, allowing data scientists to build models in their preferred language, to take advantage of TensorFlow. But again, to store and deploy those models at scale with Vertica. I think both these announcements are proof of our big bet number three, and really our commitment to supporting innovation throughout the community by operationalizing ML with that accuracy, performance and scale of Vertica for our customers. Again, there's a lot of steps when it comes to the workflow of machine learning. These are some of them that you can see on the slide, and it's definitely not linear either. We see this as a circle. And companies that do it, well just continue to learn, they continue to rescore, they continue to redeploy and they want to operationalize all that within a single platform that can take advantage of all those capabilities. And that is the platform, with a very robust ecosystem that Vertica has always been committed to as an organization and will continue to be. This graphic, many of you have seen it evolve over the years. Frankly, if we put everything and everyone on here wouldn't fit on a slide. But it will absolutely continue to evolve and grow as we support our customers, where they need the support most. So, again, being able to deploy everywhere, being able to take advantage of Vertica, not just as a business analyst or a business user, but as a data scientists or as an operational or BI person. We want Vertica to be leveraged and used by the broader organization. So I think it's fair to say and I encourage everybody to learn more about Vertica 10, because I'm just highlighting some of the bigger aspects of it. But we talked about those three market trends. The need to unify the silos, the need for hybrid multiple cloud deployment options, the need to operationalize business critical machine learning projects. Vertica 10 has absolutely delivered on those. But again, we are not going to stop. It is our job not to, and this is how Team Vertica thrives. I always joke that the next release is the best release. And, of course, even after Vertica 10, that is also true, although Vertica 10 is pretty awesome. But, you know, from the first line of code, we've always been focused on performance and scale, right. And like any really strong data platform, the execution engine, the optimizer and the execution engine are the two core pieces of that. Beyond Vertica 10, some of the big things that we're already working on, next generation execution engine. We're already actually seeing incredible early performance from this. And this is just one example, of how important it is for an organization like Vertica to constantly go back and re-innovate. Every single release, we do the sit ups and crunches, our performance and scale. How do we improve? And there's so many parts of the core server, there's so many parts of our broader ecosystem. We are constantly looking at coverages of how we can go back to all the code lines that we have, and make them better in the current environment. And it's not an easy thing to do when you're doing that, and you're also expanding in the environment that we are expanding into to take advantage of the different deployments, which is a great segue to this slide. Because if you think about today, we're obviously already available with Eon Mode and Amazon, AWS and Pure and actually MinIO as well. As I talked about in Vertica 10 we're adding Google and HDFS. And coming next, obviously, Microsoft Azure, Alibaba cloud. So being able to expand into more of these environments is really important for the Vertica team and how we go forward. And it's not just running in these clouds, for us, we want it to be a SaaS like experience in all these clouds. We want you to be able to deploy Vertica in 15 minutes or less on these clouds. You can also consume Vertica, in a lot of different ways, on these clouds. As an example, in Amazon Vertica by the Hour. So for us, it's not just about running, it's about taking advantage of the ecosystems that all these cloud providers offer, and really optimizing the Vertica experience as part of them. Optimization, around automation, around self service capabilities, extending our management console, we now have products that like the Vertica Advisor Tool that our Customer Success Team has created to actually use our own smarts in Vertica. To take data from customers that give it to us and help them tune automatically their environment. You can imagine that we're taking that to the next level, in a lot of different endeavors that we're doing around how Vertica as a product can actually be smarter because we all know that simplicity is key. There just aren't enough people in the world who are good at managing data and taking it to the next level. And of course, other things that we all hear about, whether it's Kubernetes and containerization. You can imagine that that probably works very well with the Eon Mode and separating compute and storage. But innovation happens everywhere. We innovate around our community documentation. Many of you have taken advantage of the Vertica Academy. The numbers there are through the roof in terms of the number of people coming in and certifying on it. So there's a lot of things that are within the core products. There's a lot of activity and action beyond the core products that we're taking advantage of. And let's not forget why we're here, right? It's easy to talk about a platform, a data platform, it's easy to jump into all the functionality, the analytics, the flexibility, how we can offer it. But at the end of the day, somebody, a person, she's got to take advantage of this data, she's got to be able to take this data and use this information to make a critical business decision. And that doesn't happen unless we explore lots of different and frankly, new ways to get that predictive analytics UI and interface beyond just the standard BI tools in front of her at the right time. And so there's a lot of activity, I'll tease you with that going on in this organization right now about how we can do that and deliver that for our customers. We're in a great position to be able to see exactly how this data is consumed and used and start with this core platform that we have to go out. Look, I know, the plan wasn't to do this as a virtual BDC. But I really appreciate you tuning in. Really appreciate your support. I think if there's any silver lining to us, maybe not being able to do this in person, it's the fact that the reach has actually gone significantly higher than what we would have been able to do in person in Boston. We're certainly looking forward to doing a Big Data Conference in the future. But if I could leave you with anything, know this, since that first release for Vertica, and our very first customers, we have been very consistent. We respect all the innovation around us, whether it's open source or not. We understand the market trends. We embrace those new ideas and technologies and for us true north, and the most important thing is what does our customer need to do? What problem are they trying to solve? And how do we use the advantages that we have without disrupting our customers? But knowing that you depend on us to deliver that unified analytics strategy, it will deliver that performance of scale, not only today, but tomorrow and for years to come. We've added a lot of great features to Vertica. I think we've said no to a lot of things, frankly, that we just knew we wouldn't be the best company to deliver. When we say we're going to do things we do them. Vertica 10 is a perfect example of so many of those things that we from you, our customers have heard loud and clear, and we have delivered. I am incredibly proud of this team across the board. I think the culture of Vertica, a customer first culture, jumping in to help our customers win no matter what is also something that sets us massively apart. I hear horror stories about support experiences with other organizations. And people always seem to be amazed at Team Vertica's willingness to jump in or their aptitude for certain technical capabilities or understanding the business. And I think sometimes we take that for granted. But that is the team that we have as Team Vertica. We are incredibly excited about Vertica 10. I think you're going to love the Virtual Big Data Conference this year. I encourage you to tune in. Maybe one other benefit is I know some people were worried about not being able to see different sessions because they were going to overlap with each other well now, even if you can't do it live, you'll be able to do those sessions on demand. Please enjoy the Vertica Big Data Conference here in 2020. Please you and your families and your co-workers be safe during these times. I know we will get through it. And analytics is probably going to help with a lot of that and we already know it is helping in many different ways. So believe in the data, believe in data's ability to change the world for the better. And thank you for your time. And with that, I am delighted to now introduce Micro Focus CEO Stephen Murdoch to the Vertica Big Data Virtual Conference. Thank you Stephen. >> Stephen: Hi, everyone, my name is Stephen Murdoch. I have the pleasure and privilege of being the Chief Executive Officer here at Micro Focus. Please let me add my welcome to the Big Data Conference. And also my thanks for your support, as we've had to pivot to this being virtual rather than a physical conference. Its amazing how quickly we all reset to a new normal. I certainly didn't expect to be addressing you from my study. Vertica is an incredibly important part of Micro Focus family. Is key to our goal of trying to enable and help customers become much more data driven across all of their IT operations. Vertica 10 is a huge step forward, we believe. It allows for multi-cloud innovation, genuinely hybrid deployments, begin to leverage machine learning properly in the enterprise, and also allows the opportunity to unify currently siloed lakes of information. We operate in a very noisy, very competitive market, and there are people, who are in that market who can do some of those things. The reason we are so excited about Vertica is we genuinely believe that we are the best at doing all of those things. And that's why we've announced publicly, you're under executing internally, incremental investment into Vertica. That investments targeted at accelerating the roadmaps that already exist. And getting that innovation into your hands faster. This idea is speed is key. It's not a question of if companies have to become data driven organizations, it's a question of when. So that speed now is really important. And that's why we believe that the Big Data Conference gives a great opportunity for you to accelerate your own plans. You will have the opportunity to talk to some of our best architects, some of the best development brains that we have. But more importantly, you'll also get to hear from some of our phenomenal Roth Data customers. You'll hear from Uber, from the Trade Desk, from Philips, and from AT&T, as well as many many others. And just hearing how those customers are using the power of Vertica to accelerate their own, I think is the highlight. And I encourage you to use this opportunity to its full. Let me close by, again saying thank you, we genuinely hope that you get as much from this virtual conference as you could have from a physical conference. And we look forward to your engagement, and we look forward to hearing your feedback. With that, thank you very much. >> Joy: Thank you so much, Stephen, for joining us for the Vertica Big Data Conference. Your support and enthusiasm for Vertica is so clear, and it makes a big difference. Now, I'm delighted to introduce Amy Fowler, the VP of Strategy and Solutions for FlashBlade at Pure Storage, who was one of our BDC Platinum Sponsors, and one of our most valued partners. It was a proud moment for me, when we announced Vertica in Eon mode for Pure Storage FlashBlade and we became the first analytics data warehouse that separates compute from storage for on-premise data centers. Thank you so much, Amy, for joining us. Let's get started. >> Amy: Well, thank you, Joy so much for having us. And thank you all for joining us today, virtually, as we may all be. So, as we just heard from Colin Mahony, there are some really interesting trends that are happening right now in the big data analytics market. From the end of the Hadoop hype cycle, to the new cloud reality, and even the opportunity to help the many data science and machine learning projects move from labs to production. So let's talk about these trends in the context of infrastructure. And in particular, look at why a modern storage platform is relevant as organizations take on the challenges and opportunities associated with these trends. The answer is the Hadoop hype cycles left a lot of data in HDFS data lakes, or reservoirs or swamps depending upon the level of the data hygiene. But without the ability to get the value that was promised from Hadoop as a platform rather than a distributed file store. And when we combine that data with the massive volume of data in Cloud Object Storage, we find ourselves with a lot of data and a lot of silos, but without a way to unify that data and find value in it. Now when you look at the infrastructure data lakes are traditionally built on, it is often direct attached storage or data. The approach that Hadoop took when it entered the market was primarily bound by the limits of networking and storage technologies. One gig ethernet and slower spinning disk. But today, those barriers do not exist. And all FlashStorage has fundamentally transformed how data is accessed, managed and leveraged. The need for local data storage for significant volumes of data has been largely mitigated by the performance increases afforded by all Flash. At the same time, organizations can achieve superior economies of scale with that segregation of compute and storage. With compute and storage, you don't always scale in lockstep. Would you want to add an engine to the train every time you add another boxcar? Probably not. But from a Pure Storage perspective, FlashBlade is uniquely architected to allow customers to achieve better resource utilization for compute and storage, while at the same time, reducing complexity that has arisen from the siloed nature of the original big data solutions. The second and equally important recent trend we see is something I'll call cloud reality. The public clouds made a lot of promises and some of those promises were delivered. But cloud economics, especially usage based and elastic scaling, without the control that many companies need to manage the financial impact is causing a lot of issues. In addition, the risk of vendor lock-in from data egress, charges, to integrated software stacks that can't be moved or deployed on-premise is causing a lot of organizations to back off the all the way non-cloud strategy, and move toward hybrid deployments. Which is kind of funny in a way because it wasn't that long ago that there was a lot of talk about no more data centers. And for example, one large retailer, I won't name them, but I'll admit they are my favorites. They several years ago told us they were completely done with on-prem storage infrastructure, because they were going 100% to the cloud. But they just deployed FlashBlade for their data pipelines, because they need predictable performance at scale. And the all cloud TCO just didn't add up. Now, that being said, well, there are certainly challenges with the public cloud. It has also brought some things to the table that we see most organizations wanting. First of all, in a lot of cases applications have been built to leverage object storage platforms like S3. So they need that object protocol, but they may also need it to be fast. And the said object may be oxymoron only a few years ago, and this is an area of the market where Pure and FlashBlade have really taken a leadership position. Second, regardless of where the data is physically stored, organizations want the best elements of a cloud experience. And for us, that means two main things. Number one is simplicity and ease of use. If you need a bunch of storage experts to run the system, that should be considered a bug. The other big one is the consumption model. The ability to pay for what you need when you need it, and seamlessly grow your environment over time totally nondestructively. This is actually pretty huge and something that a lot of vendors try to solve for with finance programs. But no finance program can address the pain of a forklift upgrade, when you need to move to next gen hardware. To scale nondestructively over long periods of time, five to 10 years plus is a crucial architectural decisions need to be made at the outset. Plus, you need the ability to pay as you use it. And we offer something for FlashBlade called Pure as a Service, which delivers exactly that. The third cloud characteristic that many organizations want is the option for hybrid. Even if that is just a DR site in the cloud. In our case, that means supporting appplication of S3, at the AWS. And the final trend, which to me represents the biggest opportunity for all of us, is the need to help the many data science and machine learning projects move from labs to production. This means bringing all the machine learning functions and model training to the data, rather than moving samples or segments of data to separate platforms. As we all know, machine learning needs a ton of data for accuracy. And there is just too much data to retrieve from the cloud for every training job. At the same time, predictive analytics without accuracy is not going to deliver the business advantage that everyone is seeking. You can kind of visualize data analytics as it is traditionally deployed as being on a continuum. With that thing, we've been doing the longest, data warehousing on one end, and AI on the other end. But the way this manifests in most environments is a series of silos that get built up. So data is duplicated across all kinds of bespoke analytics and AI, environments and infrastructure. This creates an expensive and complex environment. So historically, there was no other way to do it because some level of performance is always table stakes. And each of these parts of the data pipeline has a different workload profile. A single platform to deliver on the multi dimensional performances, diverse set of applications required, that didn't exist three years ago. And that's why the application vendors pointed you towards bespoke things like DAS environments that we talked about earlier. And the fact that better options exists today is why we're seeing them move towards supporting this disaggregation of compute and storage. And when it comes to a platform that is a better option, one with a modern architecture that can address the diverse performance requirements of this continuum, and allow organizations to bring a model to the data instead of creating separate silos. That's exactly what FlashBlade is built for. Small files, large files, high throughput, low latency and scale to petabytes in a single namespace. And this is importantly a single rapid space is what we're focused on delivering for our customers. At Pure, we talk about it in the context of modern data experience because at the end of the day, that's what it's really all about. The experience for your teams in your organization. And together Pure Storage and Vertica have delivered that experience to a wide range of customers. From a SaaS analytics company, which uses Vertica on FlashBlade to authenticate the quality of digital media in real time, to a multinational car company, which uses Vertica on FlashBlade to make thousands of decisions per second for autonomous cars, or a healthcare organization, which uses Vertica on FlashBlade to enable healthcare providers to make real time decisions that impact lives. And I'm sure you're all looking forward to hearing from John Yavanovich from AT&T. To hear how he's been doing this with Vertica and FlashBlade as well. He's coming up soon. We have been really excited to build this partnership with Vertica. And we're proud to provide the only on-premise storage platform validated with Vertica Eon Mode. And deliver this modern data experience to our customers together. Thank you all so much for joining us today. >> Joy: Amy, thank you so much for your time and your insights. Modern infrastructure is key to modern analytics, especially as organizations leverage next generation data center architectures, and object storage for their on-premise data centers. Now, I'm delighted to introduce our last speaker in our Vertica Big Data Conference Keynote, John Yovanovich, Director of IT for AT&T. Vertica is so proud to serve AT&T, and especially proud of the harmonious impact we are having in partnership with Pure Storage. John, welcome to the Virtual Vertica BDC. >> John: Thank you joy. It's a pleasure to be here. And I'm excited to go through this presentation today. And in a unique fashion today 'cause as I was thinking through how I wanted to present the partnership that we have formed together between Pure Storage, Vertica and AT&T, I want to emphasize how well we all work together and how these three components have really driven home, my desire for a harmonious to use your word relationship. So, I'm going to move forward here and with. So here, what I'm going to do the theme of today's presentation is the Pure Vertica Symphony live at AT&T. And if anybody is a Westworld fan, you can appreciate the sheet music on the right hand side. What we're going to what I'm going to highlight here is in a musical fashion, is how we at AT&T leverage these technologies to save money to deliver a more efficient platform, and to actually just to make our customers happier overall. So as we look back, and back as early as just maybe a few years ago here at AT&T, I realized that we had many musicians to help the company. Or maybe you might want to call them data scientists, or data analysts. For the theme we'll stay with musicians. None of them were singing or playing from the same hymn book or sheet music. And so what we had was many organizations chasing a similar dream, but not exactly the same dream. And, best way to describe that is and I think with a lot of people this might resonate in your organizations. How many organizations are chasing a customer 360 view in your company? Well, I can tell you that I have at least four in my company. And I'm sure there are many that I don't know of. That is our problem because what we see is a repetitive sourcing of data. We see a repetitive copying of data. And there's just so much money to be spent. This is where I asked Pure Storage and Vertica to help me solve that problem with their technologies. What I also noticed was that there was no coordination between these departments. In fact, if you look here, nobody really wants to play with finance. Sales, marketing and care, sure that you all copied each other's data. But they actually didn't communicate with each other as they were copying the data. So the data became replicated and out of sync. This is a challenge throughout, not just my company, but all companies across the world. And that is, the more we replicate the data, the more problems we have at chasing or conquering the goal of single version of truth. In fact, I kid that I think that AT&T, we actually have adopted the multiple versions of truth, techno theory, which is not where we want to be, but this is where we are. But we are conquering that with the synergies between Pure Storage and Vertica. This is what it leaves us with. And this is where we are challenged and that if each one of our siloed business units had their own stories, their own dedicated stories, and some of them had more money than others so they bought more storage. Some of them anticipating storing more data, and then they really did. Others are running out of space, but can't put anymore because their bodies aren't been replenished. So if you look at it from this side view here, we have a limited amount of compute or fixed compute dedicated to each one of these silos. And that's because of the, wanting to own your own. And the other part is that you are limited or wasting space, depending on where you are in the organization. So there were the synergies aren't just about the data, but actually the compute and the storage. And I wanted to tackle that challenge as well. So I was tackling the data. I was tackling the storage, and I was tackling the compute all at the same time. So my ask across the company was can we just please play together okay. And to do that, I knew that I wasn't going to tackle this by getting everybody in the same room and getting them to agree that we needed one account table, because they will argue about whose account table is the best account table. But I knew that if I brought the account tables together, they would soon see that they had so much redundancy that I can now start retiring data sources. I also knew that if I brought all the compute together, that they would all be happy. But I didn't want them to tackle across tackle each other. And in fact that was one of the things that all business units really enjoy. Is they enjoy the silo of having their own compute, and more or less being able to control their own destiny. Well, Vertica's subclustering allows just that. And this is exactly what I was hoping for, and I'm glad they've brought through. And finally, how did I solve the problem of the single account table? Well when you don't have dedicated storage, and you can separate compute and storage as Vertica in Eon Mode does. And we store the data on FlashBlades, which you see on the left and right hand side, of our container, which I can describe in a moment. Okay, so what we have here, is we have a container full of compute with all the Vertica nodes sitting in the middle. Two loader, we'll call them loader subclusters, sitting on the sides, which are dedicated to just putting data onto the FlashBlades, which is sitting on both ends of the container. Now today, I have two dedicated storage or common dedicated might not be the right word, but two storage racks one on the left one on the right. And I treat them as separate storage racks. They could be one, but i created them separately for disaster recovery purposes, lashing work in case that rack were to go down. But that being said, there's no reason why I'm probably going to add a couple of them here in the future. So I can just have a, say five to 10, petabyte storage, setup, and I'll have my DR in another 'cause the DR shouldn't be in the same container. Okay, but I'll DR outside of this container. So I got them all together, I leveraged subclustering, I leveraged separate and compute. I was able to convince many of my clients that they didn't need their own account table, that they were better off having one. I eliminated, I reduced latency, I reduced our ticketing I reduce our data quality issues AKA ticketing okay. I was able to expand. What is this? As work. I was able to leverage elasticity within this cluster. As you can see, there are racks and racks of compute. We set up what we'll call the fixed capacity that each of the business units needed. And then I'm able to ramp up and release the compute that's necessary for each one of my clients based on their workloads throughout the day. And so while they compute to the right before you see that the instruments have already like, more or less, dedicated themselves towards all those are free for anybody to use. So in essence, what I have, is I have a concert hall with a lot of seats available. So if I want to run a 10 chair Symphony or 80, chairs, Symphony, I'm able to do that. And all the while, I can also do the same with my loader nodes. I can expand my loader nodes, to actually have their own Symphony or write all to themselves and not compete with any other workloads of the other clusters. What does that change for our organization? Well, it really changes the way our database administrators actually do their jobs. This has been a big transformation for them. They have actually become data conductors. Maybe you might even call them composers, which is interesting, because what I've asked them to do is morph into less technology and more workload analysis. And in doing so we're able to write auto-detect scripts, that watch the queues, watch the workloads so that we can help ramp up and trim down the cluster and subclusters as necessary. There has been an exciting transformation for our DBAs, who I need to now classify as something maybe like DCAs. I don't know, I have to work with HR on that. But I think it's an exciting future for their careers. And if we bring it all together, If we bring it all together, and then our clusters, start looking like this. Where everything is moving in harmonious, we have lots of seats open for extra musicians. And we are able to emulate a cloud experience on-prem. And so, I want you to sit back and enjoy the Pure Vertica Symphony live at AT&T. (soft music) >> Joy: Thank you so much, John, for an informative and very creative look at the benefits that AT&T is getting from its Pure Vertica symphony. I do really like the idea of engaging HR to change the title to Data Conductor. That's fantastic. I've always believed that music brings people together. And now it's clear that analytics at AT&T is part of that musical advantage. So, now it's time for a short break. And we'll be back for our breakout sessions, beginning at 12 pm Eastern Daylight Time. We have some really exciting sessions planned later today. And then again, as you can see on Wednesday. Now because all of you are already logged in and listening to this keynote, you already know the steps to continue to participate in the sessions that are listed here and on the previous slide. In addition, everyone received an email yesterday, today, and you'll get another one tomorrow, outlining the simple steps to register, login and choose your session. If you have any questions, check out the emails or go to www.vertica.com/bdc2020 for the logistics information. There are a lot of choices and that's always a good thing. Don't worry if you want to attend one or more or can't listen to these live sessions due to your timezone. All the sessions, including the Q&A sections will be available on demand and everyone will have access to the recordings as well as even more pre-recorded sessions that we'll post to the BDC website. Now I do want to leave you with two other important sites. First, our Vertica Academy. Vertica Academy is available to everyone. And there's a variety of very technical, self-paced, on-demand training, virtual instructor-led workshops, and Vertica Essentials Certification. And it's all free. Because we believe that Vertica expertise, helps everyone accelerate their Vertica projects and the advantage that those projects deliver. Now, if you have questions or want to engage with our Vertica engineering team now, we're waiting for you on the Vertica forum. We'll answer any questions or discuss any ideas that you might have. Thank you again for joining the Vertica Big Data Conference Keynote Session. Enjoy the rest of the BDC because there's a lot more to come

Published Date : Mar 30 2020

SUMMARY :

And he'll share the exciting news And that is the platform, with a very robust ecosystem some of the best development brains that we have. the VP of Strategy and Solutions is causing a lot of organizations to back off the and especially proud of the harmonious impact And that is, the more we replicate the data, Enjoy the rest of the BDC because there's a lot more to come

ENTITIES

Entity	Category	Confidence
Stephen	PERSON	0.99+
Amy Fowler	PERSON	0.99+
Mike	PERSON	0.99+
John Yavanovich	PERSON	0.99+
Amy	PERSON	0.99+
Colin Mahony	PERSON	0.99+
AT&T	ORGANIZATION	0.99+
Boston	LOCATION	0.99+
John Yovanovich	PERSON	0.99+
Vertica	ORGANIZATION	0.99+
Joy King	PERSON	0.99+
Mike Stonebreaker	PERSON	0.99+
John	PERSON	0.99+
May 2018	DATE	0.99+
100%	QUANTITY	0.99+
Wednesday	DATE	0.99+
Colin	PERSON	0.99+
AWS	ORGANIZATION	0.99+
Vertica Academy	ORGANIZATION	0.99+
five	QUANTITY	0.99+
Joy	PERSON	0.99+
2020	DATE	0.99+
two	QUANTITY	0.99+
Uber	ORGANIZATION	0.99+
Stephen Murdoch	PERSON	0.99+
Vertica 10	TITLE	0.99+
Pure Storage	ORGANIZATION	0.99+
one	QUANTITY	0.99+
today	DATE	0.99+
Philips	ORGANIZATION	0.99+
tomorrow	DATE	0.99+
AT&T.	ORGANIZATION	0.99+
September 2019	DATE	0.99+
Python	TITLE	0.99+
www.vertica.com/bdc2020	OTHER	0.99+
One gig	QUANTITY	0.99+
Amazon	ORGANIZATION	0.99+
Second	QUANTITY	0.99+
First	QUANTITY	0.99+
15 minutes	QUANTITY	0.99+
yesterday	DATE	0.99+

Recommend Videos

Sentiment Analysis

AWS Comprehend

Search Results for Eon Mode: