Optimizing Query Performance and Resource Pool Tuning
>> Jeff: Hello, everybody, and thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is titled "Optimizing Query Performance and Resource Pool Tuning." I'm Jeff Ealing, I lead Vertica marketing, and I'll be your host for this breakout session. Joining me today are Rakesh Bankula and Abhi Thakur, Vertica product technology engineers and key members of the Vertica customer success team. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait: just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica forums at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of your slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Rakesh. >> Rakesh: Thank you, Jeff. Hello, everyone. My name is Rakesh Bankula. Along with me, we have Bir Abhimanu Thakur. We both are going to cover the present session on "Optimizing Query Performance and Resource Pool Tuning." In this session, we are going to discuss query optimization, how to review query plans, and how to get the best query plans with proper projection design. Then we will discuss resource allocation and how to find resource contention, and we will continue the discussion with some important use cases. In general, to successfully complete any activity or any project, the main thing it requires is a plan: a plan for that activity on what to do first, what to do next, and what things you can do in parallel. The next thing you need is the best people to work on that project as per the plan. So the first thing is a plan, and the next is the people, or resources. If you overload the same set of people or resources by involving them in multiple projects or activities, or if any person or resource is sick, it is going to impact the overall completion of that project. The same analogy applies to query performance too. For a query to perform well, it needs two main things: one is the best query plan, and the other is the best resources to execute the plan. Of course, in some cases, resource contention, whether from the system side or within the database, may slow down the query even when we have the best query plan and the best resource allocations. We are going to discuss each of these three items in a little more depth. Let us start with the query plan. The user submits a query to the database, and the Vertica optimizer generates the query plan. In generating query plans, the optimizer uses the statistics information available on the tables, so statistics play a very important role in generating good query plans. As a best practice, always maintain up-to-date statistics. If you want to see what a query plan looks like, add the EXPLAIN keyword in front of your query and run it; it displays the query plan on the screen. The other option is the DC explain plans table, which saves the explain plans of the queries run on the database.
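As a minimal sketch of those two steps (the table and query names are placeholders standing in for the example used later in this session):

```sql
-- Keep optimizer statistics current for the tables in the query.
SELECT ANALYZE_STATISTICS('public.store_sales');

-- Print the query plan without executing the query.
EXPLAIN
SELECT d.store_state, COUNT(*) AS products_sold
FROM store_sales s
JOIN store_dim d ON s.store_key = d.store_key
GROUP BY d.store_state;

-- Review plans of statements that already ran, via the Data Collector.
SELECT *
FROM dc_explain_plans
ORDER BY "time" DESC
LIMIT 50;
```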
So, once you have a query plan, the next step is checking it to make sure the plan is good. The first things I would look for are NO STATISTICS or PREDICATE OUT OF RANGE events. If you see any of these, it means a table involved in the query does not have up-to-date statistics, and it is time to update the statistics. The next things to look for in explain plans are broadcasts and resegments around the Join operators and global resegment groups around the Group By operators. These indicate that during the runtime of the query, data flows between the nodes over the network, which will slow down the query execution. As far as possible, prevent such operations. How to prevent them, we will discuss in the projection design topic. Regarding the Join order, check which tables are used on the inner side and the outer side, and how many rows each side is processing. In a hash join, picking the table having the smaller number of rows as the inner is good, because the join hash table is built in memory: the smaller the number of rows, the faster it is to build the hash table, and it also consumes less memory. Then check whether the plan is picking a query-specific projection or the default projections. If the optimizer is ignoring a query-specific projection and picking the default super projection, we will show you how to use query-specific hints to force the plan to pick the query-specific projection, which helps in improving the performance. Okay, here is one example query plan of a query trying to find the number of products sold from a store in a given state. This query has Joins between the store table and the product table, and a Group By operation to find the count. So, first look for NO STATISTICS, particularly around the storage access path. This plan is not reporting any NO STATISTICS events, which means statistics are up to date and the plan is good so far. Then check which projections are used; this is also around the storage access path. For the Join order check, we have a Hash Join in Path ID 4, having its inner in Path ID 6 processing 60,000 rows and its outer in Path ID 7 processing 20 million rows. The inner side processing fewer records is good: it helps in building the hash table quicker and using less memory. Then check for any broadcasts or resegments. The Joins in Path ID 4 and Path ID 3 both have inner broadcasts: the inners, having 60,000 records, are broadcast to all nodes in the cluster. This could impact the query performance negatively. These are some of the main things we normally check in explain plans. Till now, we have seen how to get good query plans: we need to maintain up-to-date statistics, and we also discussed how to review query plans. Projection design is the next important thing in getting good query plans, particularly in preventing broadcasts and resegments. Broadcasts and resegments happen during a Join operation when the existing segmentation clause of the projections involved in the Join does not match the Join columns in the query. These operations cause data to flow over the network and negatively impact the query performance, particularly when they transfer millions or billions of rows. These operations also cause the query to acquire more memory, particularly in the network send and receive operators. One can avoid these broadcasts and resegments with proper projection segmentation: say a Join is involved between two fact tables T1 and T2 on column I, then segment the projections on these T1 and T2 tables on column I. These are also called identically segmented projections.
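A minimal sketch of identically segmented projections for that T1/T2 example (table, column and projection names are illustrative):

```sql
-- Segment both fact-table projections on the join column i, so the join
-- can be resolved locally on each node without broadcast or resegment.
CREATE PROJECTION t1_seg_i AS
SELECT * FROM t1
ORDER BY i
SEGMENTED BY HASH(i) ALL NODES;

CREATE PROJECTION t2_seg_i AS
SELECT * FROM t2
ORDER BY i
SEGMENTED BY HASH(i) ALL NODES;

-- Populate the new projections so the optimizer can use them.
SELECT START_REFRESH();
```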
In other cases, where a Join is involved between a fact table and a dimension table, replicating, that is, creating an unsegmented projection on the dimension table, will help avoid broadcasts and resegments during the Join operation. During a Group By operation, global resegment groups cause data to flow over the network, which can also slow down query performance. To avoid these global resegment groups, create the segmentation clause of the projection to match the Group By columns in the query. In the previous slides, we have seen the importance of the projection segmentation clause in preventing broadcasts and resegments during the Join operation. The order by clause of the projection design plays an important role in picking the Join method. We have two important Join methods: Merge Join and Hash Join. Merge Join is faster and consumes less memory than Hash Join. The query plan uses Merge Join when both projections involved in the Join operation are segmented and ordered on the Join keys; in all other cases, the Hash Join method will be used. In the case of the Group By operation too, we have two methods: Group By Pipeline and Group By Hash. Group By Pipeline is faster and consumes less memory compared to Group By Hash. The requirement for Group By Pipeline is that the projection must be segmented and ordered by the grouping columns; in all other cases, the Group By Hash method will be used. So far, we have seen the importance of statistics and projection design in getting good query plans. As statistics are based on estimates over a sample of data, it is possible, in very rare cases, that the default query plan may not be as good as you expect, even after maintaining up-to-date statistics and a good projection design. To work around this, Vertica provides query hints to force the optimizer to generate even better query plans. Here are some example Join hints which help in picking the Join method and how to distribute the data, that is, broadcast or resegment on the inner or outer side, and also which Group By method to pick. The table-level hints help force the plan to pick a query-specific projection, or to skip a particular projection in a given query. All these hints are available in the Vertica documentation. Here are a few general hints useful in controlling how to load data, materialization, et cetera. We are going to discuss some examples of how to use these query hints. Here is an example of how to force the query plan to pick a Hash Join. The hint used here is JTYPE, which takes the arguments H for Hash Join and M for Merge Join. Where to place this hint: just after the Join keyword in the query, as shown in the example here. Another important Join hint is JFMT, the Join column format hint. This hint is useful when the Join columns are large varchars. By default, Vertica allocates memory based on the column data type definition, not by looking at the actual data length in those columns. Say, for example, a Join column is defined as varchar(1000), varchar(5000) or more, but the actual length of the data in the column is, say, less than 50 characters. Vertica is going to use more memory to process such columns in the Join, and this also slows down the Join processing. The JFMT hint is useful in this particular case: with the V parameter, it uses the actual length of the Join column. As shown in the example, using the JFMT(V) hint helps in reducing the memory requirement for this query, and it executes faster too.
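To make the placement concrete, here is a hedged sketch of how these Join hints sit in a query (table and column names are invented; the exact hint spellings are in the Vertica documentation):

```sql
-- Force a hash join (H) or merge join (M) for this join only:
-- the hint goes right after the JOIN keyword.
SELECT c.customer_name, SUM(s.amount)
FROM sales s
INNER JOIN /*+JTYPE(H)*/ customer_dim c
        ON s.customer_key = c.customer_key
GROUP BY c.customer_name;

-- Size join-key varchars by their actual data length (V = variable)
-- instead of the declared varchar(5000) width, to reduce join memory.
SELECT a.id, b.note
FROM t_a a
INNER JOIN /*+JFMT(V)*/ t_b b
        ON a.long_code = b.long_code;
```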
The DISTRIB hint helps in forcing how the inner or outer side of the Join operator is distributed, that is, broadcast or resegment. DISTRIB takes two parameters: the first is the outer side and the second is the inner side. As shown in the example, DISTRIB(A,R) placed after the Join keyword in the query forces a resegment of the inner side of the Join, while leaving it to the optimizer to choose the distribution method for the outer side. The GBYTYPE hint helps in forcing the query plan to pick Group By Hash or Group By Pipeline. As shown in the example, GBYTYPE(HASH), used just after the Group By clause in the query, forces this query to pick Group By Hash. Till now, we discussed the first part of query performance, which is query plans. Now we are moving on to the next part of query performance, which is resource allocation. The Resource Manager allocates resources to queries based on the settings of the resource pools. The main resources which resource pools control are memory, CPU and query concurrency. The important resource pool parameters, which we have to tune according to the workload, are memory size, planned concurrency, max concurrency and execution parallelism. The query budget plays an important role in query performance. Based on the query budget, the query planner allocates worker threads to process the query request. If the budget is very low, the query gets fewer threads, and if that query needs to process a huge amount of data, then the query takes a longer time to execute because of fewer threads, that is, less parallelism. In the other case, if the budget is very high and the query executed on the pool is a simple one, it results in a waste of resources: the query acquires the resources and holds them until it completes execution, and those resources are not available to other queries. Every resource pool has its own query budget. This query budget is calculated based on the memory size and planned concurrency settings of that pool. The RESOURCE_POOL_STATUS table has a column called QUERY_BUDGET_KB, which shows the budget value of a given resource pool. The general recommendation for the query budget is to be in the range of 1 GB to 10 GB. We can do a few checks to validate whether the existing resource pool settings are good or not. The first thing we can check is whether queries are getting resource allocations quickly, or waiting longer in the resource queues. You can check this in the RESOURCE_QUEUES table on a live system multiple times, particularly during your peak workload hours. If a large number of queries are waiting in resource queues, it indicates that the existing resource pool settings do not match your workload requirements: maybe the memory allocated is not enough, or the max concurrency settings are not proper. If queries are not spending much time in resource queues, it indicates resources are allocated to meet your peak workload, but you are not sure whether you have over- or under-allocated the resources. For this, check the budget in the RESOURCE_POOL_STATUS table to find any pool having a budget much larger than 8 GB or much smaller than 1 GB. Both over-allocation and under-allocation of the budget are not good for query performance. Also check the DC_RESOURCE_ACQUISITIONS table to find any transaction that acquired additional memory during query execution; this indicates the originally given budget was not sufficient for the transaction. Having too many resource pools is also not good. So how do you create resource pools, or tune existing resource pools? Resource pool settings should match the present workload. You can categorize the workload into well-known workload and ad-hoc workload. In the case of a well-known workload, you will be running the same queries regularly, like daily reports having the same set of queries processing a similar size of data, or daily ETL jobs, et cetera. In this case, the queries are fixed.
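Before moving on, here is a rough sketch of the budget and queue checks just described (the thresholds mirror the 1 GB to 10 GB guidance; adjust names to your Vertica version if they differ):

```sql
-- Per-pool query budget: flag pools far outside the ~1 GB - 10 GB range.
SELECT pool_name,
       query_budget_kb,
       ROUND(query_budget_kb / 1024.0 / 1024.0, 2) AS query_budget_gb
FROM resource_pool_status
ORDER BY query_budget_kb DESC;

-- Queries currently queued for resources; run this repeatedly at peak hours.
SELECT pool_name, COUNT(*) AS queued_queries
FROM resource_queues
GROUP BY pool_name
ORDER BY queued_queries DESC;
```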
Depending on the complexity of the queries, you can further divide the workload into low, medium and high resource-required pools. Then try setting the budget to 1 GB, 4 GB and 8 GB on these pools by allocating the memory and setting the planned concurrency as per your requirement. Then run the queries and measure the execution time. Try a couple of iterations, increasing and then decreasing the budget, to find the best settings for your resource pools. Then there is the category of ad-hoc workload, where there is no control over the number of users who are going to run queries concurrently, or over the complexity of the queries a user is going to submit. For this category, we cannot estimate the optimum query budget in advance. So for this category of workload, we have to use cascading resource pool settings, where a query starts on one pool and, based on the runtime cap you have set, the query then moves to a secondary pool. This helps prevent smaller queries from waiting a long time for resources when a big query is consuming all the resources and running for a long time. Now, some important resource pool monitoring tables. On a live system, you can query the RESOURCE_QUEUES table to find any transaction waiting for resources; you will also find which resource pool the transaction is waiting on, how long it has been waiting, and how many queries are waiting on the pool. RESOURCE_POOL_STATUS gives info on how many queries are in execution on each resource pool, how much memory is in use, and additional info. For the resource consumption of a transaction which has already completed, you can query DC_RESOURCE_ACQUISITIONS to find how much memory a given transaction used per node. The DC_RESOURCE_POOL_MOVE table shows which transactions moved from the primary to the secondary pool in the case of cascading resource pools. DC_RESOURCE_REJECTIONS gives info on which node and for which resource a given transaction failed or was rejected. The QUERY_CONSUMPTION table gives info on how much CPU, disk and network resources a given transaction utilized. Till now, we discussed query plans and how to allocate resources for better query performance. It is possible for queries to perform slower when there is any resource contention. This contention can be within the database or from the system side. Here are some important system tables and queries which help in finding resource contention. The DC_QUERY_EXECUTIONS table gives information at the transaction level on how much time each execution step took: how much time it took for planning, resource allocation, actual execution, et cetera. If the time taken is more in planning, which is mostly due to catalog contention, you can query the DC_LOCK_RELEASES table as shown here to see how long transactions are waiting to acquire the global catalog lock (GCLX) and how long transactions are holding GCLX. Normally, GCLX acquire and release should be done within a couple of milliseconds. If transactions are waiting a few seconds to acquire GCLX, or holding GCLX longer, it indicates some catalog contention, which may be due to too many concurrent queries, or long-running queries, or system services holding catalog mutexes and causing other transactions to queue up. The queries given here, particularly against these system tables, will help you further narrow down the contention. You can query the SESSIONS table to find any long-running user queries. You can query the SYSTEM_SERVICES table to find any service, like analyze row counts, moveout or mergeout operations, running for a long time. The DC slow events table gives info on what slow events are happening.
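Going back to the cascading setup described above, a minimal sketch could look like this (pool names, sizes and the runtime cap are illustrative, not recommendations):

```sql
-- Secondary pool for the long-running tail of ad-hoc queries.
CREATE RESOURCE POOL adhoc_long
  MEMORYSIZE '40G'
  PLANNEDCONCURRENCY 4
  MAXCONCURRENCY 4;

-- Primary pool: short queries finish here; anything that exceeds the
-- RUNTIMECAP cascades to the secondary pool instead of blocking others.
CREATE RESOURCE POOL adhoc_short
  MEMORYSIZE '20G'
  PLANNEDCONCURRENCY 20
  RUNTIMECAP '30 seconds'
  CASCADE TO adhoc_long;

-- Point the ad-hoc users at the primary pool.
ALTER USER adhoc_user RESOURCE POOL adhoc_short;
```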
You can also query the SYSTEM_RESOURCE_USAGE table to find any particular system resource, like CPU, memory, disk I/O or network throughput, saturating on any node. It is possible that one slow node in the cluster could negatively impact the overall performance of queries. To identify any slow node in the cluster, we use a simple "SELECT 1" query. A "SELECT 1" query just executes on the initiator node, and on a good node it returns within 50 milliseconds. As shown here, you can use a script to run this "SELECT 1" query on all nodes in the cluster. You can repeat this test multiple times, say five to 10 times, then review the time taken by this query on all nodes in all iterations. If there is any one node taking more than a few seconds, compared to other nodes taking just milliseconds, then something is wrong with that node. To find what is going on with the node which took more time for the "SELECT 1" query, run perf top. Perf top gives info on the top functions in which the system is spending most of its time. These functions can be kernel functions or Vertica functions, as shown here. Based on where the system is spending most of its time, we will get some clue on what is going on with that node. Abhi will continue with the remaining part of the session. Over to you, Abhi. >> Bir: Hey, thanks, Rakesh. My name is Abhimanu Thakur, and today I will cover some performance cases which we addressed recently in our customer clusters, applying the best practices just shown by Rakesh. Now, to fix a performance problem, it is always easier if we know where the problem is. And to understand that, like Rakesh just explained, the life of a query has different phases. The phases are pre-execution, which is the planning; execution; and post-execution, which is releasing all the acquired resources. This is very similar to a plane taking a flight path, where it prepares itself, gets onto the runway, takes off and lands back onto the runway. So, let's prepare our flight to take off. This is a use case from a dashboard application where the dashboard fails to refresh once in a while, and there is a batch of queries which are sent by the dashboard to the Vertica database. Let's see how we can find where the failure or the slowness is. To review the dashboard application, these are very short queries, so we need to look at the historical executions, and from the historical executions we try to find where exactly the time is spent, whether it is in the planning phase, the execution phase or in post-execution, and whether they are pretty consistent all the time, which would mean the plan has not changed across executions. This will also help us determine what memory is used and whether the memory budget is ideal. As just shown by Rakesh, the budget plays a very important role. DC_QUERY_EXECUTIONS is the one-stop place to go and find your timings, whether the time is in planning, in execution, or in an abandoned plan. So, looking at the queries which we received and the times from the scrutinize, we find the average execution is pretty consistent, and there is some extra time spent in the planning phase, which usually points to resource contention. This is a very simple matrix which you can follow to find whether you have issues: system resource contention, catalog contention and resource pool contention all contribute, mostly because of concurrency.
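As a hedged sketch, the per-phase timing for one transaction can be pulled roughly like this (the transaction ID is a placeholder, and data collector column names can differ slightly between Vertica versions):

```sql
-- How long each execution step took for one transaction, per node.
SELECT node_name,
       execution_step,
       completion_time - "time" AS step_duration
FROM dc_query_executions
WHERE transaction_id = 45035996273704970
ORDER BY node_name, "time";
```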
Let's see if we can drill down further to find the issue in these dashboard application queries. To get the concurrency, we pull out the number of queries issued, the max concurrency achieved, the number of threads, and the overall percentage of query duration, and all this data is available in the V-Advisor report. As soon as you provide a scrutinize, we generate the V-Advisor report, which gives us complete insight into this data. Based on this, we definitely see there is very high concurrency, and most of the queries finish in less than a second, which is good. There are queries which go beyond 10 seconds and over a minute, but definitely, the cluster had high concurrency. What is more interesting to find from this graph is... I'm sorry if this is not very readable, but the topmost line you see is the selects, and the bottom two or three lines are the creates, drops and alters. So this cluster is having a lot of DDLs and DMLs issued, and what they contribute is: a large number of DDLs and DMLs causes catalog contention. So we need to make sure that the batch we are sending is not causing too much catalog contention in the cluster, which delays the complete planning phase as the system resources are busy. At the same time, what we also noticed is analyze statistics running every hour, which is very aggressive, I would say. It should be scheduled only as needed, so if a table has not changed drastically, don't schedule analyze statistics for that table. A couple more settings, as shared by Rakesh, also play an important role in the moveout and mergeout operations. So now, let's look at the budget of the query. The budget of the resource pool is currently at about 2 GB, which is the 75th-percentile memory. Queries are definitely executing at that same budget, which is good and bad, because these are dashboard queries and they don't need such a large amount of memory. The max memory, as shown here from the captured data, is about 20 GB, which is pretty high. So what we did is, we found that there are some queries run by a different user who is running in the same dashboard pool, which should not be happening, as the dashboard pool is something like a premium pool, kind of a private runway to run your own private jet. And why I made that statement is, as you see, resource pools are like runways. You have different resource pools, different runways, to cater to different types of planes, different types of flights. So, as you manage your resource pools properly, your flights can take off and land easily. From this we did find that the budget is something which could be tuned better. Now let's look... As we saw in the previous numbers, there were some resource waits, and like I said, resource pools are like your runways. If you have everything ready and your plane is waiting just to get onto the runway to take off, you would definitely not want to be in that situation. In this case, what we found is that there are quite a number of queries which waited in the pool, and they waited almost a second, which can be avoided by modifying the amount of resources allocated to the resource pool. So in this case, we increased the resource pool to provide more memory, which is 80 GB, and reduced the budget from 2 GB to 1 GB, also making sure that the planned concurrency is increased to match the memory budget, and we also moved out the user who was running in the dashboard query pool.
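Roughly, those changes could look like the following (the pool and user names are invented for illustration; the per-query budget works out to approximately MEMORYSIZE divided by PLANNEDCONCURRENCY, so 80 GB across 80 planned queries is about 1 GB each):

```sql
-- More total memory for the pool, but a smaller per-query budget.
ALTER RESOURCE POOL dashboard_pool
  MEMORYSIZE '80G'
  PLANNEDCONCURRENCY 80;

-- Move the unrelated heavy user off the premium dashboard pool.
ALTER USER batch_user RESOURCE POOL general;
```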
The next thing we found in the resource pool is the execution parallelism, how it affects the query and what changing that number does. Execution parallelism is something which allocates the number of threads, network buffers and all the data structures around them before the query even executes. In this case, the pool had AUTO, which defaults to the core count. Dashboard queries, not being too heavy on resources, just need to get what they want, so we reduced the execution parallelism to eight, and this drastically brought down the number of threads which were needed, without changing the execution time. So, this is all we saw about how we could tune before the query takes off. Now, let's see what path we followed. This is the exact path we followed; hope this diagram helps. These are the things which we took care of: tune your resource pool, adjust your execution parallelism based on the type of queries the resource pool is catering to, match your memory sizes, and don't be too aggressive with your resource budget. See if you could replace your staging tables with temporary tables, as they help a lot in reducing the DDLs and DMLs and reducing the catalog contention; in the places where you cannot replace them, use truncate table. Reduce your analyze statistics duration and, if possible, follow the best practices for the moveout and mergeout operations. So moving on, let's let our query take flight and see what best practices can be applied here. This is another, I would say, very classic example of a query which had been running fine and suddenly starts to fail. I think most of you have seen this error: the inner Join did not fit in memory. What does this mean? It basically means the inner table is trying to build a large hash table, and it needs a lot of memory to fit. There are only two reasons why it could fail: one, your statistics are outdated, or two, your resource pool is not letting you grab all the memory needed. In this particular case, the resource pool is not allowing all the memory it needs: as you see, the query acquired 180 GB of memory, and it failed. In most cases, you should be able to figure out the issue by looking at the explain plan of the query, as shared by Rakesh earlier. But in this case, the explain plan looks awesome: there's no operator like an inner broadcast or outer resegment or anything like that, it's just a hash Join. So looking further, we look into the projections. The inner is an unsegmented projection, the outer is segmented. Excellent, this is what is needed. So in this case, what we would recommend is to go find further what the cost is. The cost to scan these rows seems to be pretty high. There are the DC_QUERY_EXECUTIONS table and the execution engine profiles in Vertica, which help you drill down to the smallest amount of time, memory, and number of rows used by individual operators per path. While looking into the execution engine profile details for this query, we found the amount of time spent is on the Join operator, and it's the Join inner hash table build time which is taking a huge amount of time; it is basically just waiting for the lower operators, the Scan and the StorageUnion, to pass the data. So, how can we avoid this? Clearly, we can avoid it by creating a segmented projection instead of an unsegmented projection on such a large table with one billion rows, following the standard practice to create the projection.
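For reference, the per-operator drill-down described a moment ago can be approximated with a query along these lines (the transaction and statement IDs are placeholders; counter names vary by operator and version):

```sql
-- Which operators and counters account for the most time in this statement?
SELECT path_id,
       operator_name,
       counter_name,
       MAX(counter_value) AS max_counter_value
FROM execution_engine_profiles
WHERE transaction_id = 45035996273704970
  AND statement_id = 1
  AND counter_name ILIKE '%time%'
GROUP BY path_id, operator_name, counter_name
ORDER BY max_counter_value DESC
LIMIT 20;
```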
So this is the projection which was created: it was segmented on the column which is part of the select clause over here. Now, the plan looks nice and clean, and the execution of this query now takes 22 minutes 15 seconds, and the most important thing you see is the memory: it executes in just 15 GB of memory. Basically, what was done is that the unsegmented projection, which acquires a lot of memory per node, is no longer used, so the query is not taking that much memory and it executes faster, as the data has been divided across the nodes and each node executes only a small share of the data. But the customer was still not happy, as 22 minutes is still high. So let's see if we can tune it further to make the cost go down and the execution time go down. Looking at the explain plan again, like I said, most of the time you can just look at the plan and say, "What's going on?" In this case, there is an inner resegment. So, how can we avoid the inner resegment? We can avoid it, and most resegments, just by creating projections which are identically segmented, which means your inner and outer both have the same segmentation clause. The same was done over here: as you see, the projection is now segmented on sales ID and also ordered by sales ID, which helps the query execution drop from 22 minutes to eight minutes, and now the memory acquired is just equal to the pool budget, which is 8 GB. And most importantly, if you see, the hash Join is converted into a merge Join, the projections being ordered by the segmentation clause, which is also the Join clause. So what this gives us is that there is no global data redistribution, and by changing the projection design, we have improved the query performance. But there are times when you cannot change the projection design and there's not much that can be done. In all those cases, as in the first case, after the failure of the inner Join, Vertica replans the query with the spill-to-disk operator. You could let the system degrade by acquiring 180 GB for however many minutes the query ran, or you could simply use this hint to run the query with the spill in the very first go and let the system have all the resources it needs. So, use hints wherever possible, and spill to disk is definitely your option where there is no other option for you to change your projection design. Now, there are times when you find that you have gone through your query plan, you have gone through everything else, and there's not much you see anywhere, but you look at the query and you feel, "I think I can rewrite this query." And what makes you decide that is, you look at the query and you see that the same table has been accessed several times in the query plan, so how can I rewrite this query to access the table just once? In this particular use case, a very simple use case, a table is scanned three times for several different filters and then a union. Union in Vertica is kind of a costly operator, I would say, because union does not know the amount of data which will be coming from the underlying query, so we allocate a lot of resources to keep the union running. Now, we could simply replace all these unions with a simple "OR" clause. The simple "OR" clause changes the complete plan of the query, the cost drops drastically, and now the optimizer knows almost the exact number of rows it has to process.
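A generic sketch of that union-to-OR rewrite (the table and filter values are made up purely to show the shape of the change):

```sql
-- Before: the same table is scanned once per filter and the results unioned.
SELECT order_id, amount FROM orders WHERE region = 'EAST'
UNION ALL
SELECT order_id, amount FROM orders WHERE region = 'WEST'
UNION ALL
SELECT order_id, amount FROM orders WHERE region = 'NORTH';

-- After: a single scan with the filters combined into one OR predicate,
-- so the optimizer can estimate the row count far more accurately.
SELECT order_id, amount
FROM orders
WHERE region = 'EAST'
   OR region = 'WEST'
   OR region = 'NORTH';
```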
So, look at your query plans and see if you could make the execution engine or the optimizer do a better job just by doing some small rewrites. If there are some tables that are frequently accessed, you could even use a "WITH" clause, which can do an early materialization and get you better performance, or use the union-to-OR rewrite which I just shared, replace your left Joins with right Joins, and use the hints, like shared earlier, for changing your join types. This is the exact path we have followed in this presentation. Hope this presentation was helpful in addressing, or at least finding, some performance issues in your queries or in your clusters. So, thank you for listening to our presentation. Now we are ready for Q&A.
Chris Wahl, Rubrik | VMworld 2017
>> ANNOUNCER: Live from Las Vegas, it's theCUBE, covering VMworld 2017. Brought to you by VMware and its ecosystem partners. >> Hi, I'm Stu Miniman, here with John Troyer, and excited to welcome back to the program Chris Wahl, who's the Chief Technologist at Rubrik. Chris, thanks for joining us. >> Oh, my pleasure. It's my first VMworld CUBE appearance, so I'm super stoked. >> Yeah, we're pretty excited that you hang out with, you know, just a couple of geeks as opposed to, what is it, Kevin Durant and Ice Cube. Is this a technology conference, or do you and Bipul work for some Hollywood big-time company? >> It's funny you say that, there'll be more tomorrow, so I'll allude to that. But ideally, why not hang out with some cool folks. I mean, I live in Oakland. Hip hop needs to be represented, and the Golden State Warriors. >> It's pretty cool. I'm looking forward to the party. I know there will be huge lines when KD comes to throw down with a bunch of people, so looking forward to those videos. So we've been looking at Rubrik since, you know, it came out of stealth. I got to interview Bipul really early on, so we've been watching. You're on like the 4.0 release now, right? How long has that taken, and why don't you bring us up to speed with what's going on with Rubrik? >> Yeah, it's our ninth major release over basically eight quarters, and along with that, we've announced we've hit like a 150 million dollar run rate. When we started, it was all about VMware, doing backups and providing those backups a place to land, meaning object store or AWS S3. And now we protect Hyper-V, Acropolis from Nutanix, obviously the VMware suite; we can archive to Azure; there are like 30-some-odd integration points with various storage vendors, archive vendors, public cloud, et cetera. And the Alta release, which is 4.0, really extends that, because now, not only can we provide backup and recovery and archive, which is kind of our bread and butter, but you can archive that to public cloud and start running those workloads. Right, so with what we call CloudOn, I can take either on-demand or archive data that's been sent to S3, and I can start building virtual machines, like I said, on demand. I can take the AMI, put it in EC2 and start running it right now, and start taking advantage of the services, and it's a backup product. Like, that's what always kind of blows my mind. That's not the usual use case; it's something that we unlock from backup and archive data. >> One of the challenges I usually see out there is that people are like, oh, Rubrik, you know, they do backups for VMware. You're very much involved in educating and getting out there and telling people about it; how do you get over the, oh wait, you heard what we were doing six months ago or six weeks ago, and now we're doing so much more? So how do you stay up with that? >> It's tough to keep up, obviously, because every quarter we basically have either some kind of major or a dot release that comes out. Realistically, I set the table a little bit differently. I say, what are you looking to do? What are the outcomes that you're trying to drive? Simplicity's a huge one, because everyone's dealing with, I have a backup storage vendor and I have a storage vendor and I have a tape vendor, and all this other hodgepodge of things that they're dealing with.
They're looking to save money, but ultimately they're trying to automate, start leveraging the cloud, and really take the headache out of providing something that's very necessary. And then I start talking about the services they can add beyond that, because it's not just about taking a backup and leaving it in some rotting archive for 10 years or whatever; it's really, what can I do with the data once I have this deduplicated and compressed kind of pool that I can start drawing from? And that's where people start to, their mind gets blown a little bit. The individual feature and checkbox sets are what they are, you know, like if you happen to need Hyper-V or Acropolis or whatever; it's really just where you are on that journey to start taking advantage of this data. And I think that's where people start to get really excited and we start whiteboarding and nerding out a little bit. >> Well Chris, so don't keep us in suspense, what kinds of things can you do once you have a copy of this data? It's all live, it's either on solid state or spinning disk or in the cloud somewhere. That's very different than just putting it on tape, so what do I do now that I have all this data pool? >> So probably the most common use case is, I have a VPC and a security group in Amazon. That exists today. I'm archiving to S3 in some way, shape, or form, either IA or whatever flavor of S3 you want. And then you're thinking, well, I have these applications, what else can I do with them? What if I put it into a query service or a relational database service, or what if I spun up 10 different copies because I need to for load testing or some type of testing? I mean, it all falls under the funnel of dev test, but I hate just capping it that way, because I think it's unimaginative. Realistically, we're saying here you have this giant pile of compute, and you're already leveraging the storage part of it, the object store that is S3. What if you could unlock all the other services with no heavy lift? And the workload is actually built as an AMI. Right, so an AMI, it's actually running in EC2, so you don't necessarily have to extend the hypervisor layer or anything like that. And it's essentially three questions, from the product perspective: what security group, what VPC, and what shape or format you want it to be, like large, small, xlarge, et cetera. That's it. So think about unlocking cloud potential for less technical people, or people that are dipping their toe into a public cloud. It really unlocks that ability, and we control the data plane across it. >> Just one thing on that, because it's interesting: dev test a lot of times used to get shoved to the back, and it was like, oh, you can run on that old gear, you know, you don't have any money for it. We've actually found that it can increase the company's agility, and development is a big part of creating big cool things out of a company, so don't undersell what improving dev test can do. So do you have some customer stories or great things that customers have done with this capability? >> Yeah, but to be fair, at first when I saw that we were going to start basically taking VMware backups, pushing them into archive and then turning those into EC2 instances of any shape or quantity, I was like, that's kind of crazy, who has really wanted that? Then I started talking to customers and it was a huge request.
And a lot of times, my architectural background would think, lift and shift? Oh no, don't necessarily do that, I'm not a huge fan of that process. But while that is certainly something you can do, what they're really looking to do is, well, I have this binary package or application suite that's running on an ELK stack or some Linux distro or whatever, and I can't do anything with it because it's in production and it's making me money, but I'd really like to see what could be done with it. Or potentially, can I just eliminate it completely and turn it into a service? And so I've got some customers where what they're doing is, they're archiving already, and they have the product set up so that every time a new snapshot is taken and sent to the cloud, it automatically builds that EC2 instance and starts running it. So they have a collection of various state points that they can start playing with. The actual backup is immutable, but then they're saying, alright, exactly what I kind of alluded to earlier, what if I start using a native service in the cloud? Or potentially just discard that workload completely and start turning it into a service, or refactor it, replatform it, et cetera. And they're not having to provision; usually you have to buy infrastructure to do that, like you're talking about the waterfall of stuff that turns into dev stuff three years later. They don't have to do that, they can literally start taking advantage of this cloud resource, run it for an hour or so, because devs are great at CI/CD pipelines: let's just automate the whole stack, let's answer our question by running queries through Jenkins or something like that, and then throw it away, and it cost a couple of bucks. I think that's pretty huge. >> Well Chris, can you also use this capability for DR, for disaster recovery? Can you rehydrate your AMIs up there if everything goes south in your data center? >> Absolutely. I mean, it's a journey and this is 4.0, so I'm not going to wave my hands and say that it's an amazing DR solution, but the third kind of use case that we highlight with our product is exactly that. You can take the workloads either as a planned event, and say I'm actually putting it here and this is a permanent thing, or an unplanned event, which is what we are all trying to avoid, where you're running the workloads in the cloud for some deterministic period of time, and at either the application layer or the file system layer, or even, like, a database layer, you're then protecting it using our Cloud Cluster technology, which is Rubrik running in the cloud. Right there, it has access to S3 and EC2 adjacently, with no net fee, and then you start protecting that and sending the data the other way, because Rubrik software can talk to any other Rubrik software; we don't care what format or package it's in. In the future we'd like to add more to that. I don't want to oversell it, but certainly that's the journey. >> Chris, tell us about how your customers are feeling about the cloud in general. You've lived with the VM community for a lot of years, like many of us, and that journey to cloud: what do hybrid and multi-cloud mean to them, and what have you been seeing at Rubrik over the last year? >> Yeah, it's, ahh, everybody has a different definition between hybrid, public, private-- >> Stu: Every customer I ever talked to will have a different answer to that. >> I just say multi-cloud, because it feels like the safest and most technically correct version of that definition. It's certainly something that everyone's looking to do. I think the "I want to build a private cloud" phase of the journey has somewhat expired in some cases. >> Stu: Did you see Pat's keynote this morning? >> Yeah, the "I want to build a private cloud using OpenStack and, you know, build all my widgets" thing. I feel that era of marketing, that was kind of like 2008 or 2010, so that kind of marketing message has died a little bit. It's really more, I have on-prem stuff, I'm trying to modernize it using hyperconverged or using software-defined X, you know, networking, et cetera. But ultimately I have to start leveraging the places where my PaaS, my IaaS and my SaaS are going to start running. How do I then cobble all that together? I mean, at the C level, I need visibility, I need control, I need to make executable decisions that are financially impactful. And so having something that can look across those different ecosystems and give you actionable data, like here's where it's running, here's where it could run, it's all still just a business decision based on SLA. It's powerful. But then as you go kind of down-message, for maybe a director or someone who's managing IT, someone's breathing down their neck saying, we've got to have a strategy, but they're technically savvy, they don't want to just put stuff in the cloud, get that huge bill and then have to explain that as well. So it kind of sits in a nice place where we can protect the modern apps, or I guess you can call them modern-slash-legacy, in the data center, but also start providing protection and a landing pad for the cloud native, to use an overarching term: the stuff that's built for cloud, that runs there, that's distributed and very sensitive to the fact that it charges per iota of use. >> Well Chris, originally Rubrik was deployed to customers as an appliance, right? So can you talk a little bit about that? The customer has many different options now: you can get open source, you can get commercial software, you can get appliances, you can get SaaS, and now it sounds like there's also a piece that can run in the cloud, so it's not just a box that sits in a data center somewhere. So can you talk about, again, what do customers want? What's the advantage of some of those different deployment mechanisms, what do you see? >> I'm not saying this as a stalling tactic, but I love that question. Because yes, when we started it made sense to build a turnkey appliance and make sure that it's simple. In deployment, we used to say it can deploy in an hour, and that includes the time to take it out of the box. But that only goes so far, because that's one use case. So certainly, for the first year or so, that was where we were driving the product, as a scale-out node-based solution. Then we added Rubrik Edge as a virtual appliance, and really it was meant for the, I have a data center and I'm covering those remote offices, type of use cases, and we required that folks kind of tether the two, because it's a single node that's really just ingesting data and bringing it back using policy. Then we introduced Cloud Cluster in 3.2, which is a couple of releases ago. And that allows you to literally build a four-plus-node cluster in Azure or AWS: basically you give us your account info and we share the AMI with you, or the VM in the case of Azure, and then you can just build it, right? And that's totally independent, like you can just be a customer. We have a couple of customers that are public where that's all they do: they deploy Cloud Cluster, they back up things in that environment, and then they replicate or archive to various clouds or various regions within clouds. And there's no requirement to buy the appliance, because that would be kind of no bueno. >> Sure. >> So right, there are various packages, and we have the idea now where you can bring your own hardware to the table and we'll sell you the software, so like Lenovo and Cisco and things like that. It can be your choice based on the relationships you have. >> Wow, Chris, your team has grown a lot, not just your personal team but the Rubrik team. I walked by the booth and, wait, I saw five more people that I know from various companies. Talk about the growth of, you know, Rubrik. You joined a year ago and it felt like a small company then. Now I get the reports from these financial analyst firms, like, have you seen the latest unicorn, Rubrik, and I'm like, Rubrik, I know those guys. And gals. So yeah, absolutely, talk about the growth of the company. What's the company hiring for? Tell us a little bit about the culture inside. >> Sure, I mean, it's actually been a little over two years now that I've been there, and it's kind of flying. I was in the first 50 hires for the company, so at the time I felt like the FNG, but I guess now I'm kind of like the old, old man. I think we're approaching or have crossed the 500-employee threshold, and we're talking eight quarters, essentially. A lot of investment across the world, right, so we decided very early on to invest in Europe as a market. We have offices in Utrecht in the Netherlands and in London in the UK, and we've got a bunch of engineering folks in India, so we've got two different engineering teams, as well as an excellent center of excellence, I think, in Kansas City. So there's a whole bunch of different roots that we're planting as a company, as well as a global kind of effort to make sales, support, product, engineering, and marketing, obviously, something that scales everywhere. It's not like all the engineers are in Palo Alto and Silicon Valley and everyone else is just in sales; we're kind of driving across everywhere. My team went from one to six over the last eight or nine months. So everything is growing, which I guess is good. >> As part of that you also moved to Silicon Valley, so how does it compare to the TV show? >> Chris: It's in Oakland. >> Well, it's close enough to Silicon Valley. >> It's Silicon Valley adjacent. I will say I used to visit all the time, you know, for various events and things like that, or for VMworld or whatnot. I always got the impression that I liked being there for about a week, and then I wanted to leave before I really started drinking the Kool-Aid a little heavily, so it's nice being just slightly out in the East Bay area. At the same time, I go to events and things now more as a local, and it's kind of awesome to hear, oh, I invented whatever technology, I invented Bootstrap or NPM or something like that, and they're just available to chat with. It's like the sunscreen song, where he says, you know, move to California, but leave before you turn soft. So at some point I might have to go back to Texas or something, just to keep the scaly rigidity of my persona intact. >> Yeah, so you missed the barbecue? >> Well, I don't know if you saw, Franklin's barbecue actually burned down during the hurricane, so. >> No. >> Yeah, if you're a huge barbecue fan in Austin, weep a tear, it might be bad mojo for a little bit. >> Wow. Alright, we were alluding at the very beginning of the interview, you've got some VIP guests. We don't talk too much about, like, oh, we're doing this tomorrow and everything, but you've got some cool activities, the all-stars, you know, some of the things. Give us a little viewpoint: what's the goal coming into VMworld this year, and what are some of the cool things that your team and the extended team are doing? >> Yeah, so kind of more on the nerdy fun side, one of my team, Rebecca Fitzhughes, built out this V All-Stars card deck, so we picked a bunch of influencers and people that, you know, friends and family kind of thing, built them some trading cards, and based on what you turn in, you can win prizes and things like that. A lot of other vendors have done things that I really respect, like SolidFire with the socks and the Cards Against Humanity, as an example. I wanted to do something similar, and Rebecca had a great idea and she executed on that. Beyond that, though, we obviously have Ice Cube coming in. He's going to be partying at the Marquee on Tuesday evening, so he'll be hanging around, you know, the king of hip hop there. And on a more fun, charitable note, we actually have Kevin Durant coming in tomorrow. We are shooting hoops for his charity fund, so everybody that sinks a goal, or, ahh, I'm obviously not a basketball person, but whoever sinks the ball into the hoop gets two dollars donated to his charity fund, and you can win a jersey and things like that. So kind of spreading it across sports, music and various digital transformation type things, to make sure that everyone who comes in has a good time. VMware's our roots, right? 1.0, the product was focused on that environment. It's been my roots for a long time, and we want to pay that back to the community. You can't forget where you came from, right? >> Alright, Chris Wahl, great to catch up with you. Thanks for joining us, sporting your Alta t-shirt, your Rubrik... >> I'm very branded. >> John Troyer and I will be back with lots more coverage here at VMworld 2017. You're watching theCUBE.