The Shortest Path to Vertica – Best Practices for Data Warehouse Migration and ETL
>> Jeff: Hello everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled The Shortest Path to Vertica: Best Practices for Data Warehouse Migration and ETL. I'm Jeff Healey, I'm from Vertica marketing, and I'll be your host for this breakout session. Joining me today are Marco Gessner and Mauricio Felicia, Vertica engineers who are joining us from the EMEA region. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica forums at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Marco.

>> Marco: Hello everybody, this is Marco speaking, a sales engineer from EMEA. I'll just get going. This is the agenda: part one will be done by me, part two will be done by Mauricio. The agenda is, as you can see: big bang or piece by piece; migration of the DDL, the physical data model; migration of ETL, or rather ELT, plus BI functionality; what to do with stored procedures; what to do with any existing user-defined functions; and migration of the data itself. Part two will be by Mauricio. Mauricio, do you want to introduce yourself?

>> Mauricio: Yes, hello everybody, my name is Mauricio Felicia and I'm a Vertica pre-sales engineer like Marco. I'm going to talk about how to optimize the data warehouse using some specific Vertica techniques like table flattening and live aggregate projections. So let me start with a quick overview of the data warehouse migration process we are going to talk about today. Normally we suggest starting by migrating the current data warehouse, the old database, with limited or minimal changes in the overall architecture. Clearly we will have to port the DDL and redirect the data access tools to the new platform, but we should minimize the amount of changes in this initial phase in order to go live as soon as possible. In the second phase we can start optimizing the data warehouse, again with no or minimal changes in the architecture as such. During this optimization phase we can, for example, create projections for some specific queries, optimize encoding, or change some of the resource pools; this is something that we normally do if and when needed. And finally, again if and when needed, we go through an architectural redesign using the full set of Vertica techniques, in order to take advantage of all the features we have in Vertica. This is normally an iterative approach, so we may go back and tune some specific feature before moving back to the architecture and design. We are going through this process in the next few slides.

>> Marco: OK. In order to encourage everyone to keep using their common sense when migrating to a new database management system (people are often afraid of it), it's often useful to use the analogy of a house move.
In your old home you might have developed solutions for your everyday life that make perfect sense there. For example, if your old Saint Bernard dog can't walk anymore, you might be using a forklift to heave him in through the window of the old home. Well, in the new home, consider the elevator, and don't complain that the window is too small to fit the dog through. It is very much the same with Vertica.

To make the transition gentle, and to remain in my analogy of the house move: picture your new house as your new holiday home. Begin to install everything you miss and everything you like from your old home, and once you have everything you need in your new house, you can shut down the old one. So move piece by piece and go for quick wins to make your audience happy. You do big bang only if they are going to retire the platform you are sitting on, when you are really on a sinking ship. Otherwise, again: identify quick wins, implement and publish them quickly in Vertica, reap the benefits, enjoy the applause, use the gained reputation for further funding, and if you find that nobody is using the old platform anymore, you can shut it down. If you really have to, you can still go big bang in one go, but only if you absolutely have to; otherwise migrate by subject area and group similar areas together.

Having said that, you start off by migrating objects in the database; that's one of the very first steps. It consists of migrating first the places where you can put the other objects into, that is owners and locations, which are usually schemas; then you extract tables and views, convert the object definitions and deploy them to Vertica. And mind that you shouldn't do it manually: never type what you can generate, automate whatever you can.

Users and roles: usually there are system tables in the old database that contain all the users and roles. You can export those to a file, reformat them, and then you have CREATE ROLE and CREATE USER scripts that you can apply to Vertica (a sketch of this follows below). If LDAP or Active Directory was used for authentication in the old database, Vertica supports anything within the LDAP standard. Catalogs and schemas should be relatively straightforward, with maybe sometimes one difference: Vertica does not restrict you by defining a schema as a collection of all objects owned by a user, but it supports it, it emulates it, for old times' sake. Vertica does not need the catalog; if you absolutely need a catalog name for the old tools that you use, it is always set to the name of the database in the case of Vertica.

Having now the schemas, the catalogs, the users and roles in place, move on to the data definition language, the DDL. If you are allowed to, it's best to use a tool that translates the data types in the generated DDL. You might have seen a mention of odb, which I will mention, by the way, several times in this presentation; we are very happy to have it. It can actually export the old database table definitions because it works over ODBC: it takes what the old database ODBC driver translates into ODBC data types and then uses internal translation tables to several target DBMS flavors, the most important of which is obviously Vertica. If they force you to use something else, there are always tools like SQL*Plus in Oracle, the SHOW TABLE command in Teradata, and so on; each DBMS should have a set of tools to extract the object definitions as they would be deployed in another instance of the same DBMS.
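As a sketch of that generate-don't-type idea: the statements below run against an Oracle-style catalog (DBA_USERS, DBA_ROLES and DBA_ROLE_PRIVS are the assumed source views here; substitute your platform's system tables) and produce ready-to-run statements for Vertica. Review the output before applying it.

```sql
-- Run against the OLD database; spool the output to a file,
-- clean it up, then execute that file with vsql against Vertica.
SELECT 'CREATE ROLE ' || role || ';'
  FROM dba_roles;

SELECT 'CREATE USER ' || username || ';'
  FROM dba_users
 WHERE username NOT IN ('SYS', 'SYSTEM');   -- skip technical accounts

SELECT 'GRANT ' || granted_role || ' TO ' || grantee || ';'
  FROM dba_role_privs;
```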
As for views, the view definition is usually also available in the old database catalog. One thing that may need a bit of special care is synonyms. Synonyms get emulated in different ways depending on the specific needs: either as a view on top of the view or table to be referred to, or with something that is really neat but that other databases don't have, the search path. The search path works very much like the PATH environment variable in Windows or Linux: you specify an object name without the schema name, and it is searched first in the first entry of the search path, then in the second, then in the third, which makes synonyms largely unneeded.

When you generate the DDL, to remain in the analogy of moving house, dust off and clean your stuff before placing it in the new house. If you see a table like the one here at the bottom, this is usually the corpse of a bad migration already done in the past: an ID is usually an integer and not almost a floating-point data type, a first name hardly ever has 256 characters, and if a column is called HIRE_DT it's not necessarily needed to store the second when somebody was hired. So take good care: while you are moving, dust off your stuff and use better data types.

The same applies especially to strings. How many bytes does a string of four euro signs contain? It's not four, it's actually twelve: in UTF-8, the way that Vertica encodes strings, an ASCII character is one byte but the euro sign takes three. That means that when you have a single-byte character set at the source, you very often have to pay attention and oversize the strings first, because otherwise data gets rejected or truncated, and then you have to carefully check what the best size is. The most promising approach is to initially dimension strings in multiples of the original length; again, odb, with the command option you see on the slide, will double the length of what would otherwise be single-byte characters and multiply by four the length of characters that are wide characters in traditional databases. Then load a representative sample of your source data, profile it using the tools that we also use ourselves to find the actual longest values, and then make the columns shorter.

Note that having too long and too big data types also hurts projection design, and we live and die with our projections. You might remember the rules on how default projections come to exist. The way we suggest to do it initially would be, just like for the profiling: load a representative sample of the data, collect a representative set of already known queries, and run the Vertica Database Designer. You don't have to decide immediately, you can always amend things later. Otherwise follow the laws of physics: avoid moving data back and forth across nodes and avoid heavy I/O if you design your projections initially by hand. Encoding matters, too. You know that the Database Designer is a very tight-fisted thing: it will optimize to use as little space as possible, and you have to consider that if you compress very well you might end up spending more time reading the data back. This is a test we ran once using several encoding types, and you can see that RLE, run-length encoding, on sorted data is barely even visible in the chart, while the others are considerably slower. You can get the slides and look at the numbers in detail; I won't go into detail here.
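To make the dust-off concrete, here is a minimal before-and-after sketch of such a table, with tighter data types and an explicit run-length encoding on a sorted, low-cardinality column. Column names and sizes are illustrative; verify the exact CREATE TABLE options against your Vertica version.

```sql
-- Before: the "corpse of a bad migration" style definition
-- CREATE TABLE emp (id NUMERIC(18,5), first_name VARCHAR(256), hire_dt TIMESTAMP);

-- After: types sized from profiling the real data
CREATE TABLE emp (
    id         INTEGER     NOT NULL,
    first_name VARCHAR(48),          -- remember: VARCHAR length is in bytes, so
                                     -- leave room for multi-byte UTF-8 characters
    hire_dt    DATE,                 -- nobody needs the second somebody was hired
    dept_code  CHAR(4) ENCODING RLE  -- cheap and fast when sorted on this column
)
ORDER BY dept_code, id;
```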
For BI migrations, you can usually expect 80% of everything to be lifted and shifted. You don't need most of the pre-aggregated tables, because we have live aggregate projections. Many BI tools have specialized query objects for the dimensions and the facts, and we have the possibility to use flattened tables, which are going to be talked about later; those you might have to write by hand. You will be able to switch off caching, because Vertica speeds up everything, and if you have worked with MOLAP cubes before, with live aggregate projections you very probably won't need them at all.

ETL tools: what you will have to do is, if you do it row by row against the old database, consider changing everything to very big transactions; and if you use INSERT statements with parameter markers, consider writing to named pipes and using Vertica's COPY command instead of mass inserts. Yes, the COPY command, that's what I have here.
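A minimal sketch of that pattern; the table, pipe path, delimiter and reject path are assumptions for illustration.

```sql
-- Instead of millions of single-row INSERTs from the ETL tool, have the job
-- write its rows to a named pipe (or file) and bulk-load them in one statement.
COPY sales.orders
FROM '/data/pipes/orders.pipe'              -- named pipe the ETL job writes into
DELIMITER '|'
NULL ''
REJECTED DATA '/data/rejects/orders.rej'    -- keep bad rows for inspection
DIRECT;                                     -- write straight to ROS storage
```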
Custom functionality: you can see on this slide that Vertica has the biggest number of functions in the database; we compare them regularly, and it is far ahead of any other database. You might find that many of the functions that you have written yourself won't be needed on the new database, so look at the Vertica catalog instead of trying to migrate a function that you don't need. Stored procedures are very often used in the old database to overcome shortcomings that Vertica doesn't have. Very rarely will you have to actually write a procedure that involves a loop; it's really, in our experience, very, very rare, and usually you can just switch to standard scripting. And this is basically repeating what Mauricio said, so in the interest of time I will skip it.

Look at this one here: most of the data warehouse migration tasks can and should be automated. You can automate DDL migration using odb, which is crucial. Data profiling is not crucial but game-changing, and the same is true for encoding; you can automate it using our Database Designer. The physical data model optimization in general is game-changing, so use the Database Designer. Use the old platform's tools to generate the SQL. Having no objects without their owners is crucial. And custom functions and procedures are only crucial if they embody the company's intellectual property; otherwise you can almost always replace them with something else. That's it from me for now.

>> Mauricio: Thank you, Marco. So we will now continue our presentation by talking about some of the Vertica data warehouse optimization techniques that we can implement in order to improve the general efficiency of the data warehouse. Let me start with a few simple messages. The first one is that you are supposed to optimize only if and when this is needed: in most cases, just a lift and shift from the old data warehouse to Vertica will provide you exactly the performance you were looking for, or even better, so in that case there is probably no real need to optimize anything. In case you want to optimize, or you need to optimize, then keep in mind some of the Vertica peculiarities: for example, implement deletes and updates in the Vertica way; use live aggregate projections in order to avoid, or better, to limit, the GROUP BY executions at query time; use flattened tables in order to avoid or limit joins; and then you can also implement some Vertica-specific extensions, for example time series analysis or machine learning, on top of your data. We will now start by reviewing the first of these rules: optimize if and when needed.

Well, if, when you migrate from the old data warehouse to Vertica without any optimization, the performance level is already okay, then probably your job is done. But if this is not the case, one very easy optimization technique that you can use is to ask Vertica itself to optimize the physical data model using the Vertica Database Designer. DBD, which is the Vertica Database Designer, has several interfaces; here I'm going to use what we call the DBD programmatic API, so basically SQL functions. In other databases you might need to hire experts to look at your data, your data warehouse, your table definitions, creating indexes or whatever; in Vertica, all you need is to run something as simple as six single SQL statements to get a very well optimized physical data model. You see that we start by creating a new design, then we add to the design the tables and the queries, the queries that we want to optimize, and we set our target: in this case we are tuning the physical data model in order to maximize query performance, and this is why we are using the query optimization objective in our statement. Another possible choice would be to tune in order to reduce storage, or a mix between tuning storage and tuning queries. Finally we ask Vertica to produce and deploy this optimized design, and in a matter of literally a few minutes what you get is a fully optimized physical data model. This is something very, very easy to implement.
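For reference, the six calls look roughly like this. The function names follow the Database Designer programmatic API in the Vertica documentation, but the design name, table pattern, file paths and exact signatures here are assumptions for the sketch; check the documentation for your version.

```sql
SELECT DESIGNER_CREATE_DESIGN('dw_design');
SELECT DESIGNER_ADD_DESIGN_TABLES('dw_design', 'public.*');            -- tables to consider
SELECT DESIGNER_ADD_DESIGN_QUERIES('dw_design', '/home/dbadmin/queries.sql', TRUE);
SELECT DESIGNER_SET_OPTIMIZATION_OBJECTIVE('dw_design', 'QUERY');      -- or 'LOAD' / 'BALANCED'
SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('dw_design',
       '/home/dbadmin/dw_design_ddl.sql',          -- generated projection DDL
       '/home/dbadmin/dw_design_deployment.sql');  -- deployment script
SELECT DESIGNER_DROP_DESIGN('dw_design');                              -- clean up the workspace
```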
Now, keep in mind some of the Vertica peculiarities. Vertica is very well tuned for load and query operations, and Vertica writes rows into ROS containers on disk. A ROS container is a group of files, and we will never, ever change the content of these files. The fact that the ROS container files are never modified is one of the Vertica peculiarities, and this approach lets us use minimal locks: we can run multiple load operations in parallel against the very same table, assuming we don't have a primary key or unique constraint enforced on the target table, because they will simply end up in different ROS containers. A SELECT in READ COMMITTED isolation requires no lock at all and can run concurrently with an INSERT...SELECT, because the SELECT will work on a snapshot of the catalog taken when the transaction starts; this is what we call snapshot isolation. And recovery, because we never change the ROS files, is very simple and robust. So we have a huge amount of advantages due to the fact that we never change the content of the ROS containers. But on the other side, deletes and updates require a little attention.

So, what about deletes? First, when you delete in Vertica you basically create a new object, a delete vector, which appears a bit later in the ROS or in memory, and this vector points to the data being deleted, so that when a query is executed Vertica will just ignore the rows listed in the delete vector. And it's not just about deletes: an update in Vertica consists of two operations, a delete and an insert, and a merge consists of either an insert or an update, which in turn is made of a delete plus an insert. So basically, if we tune how the delete works, we will also have tuned the update and the merge. What should we do in order to optimize deletes? Well, remember what we said: every time we delete, we actually create a new object, a delete vector. So avoid committing deletes and updates too often, to reduce the work for the mergeout and the other cleanup activities that run afterwards, and be sure that all the interested projections contain the columns used in the delete predicate; this lets Vertica access the projection directly, without having to go through the super projection in order to create the delete vector, and the delete will be much, much faster. Finally, another very interesting optimization technique is to try to segregate the update and delete operations from the query-intensive workload, in order to reduce lock contention, and this can be done using partition operations. This is exactly what I want to talk about now.

Here you have a typical data warehouse architecture: we have data arriving in a landing zone, where the data is loaded as it is from the data sources; then we have a transformation layer writing into a staging area, which in turn feeds the partitioned blocks of data in the green data structures we have at the end. Those green data structures at the end are the ones used by the data access tools when they run their queries. Sometimes we might need to change old data, for example because we have late-arriving records, or maybe because we want to fix some errors that originated in the source feeds. What we do in this case is just copy back the partition we want to change or adjust, from the green area at the end to the staging area, which is a very fast partition-copy operation. Then we run our updates, or our adjustment procedure, or whatever we need in order to fix the errors in the data, in the staging area, and at the very same time people continue to query the green data structures at the end, so we never have contention between the two operations. When the update in the staging area is completed, all we have to do is run a swap partition between the tables, in order to swap the data that we just finished adjusting in the staging zone into the query area, the green one at the end. This swap partition is very fast, it is an atomic operation, and basically all that happens is that we exchange the pointers to the data. This is a very, very effective technique, and a lot of customers use it.
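A sketch of that flow with Vertica's partition functions. Table names and the partition-key range are assumptions; both functions take a minimum and maximum value of the partition key.

```sql
-- 1. Copy the partitions to be corrected into the staging table;
--    queries keep running against the live table meanwhile.
SELECT COPY_PARTITIONS_TO_TABLE('dw.fact_sales', '2020-03-01', '2020-03-31',
                                'staging.fact_sales');

-- 2. Fix the data in the staging area (whatever adjustment is needed).
UPDATE staging.fact_sales SET amount = -amount WHERE amount < 0;
COMMIT;

-- 3. Atomically exchange the corrected partitions back into the live table.
SELECT SWAP_PARTITIONS_BETWEEN_TABLES('staging.fact_sales',
                                      '2020-03-01', '2020-03-31',
                                      'dw.fact_sales');
```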
So, why flattened tables and live aggregate projections? Basically, we use flattened tables and live aggregate projections to minimize or avoid joins, which is what flattened tables are used for, and GROUP BYs, which is what live aggregate projections are used for. Now, compared to traditional data warehouses, Vertica can store, process, aggregate and join orders of magnitude more data, because it is a true columnar database, and joins and GROUP BYs normally are not a problem at all: they run faster than in any traditional data warehouse. But there are still scenarios where the data sets are so big, and we are talking about petabytes of data, and growing so quickly, that we need something in order to boost GROUP BY and join performance. This is why you can use live aggregate projections to perform aggregations at load time and limit the need for GROUP BYs at query time, and flattened tables to combine information from different entities at load time and, again, avoid running joins at query time.

OK, so, live aggregate projections. At this point in time we can define live aggregate projections using four built-in aggregate functions, which are SUM, MIN, MAX and COUNT. Let's see how this works. Suppose that you have a normal table, in this case a table unit_sold with three columns, pid, ptime and quantity, which has been segmented in a given way. On top of this base table, which we call the anchor table, we create a projection, and you see that we create the projection using a SELECT that aggregates the data: we take the pid, we take the date portion of ptime, and we take the sum of quantity from the base table, grouping on the first two columns, so pid and the date portion of ptime. What happens in this case when we load data into the base table? All we have to do is load data into the base table. When we do, we will of course fill the base projections, and assuming we are running with K-safety 1 we will have two of them, and we will load the data into those two projections with all the detailed data we are loading into the table, so pid, ptime and quantity. But at the very same time, without having to do any particular operation or run any ETL procedure, we will also automatically get, in the live aggregate projection, the data pre-aggregated by pid and the date portion of ptime, with the sum of quantity in the column named total_quantity. This is something that we get for free, without having to run any specific procedure, and it is very, very efficient.

So the key concept is that the loading operation, from the DML point of view, is executed against the base table; we do not explicitly aggregate data and we don't have any procedure doing the aggregation. The aggregation is automatic, and Vertica brings the data into the live aggregate projection every time we load into the base table. You see the two SELECTs we have on this slide: those two SELECTs produce exactly the same result, so running SELECT pid, date, SUM(quantity) from the base table, or running a SELECT star from the live aggregate projection, will return exactly the same data. This is of course very useful, but what is much more useful, and we can observe this if we run an EXPLAIN, is that if we run the SELECT against the base table asking for the grouped data, what happens behind the scenes is that Vertica sees that there is a live aggregate projection with the data already aggregated during the loading phase, and rewrites your query to use the live aggregate projection. This happens automatically: you see here a query that ran a GROUP BY against unit_sold, and Vertica decided to rewrite it as something to be executed against the live aggregate projection, because this saves a huge amount of time and effort during the ETL cycle. And it is not just limited to the information you explicitly aggregated: another query, like a SELECT COUNT, and in general GROUP BYs compatible with the projection, will also take advantage of the live aggregate projection, and again this is something that happens automatically; you don't have to do anything to get this.
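A sketch of the unit_sold example; the segmentation clause and names are as I read them from the slides, and the pattern mirrors the live aggregate projection syntax in the Vertica documentation.

```sql
CREATE TABLE unit_sold (
    pid      INTEGER,
    ptime    TIMESTAMP,
    quantity INTEGER
)
SEGMENTED BY HASH(pid) ALL NODES;

-- Live aggregate projection: maintained automatically at load time.
CREATE PROJECTION total_quantity
AS
SELECT pid,
       ptime::DATE   AS pdate,
       SUM(quantity) AS total_quantity
FROM   unit_sold
GROUP  BY pid, ptime::DATE;

-- Loads only ever touch the anchor table.
INSERT INTO unit_sold VALUES (1, '2020-03-30 10:15:00', 5);
COMMIT;

-- Both of these return the same data; EXPLAIN on the first one shows the
-- optimizer rewriting it against the live aggregate projection.
SELECT pid, ptime::DATE, SUM(quantity) FROM unit_sold GROUP BY pid, ptime::DATE;
SELECT * FROM total_quantity;
```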
One thing that we have to keep very clear in mind is that what we store in the live aggregate projection is partially aggregated data. In this example we have two INSERTs: the first insert is inserting four rows and the second one is inserting five rows. For each of these inserts we will have a partial aggregation, because Vertica can never know, after the first insert, whether a second one will come, so Vertica calculates the aggregation of the data every time we run an insert. This is a key concept, and it also means that you can maximize the effectiveness of this technique by inserting large chunks of data. If you insert data row by row, the live aggregate projection technique is not very useful, because for every row that you insert you will have an aggregation, so the live aggregate projection will end up containing the same number of rows that you have in the base table. But if you insert large chunks of data every time, the number of aggregations that you end up with in the live aggregate structure is much smaller than the base data. You can see how this works by counting the number of rows that you have in a live aggregate projection: if you run the SELECT COUNT star from the live aggregate projection, the query on the left side, you will get four rows, but if you EXPLAIN this query you will see that it was reading six rows. This is because each of those two inserts actually inserted a few rows, three rows each, into the live aggregate projection. So, key concept: live aggregate projections keep partially aggregated data, and the final aggregation always happens at runtime.

Another object which is very similar to the live aggregate projection is what we call a top-K projection. We actually do not aggregate anything in a top-K projection; we just keep the last rows, or limit the amount of rows that we keep, using the LIMIT ... OVER (PARTITION BY ... ORDER BY ...) clause. In this case we create, on top of the base table, two top-K projections: one to keep the last quantity that has been sold and the other one to keep the max quantity. In both cases it is just a matter of ordering the data, in the first case using the ptime column and in the second case using quantity, and in both cases we fill the projection with just the last row. Again, this is something that happens automatically when we insert data into the base table. If we now run, after the insert, our SELECT against either the max quantity or the last quantity, we will get the very last values, and you see that we have far fewer rows in the top-K projections.
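A sketch of those two top-K projections, again on the unit_sold table; the projection names are assumptions.

```sql
-- Keep only the most recent sale per product ...
CREATE PROJECTION last_sale
AS
SELECT pid, ptime, quantity
FROM   unit_sold
LIMIT  1 OVER (PARTITION BY pid ORDER BY ptime DESC);

-- ... and the single largest sale per product.
CREATE PROJECTION max_sale
AS
SELECT pid, ptime, quantity
FROM   unit_sold
LIMIT  1 OVER (PARTITION BY pid ORDER BY quantity DESC);
```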
We said at the beginning that we can use four built-in functions, you might remember: MIN, MAX, SUM and COUNT. What if I want to create my own specific aggregation on top of this? Some of our customers have very specific needs in terms of live aggregate projections, and in this case you can code your own live aggregate projection with user-defined functions: you can create a user-defined transform function to implement any sort of complex aggregation while loading data. After you have implemented this UDTF, you can deploy it using the pre-pass approach, which basically means the data is aggregated at loading time, during the data ingestion, or the batch approach, which means the aggregation runs on top of the data afterwards.

Things to remember on live aggregate projections: they are limited to the built-in functions, again SUM, MAX, MIN and COUNT, but you can code your own UDTF, so you can do whatever you want. They can reference only one table. For Vertica versions before 9.3 it was impossible to update or delete on the anchor table; this limit has been removed in 9.3, so you can now update and delete data from the anchor table. A live aggregate projection will follow the segmentation of the GROUP BY expression, and in some cases the optimizer can decide to pick the live aggregate projection or not, depending on whether the aggregation is consistent or not. And remember that if we insert and commit every single row into the anchor table, then we will end up with a live aggregate projection that contains exactly the same number of rows; in that case, using the live aggregate projection or the base table would be the same.

OK, so this is one of the two fantastic techniques that we can implement in Vertica: the live aggregate projection, basically to avoid or limit GROUP BYs. The other one, which we are going to talk about now, is the flattened table, which we use in order to avoid the need for joins. Remember that Vertica is very fast at running joins, but when we scale up to petabytes of data we need a boost, and this is what we have in order to get this problem fixed regardless of the amount of data we are dealing with. So, what about flattened tables? Let me start with normalized schemas. Everybody knows what a normalized schema is, and there is nothing Vertica-related in this slide. The main scope of a normalized schema is to reduce data redundancy, and the fact that we reduce data redundancy is a good thing, because we obtain fast and small writes: we only have to write small chunks of data into the right tables. The problem with these normalized schemas is that when you run your queries, you have to put together the information that arrives from the different tables, and that requires running joins. Again, Vertica is normally very good at running joins, but sometimes the amount of data makes it not easy to deal with them, and joins are sometimes not easy to tune.

What happens in a normal, let's say traditional, data warehouse is that we denormalize the schemas, normally either manually or using an ETL tool. So basically we have on one side, in this slide on the left, the normalized schemas, where we get very fast writes, and on the other side the wide table, where we run all the joins and pre-aggregations in order to prepare the data for the queries. So we have fast writes on the left and fast reads on the right side of this slide, and the problem sits in the middle, because we push all the complexity into the middle, into the ETL that has to transform the normalized schema into the wide table. The way we normally implement this, either manually using procedures that we code ourselves or using an ETL tool, is that we have to code an ETL layer that runs the INSERT...SELECT reading from the normalized schema and writing into the wide table at the end, the one that is used by the data access tools to run the queries. This approach is costly, because of course someone has to code this ETL; it is slow, because someone has to execute those batches, normally overnight after loading the data, and maybe someone has to check the following morning that everything went okay with the batch; it is resource intensive, and it is also human intensive, because of the people who have to code and check the results; it is error prone, because it can fail; and it introduces latency, because there is a gap on the time axis between the time t0, when you load the data into the normalized schema, and the time t1, when the data is finally ready to be queried.

So what we do in Vertica to facilitate this process is to create flattened tables. With flattened tables, first, you avoid data redundancy, because you don't need the wide table in addition to the normalized schema on the left side. Second, it is fully automatic: you just have to insert the data into the wide table, and the ETL that you would have had to code is transformed into an INSERT...SELECT by Vertica, automatically; you don't have to do anything.
It is robust, and the latency is zero: as soon as you load the data into the wide table, you get all the joins executed for you. So let's have a look at how it works. In this case we have the table we are going to flatten, and basically we have to focus on two different clauses. You see that there is one column here, dimension value 1, which can be defined either with DEFAULT and then a SELECT, or with SET USING and a SELECT. The difference between DEFAULT and SET USING is when the data is populated: if we use DEFAULT, the data is populated as soon as we load the data into the base table; if we use SET USING, we will have to run a refresh. But everything is there: you don't need an ETL, you don't need to code any transformation, because everything is in the table definition itself, it comes for free, and of course with latency zero, so as soon as you load the other columns you will have the dimension value as well.

Let's see an example. Suppose we have a dimension table, the customer dimension, on the left side, and we have a fact table on the right. You see that the fact table uses columns like o_name or o_city, which are basically the result of a SELECT on top of the customer dimension; this is where the join is executed, as soon as we load data into the fact table, directly into the fact table, without of course loading the data that arrives from the dimension: all the data from the dimension will be populated automatically. So suppose that we are running this INSERT: as you can see, we are inserting directly into the fact table, and we are loading o_id, customer_id and total; we are not loading name or city. Name and city will be automatically populated by Vertica for you, because of the definition of the flattened table. That is all you need in order to have your wide table, your flattened table, built for you, and it means that at runtime you won't need any join between the fact table and the customer dimension that we used in order to calculate name and city, because the data is already there.

This was using DEFAULT. The other option is using SET USING; the concept is absolutely the same. You see that in this case, on the right side, we have basically replaced the o_name DEFAULT with o_name SET USING, and the same is true for city. The concept, as I said, is the same, but in this case, since we use SET USING, we have to refresh: you see that we have to run this SELECT REFRESH_COLUMNS and then the name of the table; in this case all columns will be refreshed, or you can specify only certain columns, and this will bring in the values for name and city, reading from the customer dimension. This technique is extremely useful.

The difference between DEFAULT and SET USING, just to summarize the most important point: DEFAULT populates your target when you load, SET USING when you refresh. And in some cases you might need to use them both: in some cases you might want to use both DEFAULT and SET USING. In this example here you see that we define o_name using both DEFAULT and SET USING, and this means that the data is populated either when we load the data into the base table or when we run the refresh.
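A sketch of the fact table from this example, with one derived column per mechanism. Table and column names are as I read them from the slides; the pattern follows Vertica's flattened table syntax.

```sql
CREATE TABLE customer_dimension (
    customer_id INTEGER PRIMARY KEY,
    c_name      VARCHAR(80),
    c_city      VARCHAR(80)
);

CREATE TABLE fact_orders (
    o_id        INTEGER,
    customer_id INTEGER,
    total       NUMERIC(12,2),
    -- populated at load time:
    o_name VARCHAR(80) DEFAULT   (SELECT c_name FROM customer_dimension c
                                  WHERE c.customer_id = fact_orders.customer_id),
    -- populated when REFRESH_COLUMNS is run:
    o_city VARCHAR(80) SET USING (SELECT c_city FROM customer_dimension c
                                  WHERE c.customer_id = fact_orders.customer_id)
);

-- Load only the raw fact columns; o_name is filled in immediately.
INSERT INTO fact_orders (o_id, customer_id, total) VALUES (1001, 42, 99.90);

-- Bring the SET USING column up to date on demand.
SELECT REFRESH_COLUMNS('fact_orders', 'o_city', 'REBUILD');
```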
This is a summary of the techniques that we can implement in Vertica in order to make our data warehouse even more efficient. And that, basically, is the end of our presentation. Thank you for listening, and now we are ready for the Q&A session.
End-to-End Security in Vertica
>> Paige: Hello everybody and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled End-to-End Security in Vertica. I'm Paige Roberts, Open Source Relations Manager at Vertica. I'll be your host for this session. Joining me is Vertica Software Engineers, Fenic Fawkes and Chris Morris. Before we begin, I encourage you to submit your questions or comments during the virtual session. You don't have to wait until the end. Just type your question or comment in the question box below the slide as it occurs to you and click submit. There will be a Q&A session at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Also, you can visit Vertica forums to post your questions there after the session. Our team is planning to join the forums to keep the conversation going, so it'll be just like being at a conference and talking to the engineers after the presentation. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And before you ask, yes, this whole session is being recorded and it will be available to view on-demand this week. We'll send you a notification as soon as it's ready. I think we're ready to get started. Over to you, Fen. >> Fenic: Hi, welcome everyone. My name is Fen. My pronouns are fae/faer and Chris will be presenting the second half, and his pronouns are he/him. So to get started, let's kind of go over what the goals of this presentation are. First off, no deployment is the same. So we can't give you an exact, like, here's the right way to secure Vertica because how it is to set up a deployment is a factor. But the biggest one is, what is your threat model? So, if you don't know what a threat model is, let's take an example. We're all working from home because of the coronavirus and that introduces certain new risks. Our source code is on our laptops at home, that kind of thing. But really our threat model isn't that people will read our code and copy it, like, over our shoulders. So we've encrypted our hard disks and that kind of thing to make sure that no one can get them. So basically, what we're going to give you are building blocks and you can pick and choose the pieces that you need to secure your Vertica deployment. We hope that this gives you a good foundation for how to secure Vertica. And now, what we're going to talk about. So we're going to start off by going over encryption, just how to secure your data from attackers. And then authentication, which is kind of how to log in. Identity, which is who are you? Authorization, which is now that we know who you are, what can you do? Delegation is about how Vertica talks to other systems. And then auditing and monitoring. So, how do you protect your data in transit? Vertica makes a lot of network connections. Here are the important ones basically. There are clients talk to Vertica cluster. Vertica cluster talks to itself. And it can also talk to other Vertica clusters and it can make connections to a bunch of external services. So first off, let's talk about client-server TLS. Securing data between, this is how you secure data between Vertica and clients. It prevents an attacker from sniffing network traffic and say, picking out sensitive data. Clients have a way to configure how strict the authentication is of the server cert. 
It's called the Client SSLMode and we'll talk about this more in a bit but authentication methods can disable non-TLS connections, which is a pretty cool feature. Okay, so Vertica also makes a lot of network connections within itself. So if Vertica is running behind a strict firewall, you have really good network, both physical and software security, then it's probably not super important that you encrypt all traffic between nodes. But if you're on a public cloud, you can set up AWS' firewall to prevent connections, but if there's a vulnerability in that, then your data's all totally vulnerable. So it's a good idea to set up inter-node encryption in less secure situations. Next, import/export is a good way to move data between clusters. So for instance, say you have an on-premises cluster and you're looking to move to AWS. Import/Export is a great way to move your data from your on-prem cluster to AWS, but that means that the data is going over the open internet. And that is another case where an attacker could try to sniff network traffic and pull out credit card numbers or whatever you have stored in Vertica that's sensitive. So it's a good idea to secure data in that case. And then we also connect to a lot of external services. Kafka, Hadoop, S3 are three of them. Voltage SecureData, which we'll talk about more in a sec, is another. And because of how each service deals with authentication, how to configure your authentication to them differs. So, see our docs. And then I'd like to talk a little bit about where we're going next. Our main goal at this point is making Vertica easier to use. Our first objective was security, was to make sure everything could be secure, so we built relatively low-level building blocks. Now that we've done that, we can identify common use cases and automate them. And that's where our attention is going. Okay, so we've talked about how to secure your data over the network, but what about when it's on disk? There are several different encryption approaches, each depends on kind of what your use case is. RAID controllers and disk encryption are mostly for on-prem clusters and they protect against media theft. They're invisible to Vertica. S3 and GCP are kind of the equivalent in the cloud. They also invisible to Vertica. And then there's field-level encryption, which we accomplish using Voltage SecureData, which is format-preserving encryption. So how does Voltage work? Well, it, the, yeah. It encrypts values to things that look like the same format. So for instance, you can see date of birth encrypted to something that looks like a date of birth but it is not in fact the same thing. You could do cool stuff like with a credit card number, you can encrypt only the first 12 digits, allowing the user to, you know, validate the last four. The benefits of format-preserving encryption are that it doesn't increase database size, you don't need to alter your schema or anything. And because of referential integrity, it means that you can do analytics without unencrypting the data. So again, a little diagram of how you could work Voltage into your use case. And you could even work with Vertica's row and column access policies, which Chris will talk about a bit later, for even more customized access control. Depending on your use case and your Voltage integration. We are enhancing our Voltage integration in several ways in 10.0 and if you're interested in Voltage, you can go see their virtual BDC talk. 
And then again, talking about roadmap a little, we're working on in-database encryption at rest. What this means is kind of a Vertica solution to encryption at rest that doesn't depend on the platform that you're running on. Encryption at rest is hard. (laughs) Encrypting, say, 10 petabytes of data is a lot of work. And once again, the theme of this talk is everyone has a different key management strategy, a different threat model, so we're working on designing a solution that fits everyone. If you're interested, we'd love to hear from you. Contact us on the Vertica forums. All right, next up we're going to talk a little bit about access control. So first off is how do I prove who I am? How do I log in? So, Vertica has several authentication methods. Which one is best depends on your deployment size/use case. Again, theme of this talk is what you should use depends on your use case. You could order authentication methods by priority and origin. So for instance, you can only allow connections from within your internal network or you can enforce TLS on connections from external networks but relax that for connections from your internal network. That kind of thing. So we have a bunch of built-in authentication methods. They're all password-based. User profiles allow you to set complexity requirements of passwords and you can even reject non-TLS connections, say, or reject certain kinds of connections. Should only be used by small deployments because you probably have an LDAP server, where you manage users if you're a larger deployment and rather than duplicating passwords and users all in LDAP, you should use LDAP Auth, where Vertica still has to keep track of users, but each user can then use LDAP authentication. So Vertica doesn't store the password at all. The client gives Vertica a username and password and Vertica then asks the LDAP server is this a correct username or password. And the benefits of this are, well, manyfold, but if, say, you delete a user from LDAP, you don't need to remember to also delete their Vertica credentials. You can just, they won't be able to log in anymore because they're not in LDAP anymore. If you like LDAP but you want something a little bit more secure, Kerberos is a good idea. So similar to LDAP, Vertica doesn't keep track of who's allowed to log in, it just keeps track of the Kerberos credentials and it even, Vertica never touches the user's password. Users log in to Kerberos and then they pass Vertica a ticket that says "I can log in." It is more complex to set up, so if you're just getting started with security, LDAP is probably a better option. But Kerberos is, again, a little bit more secure. If you're looking for something that, you know, works well for applications, certificate auth is probably what you want. Rather than hardcoding a password, or storing a password in a script that you use to run an application, you can instead use a certificate. So, if you ever need to change it, you can just replace the certificate on disk and the next time the application starts, it just picks that up and logs in. Yeah. And then, multi-factor auth is a feature request we've gotten in the past and it's not built-in to Vertica but you can do it using Kerberos. So, security is a whole application concern and fitting MFA into your workflow is all about fitting it in at the right layer. And we believe that that layer is above Vertica. If you're interested in more about how MFA works and how to set it up, we wrote a blog on how to do it. 
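A minimal sketch of the LDAP authentication setup described above. The record name, network range, host and bind DN parts are assumptions; see the Vertica documentation for the full parameter list.

```sql
-- Create an LDAP authentication record for connections from the internal network.
CREATE AUTHENTICATION v_ldap METHOD 'ldap' HOST '10.0.0.0/8';

ALTER AUTHENTICATION v_ldap SET
    host          = 'ldap://ldap.example.com',
    binddn_prefix = 'cn=',
    binddn_suffix = ',ou=users,dc=example,dc=com';

-- Attach it to everyone, or to specific users/roles instead of PUBLIC.
GRANT AUTHENTICATION v_ldap TO PUBLIC;
```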
And now, over to Chris, for more on identity and authorization. >> Chris: Thanks, Fen. Hi everyone, I'm Chris. So, we're a Vertica user and we've connected to Vertica but once we're in the database, who are we? What are we? So in Vertica, the answer to that questions is principals. Users and roles, which are like groups in other systems. Since roles can be enabled and disabled at will and multiple roles can be active, they're a flexible way to use only the privileges you need in the moment. For example here, you've got Alice who has Dbadmin as a role and those are some elevated privileges. She probably doesn't want them active all the time, so she can set the role and add them to her identity set. All of this information is stored in the catalog, which is basically Vertica's metadata storage. How do we manage these principals? Well, depends on your use case, right? So, if you're a small organization or maybe only some people or services need Vertica access, the solution is just to manage it with Vertica. You can see some commands here that will let you do that. But what if we're a big organization and we want Vertica to reflect what's in our centralized user management system? Sort of a similar motivating use case for LDAP authentication, right? We want to avoid duplication hassles, we just want to centralize our management. In that case, we can use Vertica's LDAPLink feature. So with LDAPLink, principals are mirrored from LDAP. They're synced in a considerable fashion from the LDAP into Vertica's catalog. What this does is it manages creating and dropping users and roles for you and then mapping the users to the roles. Once that's done, you can do any Vertica-specific configuration on the Vertica side. It's important to note that principals created in Vertica this way, support multiple forms of authentication, not just LDAP. This is a separate feature from LDAP authentication and if you created a user via LDAPLink, you could have them use a different form of authentication, Kerberos, for example. Up to you. Now of course this kind of system is pretty mission-critical, right? You want to make sure you get the right roles and the right users and the right mappings in Vertica. So you probably want to test it. And for that, we've got new and improved dry run functionality, from 9.3.1. And what this feature offers you is new metafunctions that let you test various parameters without breaking your real LDAPLink configuration. So you can mess around with parameters and the configuration as much as you want and you can be sure that all of that is strictly isolated from the live system. Everything's separated. And when you use this, you get some really nice output through a Data Collector table. You can see some example output here. It runs the same logic as the real LDAPLink and provides detailed information about what would happen. You can check the documentation for specifics. All right, so we've connected to the database, we know who we are, but now, what can we do? So for any given action, you want to control who can do that, right? So what's the question you have to ask? Sometimes the question is just who are you? It's a simple yes or no question. For example, if I want to upgrade a user, the question I have to ask is, am I the superuser? If I'm the superuser, I can do it, if I'm not, I can't. But sometimes the actions are more complex and the question you have to ask is more complex. Does the principal have the required privileges? 
If you're familiar with SQL privileges, there are things like SELECT, INSERT, and Vertica has a few of their own, but the key thing here is that an action can require specific and maybe even multiple privileges on multiple objects. So for example, when selecting from a table, you need USAGE on the schema and SELECT on the table. And there's some other examples here. So where do these privileges come from? Well, if the action requires a privilege, these are the only places privileges can come from. The first source is implicit privileges, which could come from owning the object or from special roles, which we'll talk about in a sec. Explicit privileges, it's basically a SQL standard GRANT system. So you can grant privileges to users or roles and optionally, those users and roles could grant them downstream. Discretionary access control. So those are explicit and they come from the user and the active roles. So the whole identity set. And then we've got Vertica-specific inherited privileges and those come from the schema, and we'll talk about that in a sec as well. So these are the special roles in Vertica. First role, DBADMIN. This isn't the Dbadmin user, it's a role. And it has specific elevated privileges. You can check the documentation for those exact privileges but it's less than the superuser. The PSEUDOSUPERUSER can do anything the real superuser can do and you can grant this role to whomever. The DBDUSER is actually a role, can run Database Designer functions. SYSMONITOR gives you some elevated auditing permissions and we'll talk about that later as well. And finally, PUBLIC is a role that everyone has all the time so anything you want to be allowed for everyone, attach to PUBLIC. Imagine this scenario. I've got a really big schema with lots of relations. Those relations might be changing all the time. But for each principal that uses this schema, I want the privileges for all the tables and views there to be roughly the same. Even though the tables and views come and go, for example, an analyst might need full access to all of them no matter how many there are or what there are at any given time. So to manage this, my first approach I could use is remember to run grants every time a new table or view is created. And not just you but everyone using this schema. Not only is it a pain, it's hard to enforce. The second approach is to use schema-inherited privileges. So in Vertica, schema grants can include relational privileges. For example, SELECT or INSERT, which normally don't mean anything for a schema, but they do for a table. If a relation's marked as inheriting, then the schema grants to a principal, for example, salespeople, also apply to the relation. And you can see on the diagram here how the usage applies to the schema and the SELECT technically but in Sales.foo table, SELECT also applies. So now, instead of lots of GRANT statements for multiple object owners, we only have to run one ALTER SCHEMA statement and three GRANT statements and from then on, any time that you grant some privileges or revoke privileges to or on the schema, to or from a principal, all your new tables and views will get them automatically. So it's dynamically calculated. Now of course, setting it up securely, is that you want to know what's happened here and what's going on. So to monitor the privileges, there are three system tables which you want to look at. The first is grants, which will show you privileges that are active for you. 
That is your user and active roles and theirs and so on down the chain. Grants will show you the explicit privileges and inherited_privileges will show you the inherited ones. And then there's one more inheriting_objects which will show all tables and views which inherit privileges so that's useful more for not seeing privileges themselves but managing inherited privileges in general. And finally, how do you see all privileges from all these sources, right? In one go, you want to see them together? Well, there's a metafunction added in 9.3.1. Get_privileges_description which will, given an object, it will sum up all the privileges for a current user on that object. I'll refer you to the documentation for usage and supported types. Now, the problem with SELECT. SELECT let's you see everything or nothing. You can either read the table or you can't. But what if you want some principals to see subset or a transformed version of the data. So for example, I have a table with personnel data and different principals, as you can see here, need different access levels to sensitive information. Social security numbers. Well, one thing I could do is I could make a view for each principal. But I could also use access policies and access policies can do this without introducing any new objects or dependencies. It centralizes your restriction logic and makes it easier to manage. So what do access policies do? Well, we've got row and column access policies. Rows will hide and column access policies will transform data in the row or column, depending on who's doing the SELECTing. So it transforms the data, as we saw on the previous slide, to look as requested. Now, if access policies let you see the raw data, you can still modify the data. And the implication of this is that when you're crafting access policies, you should only use them to refine access for principals that need read-only access. That is, if you want a principal to be able to modify it, the access policies you craft should let through the raw data for that principal. So in our previous example, the loader service should be able to see every row and it should be able to see untransformed data in every column. And as long as that's true, then they can continue to load into this table. All of this is of course monitorable by a system table, in this case access_policy. Check the docs for more information on how to implement these. All right, that's it for access control. Now on to delegation and impersonation. So what's the question here? Well, the question is who is Vertica? And that might seem like a silly question, but here's what I mean by that. When Vertica's connecting to a downstream service, for example, cloud storage, how should Vertica identify itself? Well, most of the time, we do the permissions check ourselves and then we connect as Vertica, like in this diagram here. But sometimes we can do better. And instead of connecting as Vertica, we connect with some kind of upstream user identity. And when we do that, we let the service decide who can do what, so Vertica isn't the only line of defense. And in addition to the defense in depth benefit, there are also benefits for auditing because the external system can see who is really doing something. It's no longer just Vertica showing up in that external service's logs, it's somebody like Alice or Bob, trying to do something. One system where this comes into play is with Voltage SecureData. So, let's look at a couple use cases. 
The first one, I'm just encrypting for compliance or anti-theft reasons. In this case, I'll just use one global identity to encrypt or decrypt with Voltage. But imagine another use case, I want to control which users can decrypt which data. Now I'm using Voltage for access control. So in this case, we want to delegate. The solution here is on the Voltage side, give Voltage users access to appropriate identities and these identities control encryption for sets of data. A Voltage user can access multiple identities like groups. Then on the Vertica side, a Vertica user can set their Voltage username and password in a session and Vertica will talk to Voltage as that Voltage user. So in the diagram here, you can see an example of how this is leveraged so that Alice could decrypt something but Bob cannot. Another place the delegation paradigm shows up is with storage. So Vertica can store and interact with data on non-local file systems. For example, HDFS or S3. Sometimes Vertica's storing Vertica-managed data there. For example, in Eon mode, you might store your projections in communal storage in S3. But sometimes, Vertica is interacting with external data. For example, this usually maps to a user storage location on the Vertica side and it might, on the external storage side, be something like Parquet files on Hadoop. And in that case, it's not really Vertica's data and we don't want to give Vertica more power than it needs, so let's request the data on behalf of who needs it. Let's say I'm an analyst and I want to copy from or export to Parquet, using my own bucket. It's not Vertica's bucket, it's my data. But I want Vertica to manipulate data in it. So the first option I have is to give Vertica as a whole access to the bucket and that's problematic because in that case, Vertica becomes kind of an AWS god. It can see any bucket, any Vertica user might want to push or pull data to or from any time Vertica wants. So it's not good for the principles of least access and zero trust. And we can do better than that. So in the second option, use an ID and secret key pair for an AWS IAM principal, if you're familiar, that does have access to the bucket. So I might use my, the analyst's, credentials, or I might use credentials for an AWS role that has even fewer privileges than I do. Sort of a restricted subset of my privileges. And then I use that. I set it in Vertica at the session level and Vertica will use those credentials for the copy and export commands. And it gives more isolation. Something that's in the works is support for keyless delegation, using assumable IAM roles. So similar benefits to option two here, but also not having to manage keys at the user level. We can do basically the same thing with Hadoop and HDFS with three different methods. So the first option is Kerberos delegation. I think it's the most secure. It definitely, if access control is your primary concern here, this will give you the tightest access control. The downside is it requires the most configuration outside of Vertica with Kerberos and HDFS but with this, you can really determine which Vertica users can talk to which HDFS locations. Then, you've got secure impersonation. If you've got a highly trusted Vertica userbase, or at least some subset of it is, and you're not worried about them doing things wrong but you want to know about auditing on the HDFS side, that's your primary concern, you can use this option. This diagram here gives you a visual overview of how that works. But I'll refer you to the docs for details.
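A sketch of option two above: per-session AWS credentials, so Vertica talks to the analyst's own bucket with the analyst's (or a more restricted IAM principal's) keys rather than one cluster-wide identity. The bucket name and keys are placeholders, and the session parameter names have shifted between Vertica versions, so take this as an assumption to verify against your release.

```sql
-- Scope the credentials to this session only; they are not shared cluster-wide.
ALTER SESSION SET AWSAuth = 'AKIAEXAMPLEKEYID:exampleSecretAccessKey';
ALTER SESSION SET AWSRegion = 'us-east-1';

-- Copy Parquet data in from the analyst's own bucket...
COPY analytics.events FROM 's3://analyst-bucket/events/*.parquet' PARQUET;

-- ...or export a result set back out to it.
EXPORT TO PARQUET (directory = 's3://analyst-bucket/exports/events_2020')
AS SELECT * FROM analytics.events WHERE event_date >= '2020-01-01';
```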
And then finally, option three, this is bringing your own delegation token. It's similar to what we do with AWS. We set something at the session level, so it's very flexible. The user can do it on an ad hoc basis, but it is manual, so that's the third option. Now on to auditing and monitoring. So of course, we want to know, what's happening in our database? It's important in general and important for incident response, of course. So your first stop, to answer this question, should be system tables. And they're a collection of information about events, system state, performance, et cetera. They're SELECT-only tables, but they work in queries as usual. The data is just loaded differently. So there are two types generally. There are the metadata tables, which store persistent information, or rather reflect persistent information stored in the catalog, for example, users or schemata. Then there are monitoring tables, which reflect more transient information, like events, system resources. Here you can see an example of output from the resource pools system table which, despite that it looks like system statistics, these are actually configurable parameters. If you're interested in resource pools, a way to handle users' resource allocation and various principals' resource allocation, again, check that out in the docs. Then of course, there's the followup question, who can see all of this? Well, some system information is sensitive and we should only show it to those who need it. Principle of least privilege, right? So of course the superuser can see everything, but what about non-superusers? How do we give access to people that might need additional information about the system without giving them too much power? One option's SYSMONITOR, as I mentioned before, it's a special role. And this role can always read system tables but not change things like a superuser would be able to. Just reading. And another option is the RESTRICT and RELEASE metafunctions. Those grant and revoke access to a certain set of system tables, to and from the PUBLIC role. But the downside of those approaches is that they're inflexible. They're all or nothing, for a specific preset of tables. And you can't really configure it per table. So if you're willing to do a little more setup, then I'd recommend using your own grants and roles. System tables support GRANT and REVOKE statements just like any regular relations. And in that case, I wouldn't even bother with SYSMONITOR or the metafunctions. So to do this, just grant whatever privileges you see fit to roles that you create. Then go ahead and grant those roles to the users that you want. And revoke access to the system tables of your choice from PUBLIC. If you need even finer-grained access than this, you can create views on top of system tables. For example, you can create a view on top of the users system table which only shows the current user's information, using a built-in function as part of the view definition. And then, you can actually grant this to PUBLIC, so that each user in Vertica could see their own user's information and never give access to the users system table as a whole, just that view. Now if you're a superuser or if you have direct access to nodes in the cluster, filesystem/OS, et cetera, then you have more ways to see events. Vertica supports various methods of logging.
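One possible shape of the roll-your-own approach described above, with a custom role in place of SYSMONITOR or the RESTRICT/RELEASE metafunctions. The role name, the choice of tables, and the filtered view are examples rather than something prescribed by the talk.

```sql
-- A role that can read just the monitoring tables we choose to expose.
CREATE ROLE log_readers;
GRANT SELECT ON v_monitor.sessions TO log_readers;
GRANT SELECT ON v_monitor.query_requests TO log_readers;
GRANT log_readers TO alice;

-- Take the same tables away from everyone who hasn't been given the role.
REVOKE SELECT ON v_monitor.sessions FROM PUBLIC;
REVOKE SELECT ON v_monitor.query_requests FROM PUBLIC;

-- Finer grained still: a view over the USERS system table that only shows
-- the caller's own row, which can then safely be granted to PUBLIC.
CREATE VIEW my_user_info AS
SELECT * FROM v_catalog.users WHERE user_name = CURRENT_USER;
GRANT SELECT ON my_user_info TO PUBLIC;
```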
You can see a few methods here which are generally outside of running Vertica, you'd interact with them in a different way, with the exception of active events which is a system table. We've also got the data collector. And that sorts events by subjects. So what the data collector does, it extends the logging and system table functionality, by the component, is what it's called in the documentation. And it logs these events and information to rotating files. For example, AnalyzeStatistics is a function that could be of use by users and as a database administrator, you might want to monitor that so you can use the data collector for AnalyzeStatistics. And the files that these create can be exported into a monitoring database. One example of that is with the Management Console Extended Monitoring. So check out their virtual BDC talk. The one on the management console. And that's it for the key points of security in Vertica. Well, many of these slides could spawn a talk on their own, so we encourage you to check out our blog, check out the documentation and the forum for further investigation and collaboration. Hopefully the information we provided today will inform your choices in securing your deployment of Vertica. Thanks for your time today. That concludes our presentation. Now, we're ready for Q&A.
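Following the AnalyzeStatistics example above, inspecting and tuning the data collector usually comes down to a few metafunction calls along these lines. The component name is taken from the talk, but the retention values are arbitrary and the exact function signatures are an assumption to double-check in the documentation.

```sql
-- See which data collector components exist and what they record.
SELECT component, description FROM data_collector WHERE component ILIKE '%statistics%';

-- Check how much memory and disk the AnalyzeStatistics component may use for its logs.
SELECT GET_DATA_COLLECTOR_POLICY('AnalyzeStatistics');

-- Adjust the retention policy (placeholder values, in kilobytes).
SELECT SET_DATA_COLLECTOR_POLICY('AnalyzeStatistics', '500', '10000');
```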
SUMMARY :
in the question box below the slide as it occurs to you So for instance, you can see date of birth encrypted and the question you have to ask is more complex.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Chris | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Chris Morris | PERSON | 0.99+ |
second option | QUANTITY | 0.99+ |
Vertica | ORGANIZATION | 0.99+ |
Paige Roberts | PERSON | 0.99+ |
two types | QUANTITY | 0.99+ |
first option | QUANTITY | 0.99+ |
three | QUANTITY | 0.99+ |
Alice | PERSON | 0.99+ |
second approach | QUANTITY | 0.99+ |
Paige | PERSON | 0.99+ |
third option | QUANTITY | 0.99+ |
AWS' | ORGANIZATION | 0.99+ |
today | DATE | 0.99+ |
Today | DATE | 0.99+ |
first approach | QUANTITY | 0.99+ |
second half | QUANTITY | 0.99+ |
each service | QUANTITY | 0.99+ |
Bob | PERSON | 0.99+ |
10 petabytes | QUANTITY | 0.99+ |
Fenic | PERSON | 0.99+ |
first | QUANTITY | 0.99+ |
first source | QUANTITY | 0.99+ |
first one | QUANTITY | 0.99+ |
Fen | PERSON | 0.98+ |
S3 | TITLE | 0.98+ |
One system | QUANTITY | 0.98+ |
first objective | QUANTITY | 0.98+ |
each user | QUANTITY | 0.98+ |
First role | QUANTITY | 0.97+ |
each principal | QUANTITY | 0.97+ |
4/2 | DATE | 0.97+ |
each | QUANTITY | 0.97+ |
both | QUANTITY | 0.97+ |
Vertica | TITLE | 0.97+ |
First | QUANTITY | 0.97+ |
one | QUANTITY | 0.96+ |
this week | DATE | 0.95+ |
three different methods | QUANTITY | 0.95+ |
three system tables | QUANTITY | 0.94+ |
one thing | QUANTITY | 0.94+ |
Fenic Fawkes | PERSON | 0.94+ |
Parquet | TITLE | 0.94+ |
Hadoop | TITLE | 0.94+ |
One example | QUANTITY | 0.93+ |
Dbadmin | PERSON | 0.92+ |
10.0 | QUANTITY | 0.92+ |
Peter Smails, Datos IO | CUBE Conversation with John Furrier
(light orchestral music) >> Hello, everyone, and welcome to the Cube Conversation here at the Palo Alto studios for theCUBE. I'm John Furrier, the co-founder of SiliconANGLE Media. We're here for some news analysis with Peter Smails, the CMO of Datos.IO D-a-t-o-s dot I-O. Hot new start up with some news. Peter was just here for a thought leader segment with Chris Cummings talking about the industry breakdown. But the news is hot, prior to re:Invent which you will be at? >> Absolutely. >> RecoverX is the product. 2.5, it's a release. So, you've got a point release on your core product. >> Correct. >> Welcome to this conversation. >> Thanks for having me. Yeah, we're excited to share the news. Big day for us. >> All right, so let's get into the hard news. You guys are announcing a point release of the latest product which is your core flagship, RecoverX. >> Correct. >> Love the name. Love the branding of the X in there. It reminds me of the iPhone, so makes me wanna buy one. But you know ... >> We can make that happen, John. >> You guys are the X Factor. So, we've been pretty bullish on what you guys are doing. Obviously, like the positioning. It's cloud. You're taking advantage of the growth in the cloud. What is this new product release? Why? What's the big deal? What's in it for the customer? >> So, I'll start with the news, and then we'll take a small step back and sort of talk about why exactly we're doing what we're doing. So, RecoverX 2.5 is the latest in our flagship RecoverX line. It's a cloud data management platform. And the market that we're going after and the market we're disrupting is the traditional data management space. The proliferation of modern applications-- >> John: Which includes which companies? >> So, the Veritas' of the world, the Commvault's of the world, the Dell EMC's of the world. Anybody that was in the traditional-- >> 20-year-old architected data backup and recovery software. >> You stole my fun fact. (laughs) But very fair point which is that the average age approximately of the leading backup and recovery software products is approximately 20 years. So, a lot's changed in the last 20 years, not the least of which has been this proliferation of modern applications, okay? Which are geo-distributed microservices oriented and the rapid proliferation of multicloud. That disrupts that traditional notion of data management specifically backup and recovery. That's what we're going after with RecoverX. RecoverX 2.5 is the most recent version. News on three fronts. One is on our advanced recovery, and we can double-click into those. But it's essentially all about giving you more data awareness, more granularity to what data you wanna recover and where you wanna put it, which becomes very important in the multicloud world. Number two is what we call data center aware backup and recovery. That's all about supporting geo-distributed application environments, which again, is the new normal in the cloud. And then number three is around enterprise hardening, specifically around security. So, it's all about us increased flexibility and new capabilities for the multicloud environment and continue to enterprise-harden the product. >> Okay, so you guys say significant upgrade. >> Peter: Yep. >> I wanna just look at that. I'm also pretty critical, and you know how I feel on this so don't take it personal, multicloud is not a real deal yet. It's in statement of value that customers are saying-- It's coming! But cloud is here today, regular cloud. So, multicloud ... 
Well, what is multicloud actually mean? I mean, I can have multiple clouds but I'm not actually moving workloads across clouds, yet. >> I disagree. >> Okay. >> I actually disagree. We have multiple customers. >> All right, debunk that. >> I will debunk that. Number one use case for RecoverX is backup and recovery. But with a twist of the fact that it's for these modern applications running these geo-distributed environments. Which means it's not about backing up my data center, it's about, I need to make a copy of my data but I wanna back it up in the cloud. I'm running my application natively in the cloud, so I want a backup in the cloud. I'm running my application in the cloud but I actually wanna backup from the cloud back to my private cloud. So, that in lies a backup and recovery, and operation recovery use case that involves multicloud. That's number one. Number two use case for RecoverX is what we talk about on data mobility. >> So, you have a different definition of multicloud. >> Sorry, what was your-- Our definition of multicloud is fundamentally a customer using multiple clouds, whether it be a private on-prem GCP, AWS, Oracle, any mix and match. >> I buy that. I buy that. Where I was getting critical of was a workload. >> Okay. >> I have a workload and I'm running it on Amazon. It's been architected for Amazon. Then I also wanna run that same workload on Azure and Google. >> Okay. >> Or Oracle or somewhere else. >> Yep. >> I have to re-engineer it (laughs) to move, and I can't share the data. So, to me what multicloud means, I can run it anywhere. My app anywhere. Backup is a little bit different. You're saying the cloud environments can be multiple environments for your solution. >> That is correct. >> So, you're looking at it from the other perspective. >> Correct. The way we define ourselves is application-centric data management. And what that essentially means is we don't care what the underlying infrastructure is. So, if you look at traditional backup and recovery products they're LUN-based. So, I'm going to backup my storage LUN. Or they're VM-based. And a lot of big companies made a lot of money doing that. The problem is they are no LUN's and VM's in hybrid cloud or multicloud environment. The only thing that's consistent across application, across cloud-environments is the data and the applications that are running. Where we focus is we're 100% application-centric. So, we integrate at the database level. The database is the foundation of any application you create. We integrate there, which makes us agnostic to the underlying infrastructure. We run, just as examples, we have customers running next generation applications on-prem. We have customers running next generation applications on AWS in GCP. Any permutation of the above, and to your point about back to the multicloud we've got organizations doing backup with us but then we also have organizations using us to take copies of their backup data and put them on whatever clouds they want for things like test and refresh. Or performance testing or business analytics. Whatever you might wanna do. >> So, you're pretty flexible. I like that. So, we talked before on other segments, and certainly even this morning about modern stacks. >> Yeah. >> Modern applications. This is the big to-do item for all CXOs and CIOs. I need a modern infrastructure. I need modern applications. I need modern developers. I need modern everything. Hyper, micro, ultra. >> Whatever buzz word you use. 
>> But you guys in this announcement have a couple key things I wanna just get more explanation on. One, advanced recovery, backup anywhere, recover anywhere, and you said enterprise-grade security is the third thing. >> Yep. >> So, let's just break them down one at a time. Advanced recovery for Datos 2.5, RecoverX 2.5. >> Yep. >> What is advanced recovery? >> It's very specifically about providing high levels of granularity for recovering your data, on two fronts. So, the use case is, again, backup. I need to recover data. But I don't wanna necessarily recover everything. I wanna get smarter about the data I wanna recover. Or it could be for non-operational use cases, which is I wanna spin up a copy of data to run test dev or to do performance testing on. What advanced recovery specifically means is number one, we've introduced the notion of queryble recovery. And what that means is that I can say things like star dot John star. And the results returning from that, because we're application-centric, and we integrated the database, we give you visibility to that. I wanna see everything star dot John star. Or I wanna recover data from a very specific row, in a very specific column. Or I want to mask data that I do not wanna be recovered and I don't want people to see. The implications of that are think about that from a performance standpoint. Now, I only recover the data I need. So, I'm very, very high levels of granularity based upon a query. So, I'm fast from an RTO standpoint. The second part of it is for non-operational requirements I only move the data that is select to that data set. And number three is it helps you with things like GDPR compliance and PII compliance because you can mask data. So, that's query-based recovery. That's number one. The second piece of advanced recovery is what we call incremental recovery. That is granular recovery based upon a time stamp. So, you can get within individual points in time. So, you can get to a very high level of granularity based upon time. So, it's all about visibility. It's your data and getting very granular in a smart way to what you wanna recover. So, if I kind of hear what you're saying, what you're saying is essentially you built in the operational effectiveness of being effective operationally. You know, time to backup recovery, all that good RTO stuff. Restoring stuff operationally >> Peter: Very quickly. >> very fast. >> Peter: In a smart way. >> So, there's a speed game there which is table stakes. But you're real value here is all these compliance nightmares that are coming down the pike, GDPR and others. There's gonna be more. >> Peter: Absolutely. I mean, it could be HIPPA, it could be GDPR, anything that involves-- >> Policy. >> Policies. Anything that requires, we're completely policy-driven. And you can create a policy to mask certain data based upon the criteria you wanna put in. So, it's all about-- >> So you're the best of performance, and you got some tunability. >> And it's all about being data aware. It's all about being data aware. So, that's what advanced recovery is. >> Okay, backup anywhere, recover anywhere. What does that mean? >> So, what that means is the old world of backup and recovery was I had a database running in my data center. And I would say database please take a snapshot of yourself so I can make a copy. 
The new world of cloud is that these microservices-based modern applications typically run, they're by definition distributed, And in many cases they run distributed across they're geo-distributed. So, what data center aware backup and recovery is, use a perfect example. We have a customer. They're running their eCommerce. So, leading online restaurant reservations company. They're running their eCommerce application on-prem, interestingly enough, but it's based on Cassandra distributed database. Excuse me, MongoDB. Sorry. They're running geo-distributed, sharded MongoDB clusters. Anybody in the traditional backup and recovery their head would explode when you say that. In the modern application world, that's a completely normal use case. They have a data center in the U.S. They have a data center in the U.K. What they want is they wanna be able to do local backup and recovery while maintaining complete global consistency of their data. So again, it's about recovery time ultimately but it's also being data aware and focusing only on the data that you need to backup and recovery. So, it's about performance but then it's also about compliance. It's about governance. That's what data center aware backup is. >> And that's a global phenomenon people are having with the GO. >> Absolutely. Yeah, you could be within country. It could be any number of different things that drive that. We can do it because we're data aware-- >> And that creates complexity for the customer. You guys can take that complexity away >> Correct. >> From the whole global, regional where the data can sit. >> Correct. I'd say two things actually. To give the customers credit, the customers building these apps or actually getting a lot smarter about what they're data is and where they're data is. >> So they expect this feature? >> Oh, absolutely. Absolutely. I wouldn't call it table stakes cause we're the only kids on the block that can do it. But this is in direct response to our customers that are building these new apps. I wanna get into some of the environmental and customer drivers in a second. I wanna nail the last segment down. Cause I wanna unpack the whole why is this trend happening? What's the gestation period? What's the main enabler for you? But okay, final point on the significant announcements. My favorite topic enterprise-grade security. What the hell does that mean? First of all, from your standpoint the industry's trying to solve the same thing. Enterprise-grade security, what are you guys providing in this? >> Number one, it's basically security protocol. So, TLS and SSL. This is weed stuff. TLS, SSL, so secure protocol support. It's integration with LDAP. So, if organizations are running, primarily if they're running on-prem and they're running in an LDAP environment, we're support there. And then we've got Kerberos support for Kerberos authentication. So, it's all about just checking the boxes around the different security >> So, this is like in between >> and transport protocol. >> the toes, the details around compliance, identity management. >> Peter: Bingo. >> I mean we just had Centrify's CyberConnect conference, and you're seeing a lot of focus on identity. >> Absolutely. And the reason that that's sort of from a market standpoint the reason that these are very important now is because the applications that we're supporting these are not science experiments. These are eCommerce applications. 
These are core business applications that mainstream enterprises are running, and they need to be protected and they're bringing the true, classic enterprise security, authentication, authorization requirements to the table. >> Are you guys aligning with those features? Or is there anything significant in that section? >> From an enterprise security standpoint? It's primarily about we provide the support, so we integrate with all of those environments and we can check the boxes. Oh, absolutely TLS. Absolutely, we've got that box checked because-- >> So, you're not competing with other cybersecurity? >> No, this is purely we need to do this. This is part of our enterprise-- >> This is where you partner. >> Peter: Well, no. For these things it's literally just us providing the protocol support. So, LDAP's a good example. We support LDAP. So, we show up and if somebody's using my data management-- >> But you look at the other security solutions as a way to integrate with? >> Yeah. >> Not so much-- >> Absolutely, no. This has nothing to do with the competition. It's just supporting ... I mean Google has their own protocol, you know, security protocols, so we support those. So, does Amazon. >> I really don't want to go into the customer benefits. We'll let the folks go to the Datos website, d-a-t-o-s dot i-o is the website, if you wanna check out all their customer references. I don't wanna kind of drill on that. I kind of wanna really end this segment on the real core issue for me is reading the tea leaves. You guys are different. You're now kind of seeing some traction and some growth. You're a new kind of animal in the zoo, if you will. (Peter laughs) You've got a relevant product. Why is it happening now? And I'm trying to get to understanding Cloud Oss is enabling a lot of stuff. You guys are an effect of that, a data point of what the cloud is enabled as a venture. Everything that you're doing, the value you create is the function of the cloud. >> Yes. >> And how data is moving. Where's this coming from? Is it just recently? Is it a gestation period of a few years? Where did this come from? You mentioned some comparisons like Oracle. >> So, I'll answer that in sort of, we like to use history as our guide. So, I'll answer that both in macro terms, and then I'll answer it in micro terms. From a macro term standpoint, this is being driven by the proliferation of new data sources. It's the easiest way to look at it. So, if you let history be your guide. There was about a seven to eight year proliferation or gap between proliferation of Oracle as the primary traditional relational database data source and the advent of Veritas who really defined themselves as the defacto standard for traditional on-prem data center relational data management. You look at that same model, you'll look at the proliferation of VMware. In the late 90s, about a seven to eight year gestation with the rapid adoption of Veeam. You know the early days a lot of folks laughed at Veeam, like, "Who's gonna backup VMs? People aren't gonna use VMs in the enterprise. Now, you looked at Veeam, great company. They've done some really tremendous things carving out much more than a niche providing backup and recovery and availability in a VM-based environment. The exact same thing is happening now. If you go back six to seven years from now, you had the early adoption of the MongoDBs, the Cassandras, the Couches. More recently you've got a much faster acceleration around the DynamoDBs and the cloud databases. 
We're riding that same wave to support that. >> This is a side effect of the enabling of the growth of cloud. >> Yes. >> So, similar to what you did in VMware with VMs and database for Oracle you guys are taking it to the next level. >> These new data sources are completely driven by the fact that the cloud is enabling this completely distributed, far more agile, far more dynamic, far less expensive application deployment model, and a new way of providing data management is required. That's what we do. >> Yeah, I mean it's a function of maturity, one. As Jeff Rickard, General Manager of theCube, always says, when the industry moves to it's next point of failure, in this case failure is problem and you solve. So, the headaches that come from the awesomeness of the growth. >> Absolutely. And to answer that micro-wise briefly. So, that was the macro. The micro is the proliferation of, the movement from monolithic apps to microservices-based app, it's happening. And the cloud is what's enabling them. The move from traditional on-prem to hybrid cloud is absolutely happening. That's by definition the cloud. The third piece which is cloud-centric is the world's moving from a scale up world to an elastic-compute, elastic storage model. We call that the modern IT stack. Traditional backup and recovery, traditional data management doesn't work in the new modern IT stack. That's the market we're planning. That's the market we're disrupting is all that traditional stuff moving to the modern IT stack. >> Okay, Datos IO announcing a 2.5 release of RecoverX, their flagship product, their start up growing out of Los Gatos. Peter Smails here, the CMO. Where ya gonna be next? What's going on-- I know we're gonna see you re:Invent in a week in a half. >> Absolutely. So, we've got two stops. Well, actually the next stop on the tour is re:Invent. So, absolutely looking forward to being back on theCUBE at re:Invent. >> And the company feels good about those things are good. You've got good money in the bank. You're growing. >> We feel fantastic. It's fascinating to watch as things develop. The conversations we have now versus even six months ago. It's sort of the tipping point of people get it. You sort of explain, "Oh, yeah it's data management from modern applications. Are you deploying modern applications?" Absolutely. >> Share one example to end this segment on what you hear over and over again from customers that illuminates what you guys are about as a company, the DNA, the value preposition, and their impact on results and value for customers. >> So, I'll use a case study as an example. You know, we're the world's largest home improvement retailers. Old way, was they ran their multi-billion dollar eCommerce infrastructure. Running on IBM Db2 database. Running in their on-prem data center. They've moved their world. They're now running, they've re-architected their application. It's now completely microservices-based running on Cassandra, deployed 100% in Google cloud platform. And they did that because they wanted to be more agile. They wanted to be more flexible. It's a far more cost effective deployment model. They are all in on the cloud. And they needed a next generation backup and recovery data protection, data management solution which is exactly what we do. So, that's the value. Backup's not a new problem. People need to protect data and they need to be able to take better advantage of the data. >> All right, so here's the final, final question. I'm a customer watching this video. 
Bottom line maybe, I'm kind of hearing all this stuff. When do I call you? What are the signals? What are the little smoke signals I see in my organization burning? When do I need to call you guys, Datos? >> You should call Datos IO anytime, if you're doing anything with development of modern applications, number one. If you're doing anything with hybrid cloud you should call us. Because you're gonna need to reevaluate your overall data management strategy it's that simple. >> All right, Peter Smails, the CMO of Datos, one of the hot companies here in Silicon Valley, out of Los Gatos, California. Of course, we're in Palo Alto at theCube Studios. I'm John Furrier. This is theCUBE conversation. Thanks for watching. (upbeat techno music)
SUMMARY :
But the news is hot, RecoverX is the product. Yeah, we're excited to share the news. of the latest product which is Love the branding of the X in there. What's in it for the customer? So, RecoverX 2.5 is the latest in So, the Veritas' of the world, data backup and recovery software. is that the average age Okay, so you guys and you know how I feel on I actually disagree. I'm running my application in the cloud So, you have a different Our definition of critical of was a workload. I have a workload and You're saying the cloud environments from the other perspective. The database is the foundation So, we talked before on other segments, This is the big to-do item security is the third thing. So, let's just break So, the use case is, again, backup. that are coming down the I mean, it could be And you can create a and you got some tunability. So, that's what advanced recovery is. What does that mean? the data that you need And that's a global phenomenon Yeah, you could be within country. complexity for the customer. From the whole global, the customers building these on the block that can do it. checking the boxes around the toes, the details I mean we just had Centrify's is because the applications and we can check the boxes. This is part of our enterprise-- providing the protocol support. So, does Amazon. You're a new kind of animal in the zoo, And how data is moving. and the advent of Veritas of the growth of cloud. So, similar to what you did that the cloud is enabling So, the headaches that come from We call that the modern IT stack. Peter Smails here, the CMO. on the tour is re:Invent. And the company feels good It's sort of the tipping as a company, the DNA, So, that's the value. All right, so here's the you should call us. Smails, the CMO of Datos,
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jeff Rickard | PERSON | 0.99+ |
ORGANIZATION | 0.99+ | |
Amazon | ORGANIZATION | 0.99+ |
Peter Smails | PERSON | 0.99+ |
Chris Cummings | PERSON | 0.99+ |
John Furrier | PERSON | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
Peter | PERSON | 0.99+ |
Palo Alto | LOCATION | 0.99+ |
100% | QUANTITY | 0.99+ |
Peter Smails | PERSON | 0.99+ |
John | PERSON | 0.99+ |
Veeam | ORGANIZATION | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
One | QUANTITY | 0.99+ |
Los Gatos | LOCATION | 0.99+ |
second part | QUANTITY | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
U.S. | LOCATION | 0.99+ |
eight year | QUANTITY | 0.99+ |
two fronts | QUANTITY | 0.99+ |
U.K. | LOCATION | 0.99+ |
SiliconANGLE Media | ORGANIZATION | 0.99+ |
third piece | QUANTITY | 0.99+ |
Palo Alto | LOCATION | 0.99+ |
second piece | QUANTITY | 0.99+ |
three fronts | QUANTITY | 0.99+ |
GDPR | TITLE | 0.99+ |
both | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
approximately 20 years | QUANTITY | 0.99+ |
iPhone | COMMERCIAL_ITEM | 0.98+ |
two things | QUANTITY | 0.98+ |
six | QUANTITY | 0.98+ |
theCube | ORGANIZATION | 0.98+ |
Los Gatos, California | LOCATION | 0.98+ |
two stops | QUANTITY | 0.98+ |
third thing | QUANTITY | 0.98+ |
Veritas | ORGANIZATION | 0.98+ |
IBM | ORGANIZATION | 0.98+ |
Cloud Oss | TITLE | 0.98+ |
late 90s | DATE | 0.98+ |
X Factor | TITLE | 0.98+ |
Dell EMC | ORGANIZATION | 0.98+ |
Number two | QUANTITY | 0.97+ |
First | QUANTITY | 0.97+ |
six months ago | DATE | 0.97+ |
one | QUANTITY | 0.96+ |
20-year-old | QUANTITY | 0.96+ |
MongoDB | TITLE | 0.94+ |
RecoverX | ORGANIZATION | 0.94+ |
Datos.IO | ORGANIZATION | 0.94+ |
number three | QUANTITY | 0.94+ |
one example | QUANTITY | 0.93+ |
RecoverX 2.5 | TITLE | 0.92+ |
multicloud | ORGANIZATION | 0.92+ |
seven years | QUANTITY | 0.9+ |
RecoverX | TITLE | 0.87+ |
multi-billion dollar | QUANTITY | 0.87+ |
Centrify | ORGANIZATION | 0.86+ |
a week in a half | QUANTITY | 0.86+ |
GCP | ORGANIZATION | 0.84+ |
2.5 | QUANTITY | 0.84+ |
Datos | ORGANIZATION | 0.83+ |
couple | QUANTITY | 0.83+ |
Kerberos | TITLE | 0.82+ |
re:Invent | EVENT | 0.82+ |
David McNeely, Centrify | CyberConnect 2017
(upbeat music) >> Narrator: Live from New York City It's theCUBE, covering CyberConnect 2017. Brought to you by Centrify and the Institute for Critical Infrastructure Technology. >> Hey, welcome back everyone. Live here in New York is theCUBE's exclusive coverage of Centrify's CyberConnect 2017, presented by Centrify. It's an industry event that Centrify is underwriting but it's really not a Centrify event, it's really where industry and government are coming together to talk about the best practices of architecture, how to solve the biggest crisis of our generation, and the computer industry that is security. I am John Furrier, with my co-host Dave Vellante. Next guest: David McNeely, who is the vice president of product strategy with Centrify, welcome to theCUBE. >> Great, thank you for having me. >> Thanks for coming on. I'm really impressed by Centrify's approach here. You're underwriting the event but it's not a Centrify commercial. >> Right >> This is about the core issues of the community coming together, and the culture of tech. >> Right. >> You are the product. You got some great props from the general on stage. You guys are foundational. What does that mean, when he said that Centrify could be a foundational element for solving this problem? >> Well, I think a lot of it has to do with if you look at the problems that people are facing, the breaches are misusing computers in order to use your account. If your account is authorized to still gain access to a particular resource, whether that be servers or databases, somehow the software and the systems that we put in place, and even some of the policies need to be retrofitted in order to go back and make sure that it really is a human gaining access to things, and not malware running around the network with compromised credentials. We've been spending a lot more time trying to help customers eliminate the use of passwords and try to move to stronger authentication. Most of the regulations now start talking about strong authentication but what does that really mean? It can't just be a one time passcode delivered to your phone. They've figured out ways to break into that. >> Certificates are being hacked and date just came out at SourceStory even before iStekNET's certificate authorities, are being compromised even before the big worm hit in what he calls the Atom Bomb of Malware. But this is the new trend that we are seeing is that the independent credentials of a user is being authentically compromised with the Equifax and all these breaches where all personal information is out there, this is a growth area for the hacks that people are actually getting compromised emails and sending them. How do you know it's not a fake account if you think it's your friend? >> Exactly. >> And that's the growth area, right? >> The biggest problem is trying to make sure that if you do allow someone to use my device here to gain access to my mail account, how do we make it stronger? How do we make sure that it really is David that is logged onto the account? If you think about it, my laptop, my iPad, my phone all authenticate and access the same email account and if that's only protected with a password then how good is that? How hard is it to break passwords? So we are starting to challenge a lot of base assumptions about different ways to do security because if you look at some of the tools that the hackers have their tooling is getting better all the time. >> So when, go ahead, sorry. finish your thoughts. 
Tools like their HashCat can break passwords. Like millions and millions a second. >> You're hacked, and basically out there. >> When you talk about eliminating passwords, you're talking about doing things other than just passwords, or you mean eliminating passwords? >> I mean eliminating passwords. >> So how does that work? >> The way that works is you have to have a stronger vetting process around who the person is, and this is actually going to be a challenge as people start looking at how do you vet a person? We ask them a whole bunch of questions: your mother's maiden name, where you've lived, other stuff that Equifax asked-- >> Yeah, yeah, yeah, everybody has. >> We ask you all of that information to find out, is it really you? But really the best way to do it now is going to be to go back to government issued IDs because they have a vetting process where they're establishing an identity for you. You've got a driver's license, we all have social security numbers, maybe a passport. That kind of information is really the only way to start making sure it really is me. This is where you start, and the next place is assigning a stronger credential. So there is a way to get a strong credential on to your mobile device. The issuance process itself generates the first key pair inside the device in a protected place, that can't be compromised because it is part of the hardware, part of the chip that runs the processes of the phone and that starts acting as strong as a smart card. In the government they call it derived credentials. It's kind of new technology, NIST has had documentation describing how to make that work for quite some time but actually implementing it and delivering it as a solution that can be used for authentication to other things is kind of new here. >> A big theme of your talk tomorrow is on designing this in, so with all of this infrastructure out there I presume you can't just bolt this stuff on and spread it in a peanut butter spread across, so how do we solve that problem? Is it just going to take time-- >> Well that's actually-- >> New infrastructure? Modernization? >> Dr. Ron Ross is going to be joining me tomorrow and he is from the NIST, and we will be talking with him about some of these security frameworks that they've created. There's the cyber security framework, there's also other guidance that they've created, the NIST 800-160, that describes how to start building security in from the very start. We actually have to back all the way up to the app developer and the operating system developers and get them to design security into the applications and also into the operating systems in such a way that you can trust the OS. Applications sitting on top of an untrusted operating system is not very good so the applications have to be sitting on top of trusted operating systems. Then we will probably get into a little bit of the newer technology. I am starting to find a lot of our customers that move to cloud based infrastructures, starting to move their applications into containers where there is a container around the application, and it's actually not bound so heavily to the OS. I can deploy as many of these app containers as I want and start scaling those out. >> So separate the workload from some of your infrastructure. You're kind of seeing that trend? >> Exactly and that changes a whole lot of the way we look at security. So now your security boundary is not the machine or the computer, it's now the application container. >> You are the product strategist.
You have the keys to the kingdom at Centrify, but we also heard today that it's a moving train, this business, it's not like you can lock into someone. Dave calls it the silver bullet and it's hard to get a silver bullet in security. How do you balance the speed of the game, the product strategy, and how do you guys deal with bringing customer solutions to the market that has an architectural scalability to it? Because that's the challenge. I am a slow enterprise, but I want to implement a product, I don't want to be obsolete by the time I roll it out. I need to have a scalable solution that can give me the head room and flexibility. So you're bringing a lot to the table. Explain what's going on in that dynamic. >> There's a lot of the, I try as much as possible to adhere to standards before they exist and push and promote those, like on the authentication side of things. For the longest time we used LDAP and Kerberos to authenticate computers to Active Directory. Now almost all of the web app developers are using SAML or OpenID Connect or OAuth2 as a mechanism for authenticating the applications. Just keeping up with standards like that is one of the best ways. That way the technologies and tools that we deliver just have APIs that the app developers can use and take advantage of. >> So I wanted to follow up on that because I was going to ask you. Isn't there a sort of organizational friction in that you've got companies, if you have to go back to the developers and the guys who are writing code around the OS, there's an incentive from up top to go for fast profits. Get to market as soon as you can. If I understand what you just said, if you are able to use open source standards, things like OAuth, that maybe could accelerate your time to market. Help me square that circle. Is there an inherent conflict between the desire to get short term profits versus designing in good security? >> It does take a little bit of time to design, build, and deliver products, but as we moved to cloud based infrastructure we are able to more rapidly deploy and release features. Part of having a cloud service, we update that every month. Every 30 days we have a new version of that rolling out that's got new capabilities in it. Part of adopting an agile delivery model, but everything we deliver also has an API so when we go back and talk to the customers and the developers at the customer organizations we have a rich set of APIs that the cloud service exposes. If they uncover a use case or a situation that requires something new or different that we don't have then that's when I go back to the product managers and engineering teams and talk about adding that new capability into the cloud service, which we can expect the monthly cadence helps me deliver that more rapidly to the market. >> So as you look at the bell curve in the client base, what's the shape of those that are kind of on the cutting edge and doing by definition, I shouldn't use the term cutting edge, but on the path to designing in as you would prescribe? What's it look like? Is it 2080? 199? >> That's going to be hard to put a number on. Most of the customers are covering the basics with respect to consolidating identities, moving to stronger authentication. I'm finding one of the areas that the more mature companies have adopted is this just in time notion where by default nobody has any rights to gain access to either systems or applications, and moving it to a workflow request access model.
So that's the one that's a little bit newer that fewer of my customers are using but most everybody wants to adopt. If you think about some of the attacks that have taken place, if I can get a piece of email to you, and you think it's me and you open up the attachment, at that point you are now infected and the malware that's on your machine has the ability to use your account to start moving around and authenticating the things that you are authorized to get to. So if I can send that piece of email and accomplish that, I might target a system administrator or system admins and go try to use their account because it's already authorized to go long onto the database servers, which is what I'm trying to get to. Now if we could flip it say well, yeah. He's a database admin but if he doesn't have permissions to go log onto anything right now and he has to make a request then the malware can't make the request and can't get the approval of the manager in order to go gain access to the database. >> Now, again, I want to explore the organizational friction. Does that slow down the organization's ability to conduct business and will it be pushed back from the user base or can you make that transparent? >> It does slow things down. We're talking about process-- >> That's what it is. It's a choice that organizations have to make if you care about the long term health of your company, your brand, your revenues or do you want to go for the short term profit? >> That is one of the biggest challenges that we describe in the software world as technical debt. Some IT organizations may as well. It's just the way things happen in the process by which people adhere to things. We find all to often that people will use the password vault for example and go check out the administrator password or their Dash-A account. It's authorized to log on to any Windows computer in the entire network that has an admin. And if they check it out, and they get to use it all day long, like okay did you put it in Clipboard? Malware knows how to get to your clipboard. Did you put it in a notepad document stored on your desktop? Guess what? Malware knows how to get to that. So now we've got a system might which people might check out a password and Malware can get to that password and use it for the whole day. Maybe at the end of the day the password vault can rotate the password so that it is not long lived. The process is what's wrong there. We allow humans to continue to do things in a bad way just because it's easy. >> The human error is a huge part in this. Administrators have their own identity. Systems have a big problem. We are with David McNeely, the vice president of product strategy with Centrify. I've got to get your take on Jim Ruth's, the chief security officer for Etna that was on the stage, great presentation. He's really talking about the cutting edge things that he's doing unconventionally he says, but it's the only way for him to attack the problem. He did do a shout out for Centrify. Congratulations on that. He was getting at a whole new way to reimagine security and he brought up that civilizations crumble when you lose trust. Huge issues. How would you guys seeing that help you guys solve problems for your customers? Is Etna a tell-sign for which direction to go? >> Absolutely, I mean if you think about problem we just described here the SysAdmin now needs to make a workflow style request to gain access to a machine, the problem is that takes time. It involves humans and process change. 
It would be a whole lot nicer, and we've already been delivering solutions that do this Machine learning behavior-based access controls. We tied it into our multifactor authentication system. The whole idea was to get the computers to make a decision based on behavior. Is it really David at the keyboard trying to gain access to a target application or a server? The machine can learn by patterns and by looking at my historical access to go determine does that look, and smell, and feel like David? >> The machine learning, for example. >> Right and that's a huge part of it, right? Because if we can get the computers to make these decisions automatically, then we eliminate so much time that is being chewed up by humans and putting things into a queue and then waiting for somebody to investigate. >> What's the impact of machine-learning on security in your opinion? Is it massive in the sense of, obviously it's breached, no it's going to be significant, but what areas is it attacking? The speed of the solution? The amount of data it can go through? Unique domain expertise of the applications? Where is the a-ha, moment for the machine learning value proposition? >> It's really going to help us enormously on making more intelligent decisions. If you think about access control systems, they all make a decision based on did you supply the correct user ID and password, or credential, or did you have access to whatever that resource is? But we only looked at two things. The authentication, and the access policy, and these behavior based systems, they look at a lot of other things. He mentioned 60 different attributes that they're looking at. And all of these attributes, we're looking at where's David's iPad? What's the location of my laptop, which would be in the room upstairs, my phone is nearby, and making sure that somebody is not trying to use my account from California because there's no way I could get from here to California at a rapid pace. >> Final question for you while we have a couple seconds left here. What is the value propositions for Centrify? If you had the bottom line of the product strategy in a nutshell? >> Well, kind of a tough one there. >> Identity? Stop the Breach is the tagline. Is it the identity? Is it the tech? Is it the workflow? >> Identity and access control. At the end of the day we are trying to provide identity and access controls around how a user accesses an application, how we access servers, privileged accounts, how you would access your mobile device and your mobile device accesses applications. Basically, if you think about what defines an organization, identity, the humans that work at an organization and your rights to go gain access to applications is what links everything together because as you start adopting cloud services as we've adopted mobile devices, there's no perimeter any more really for the company. Identity makes up the definition and the boundary of the organization. >> Alright, David McNeely, vice president of product strategy, Centrify. More live coverage, here in New York City from theCUBE, at CyberConnect 2017. The inaugural event. Cube coverage continues after this short break. (upbeat music)
SUMMARY :
Brought to you by Centrify and and the computer industry that is security. I'm really impressed by Centrify's approach here. This is about the core issues of the community You are the product. Well, I think a lot of it has to do with if you look is that the independent credentials of a user is David that is logged onto the account? finish your thoughts. Tools like their HashCat can break passwords. that runs the processes of the phone so the applications have to be sitting on top of So separate the workload from some of your infostructure. is not the machine or the computer, You have the keys to the kingdom at Centrify, For the longest time we used LDAP and Kerberos the desire to get short term profits and the developer at the customer organizations has the ability to use your account from the user base or can you make that transparent? It does slow things down. have to make if you care about the long term That is one of the biggest challenges that we describe seeing that help you guys solve problems for your customers? Is it really David at the keyboard Because if we can get the computers to make these decisions The authentication, and the access policy, What is the value propositions for Centrify? Is it the identity? and the boundary of the organization. of product strategy, Centrify.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave Vellante | PERSON | 0.99+ |
David McNeely | PERSON | 0.99+ |
Centrify | ORGANIZATION | 0.99+ |
California | LOCATION | 0.99+ |
Dave | PERSON | 0.99+ |
Institute for Critical Infrastructure Technology | ORGANIZATION | 0.99+ |
John Furrier | PERSON | 0.99+ |
David | PERSON | 0.99+ |
New York City | LOCATION | 0.99+ |
Ron Ross | PERSON | 0.99+ |
NIST | ORGANIZATION | 0.99+ |
60 different attributes | QUANTITY | 0.99+ |
iPad | COMMERCIAL_ITEM | 0.99+ |
iStekNET | ORGANIZATION | 0.99+ |
millions | QUANTITY | 0.99+ |
Equifax | ORGANIZATION | 0.99+ |
two things | QUANTITY | 0.99+ |
New York | LOCATION | 0.99+ |
today | DATE | 0.99+ |
one | QUANTITY | 0.99+ |
tomorrow | DATE | 0.99+ |
first key pair | QUANTITY | 0.99+ |
SourceStory | ORGANIZATION | 0.98+ |
one time | QUANTITY | 0.98+ |
2080 | DATE | 0.98+ |
Jim Ruth | PERSON | 0.98+ |
CyberConnect 2017 | EVENT | 0.97+ |
SysAdmin | ORGANIZATION | 0.95+ |
millions a second | QUANTITY | 0.95+ |
theCUBE | ORGANIZATION | 0.93+ |
Windows | TITLE | 0.92+ |
OLAF | TITLE | 0.9+ |
OpenID Connect | TITLE | 0.9+ |
Etna | ORGANIZATION | 0.89+ |
Dr. | PERSON | 0.85+ |
SAML | TITLE | 0.85+ |
HashCat | TITLE | 0.85+ |
couple seconds | QUANTITY | 0.74+ |
LDAP | TITLE | 0.73+ |
Every 30 days | QUANTITY | 0.69+ |
Centrify | EVENT | 0.69+ |
lot more time | QUANTITY | 0.67+ |
notepad | COMMERCIAL_ITEM | 0.66+ |
Kerberos | TITLE | 0.65+ |
199 | QUANTITY | 0.64+ |
Atom Bomb | OTHER | 0.62+ |
800-160 | COMMERCIAL_ITEM | 0.45+ |
Cube | ORGANIZATION | 0.41+ |
Malware | TITLE | 0.4+ |