UNLIST TILL 4/1 - Putting Complex Data Types to Work

hello everybody thank you for joining us today from the virtual verdict of BBC 2020 today's breakout session is entitled putting complex data types to work I'm Jeff Healey I lead vertical marketing I'll be a host for this breakout session joining me is Deepak Magette II technical lead from verdict engineering but before we begin I encourage you to submit questions and comments during the virtual session you don't have to wait just type your question or comment and the question box below the slides and click Submit it won't be a Q&A session at the end of the presentation we'll answer as many questions were able to during that time any questions we don't address we'll do our best to answer them offline alternatively visit Vertica forms that formed up Vertica calm to post your questions there after the session engineering team is planning to join the forms conversation going and also as a reminder that you can maximize your screen by clicking a double arrow button in the lower right corner of the slides yes this virtual session is being recorded and will be available to view on demand this week we'll send you a notification as submits ready now let's get started over to you Deepak thanks yes make sure you talk about the complex a textbook they've been doing it wedeck R&D without further delay let's see why and how we should put completely aside to work in your data analytics so this is going to be the outline or overview of my talk today first I'm going to talk about what are complex data types in some use cases I will then quickly cover some file formats that support these complex website I will then deep dive into the current support for complex data types in America finally I'll conclude with some usage considerations and what is coming in are 1000 release and our future roadmap and directions for this project so what are complex stereotypes complex data types are nested data structures composed of tentative types community types are nothing but your int float and string war binary etc the basic types some examples of complex data types include struct also called row are a list set map and Union composite types can also be built by composing other complicated types computer types are very useful for handling sparse data we also make samples on this presentation on that use case and also they help simplify analysis so let's look at some examples of complex data types so the first example on the left you can see a simple customer which is of type struc with two fields namely make a field name of type string and field ID of type integer structs are nothing but a group of fields and each field is a type of its own the type can be primitive or another complex type and on the right we have some example data for this simple customer complex type so it's basically two fields of type string and integer so in this case you have two rows where the first row is Alex with name named Alex and ID 1 0 and the second row has name Mary with ID 2 0 0 2 the second complex type on the left is phone numbers of type array of data has the element type string so area is nothing but a collection of elements the elements could be again a primitive type or another complex type so in this example the collection is of type string which is a primitive type and on the right you have some example of this collection of a fairy type called phone numbers and basically each row has a set or the list or a collection of phone numbers on the first we have two phone numbers and second you have a single phone number in that array and the third type on the slide is the map data type map is nothing but a collection of key value pairs so each element is actually a key value and you have a collection of such elements the key is usually a primitive type however the value is can be a primitive or complex type so in this example the both the key and value are of type string and then if you look on the right side of the slide you have some sample data here we have HTTP requests where the key is the header type and the value is the header value so the for instance on the first row we have a key type pragma with value no cash key type host with value some hostname and similarly on the second row you have some key value called accept with some text HTML because yeah they actually have a collection of elements allison maps are commonly called as collections as a to talking to in mini documents so we saw examples of a one-level complex steps on this slide we have nested complex there types on the right we have the root complex site called web events of type struct script has a for field a session ID of type integer session duration of type timestamp and then the third and the fourth fields customer and history requests are further complex types themselves so customer is again a complex type of type struct with three fields where the first two fields name ID are primitive types however the third field is another complex type phone numbers which we just saw in the previous slide similarly history request is also the same map type that we just saw so in this example each complex types is independent and you can reuse a complex type inside other complex types for example you can build another type called orders and simply reuse the customer type however in a practical implementation you have to deal with complexities involving security ownership and like sets lifecycle dependencies so keeping complex types as independent has that advantage of reusing them however the complication with that is you have to deal with security and ownership and lifecycle dependencies so this is on this slide we have another style of declaring a nested complex type do is call inlined complex data type so we have the same web driven struct type however if you look at the complex sites that embedded into the parent type definition so customer and HTTP request definition is embedded in lined into this parent structure so the advantage of this is you won't have to deal with the security and other lifecycle dependency issues but with the downside being you can't reuse them so it's sort of a trade-off between the these two so so let's see now some use cases of these complex types so the first use case or the benefit of using complex stereotypes is that you'll be able to express analysis mode naturally compute I've simplified the expression of analysis logic thereby simplifying the data pipelines in sequel it feels as if you have tables inside table so let's look at an example on and say you want to list all the customers with more than one thousand website events so if you have complex types you can simply create a table called web events and with one column of type web even which is a complex step so we just saw that difference it has four fields station customer and HTTP request so you can basically have the entire schema or in one type if you don't have complex types you'll have to create four tables one essentially for each complex type and then you have to establish primary key foreign key dependencies across these tables now if you want to achieve your goal of of listing all the customers in more than thousand web requests if you have complex types you can simply use the dot notation to extract the name the contact and also use some special functions for maps that will give you the count of all the HTTP requests grid in thousand however if you don't have complex types you'll have to now join each table individually extract the results from sub query and again joined on the outer query and finally you can apply a predicate of total requests which are greater than thousand to basically get your final result so it's a complex steps basically simplify the query writing part also the execution itself is also simplified so you don't have to have joins if you have complex you can simply have a load step to load the map type and then you can apply the function on top of it directly however if you have separate tables you have to join all these data and apply the filter step and then finally another joint to get your results alright so the other advantage of complex types is that you can cross this semi structured data very efficiently for example if you have data from clique streams or page views the data is often sparse and maps are very well suited for such data so maps or semi-structured by nature and with this support you can now actually have semi structured data represented along with structured columns in in any database so maps have this nice of nice feature to cap encapsulated sparse data as an example the common fields of a kick stream click stream or page view data are pragma host and except if you don't have map types you will have to end up creating a column for each of this header or field types however if you have map you can basically embed as key value pairs for all the data so on the left here on the slide you can see an example where you have a separate column for each field you end up with a lot of nodes basically the sparse however if you can embed them into in a map you can put them into a single column and sort of yeah have better efficiency and better representation of spots they imagine if you have thousands of fields in a click stream or page view you will have thousands of columns you will need thousands of columns represent data if you don't have a map type correct so given these are the most commonly used complexity types let's see what are the file formats that actually support these complex data types so most of file formats popular ones support complex data types however they have different serve variations so for instance if you have JSON it supports arrays and objects which are complex data types however JSON data is schema-less it is row oriented and this text fits because it is Kimmel s it has to store it in encase on every job the second type of file format is Avro and Avro has records enums arrays Maps unions and a fixed type however Avro has a schema it is oriented and it is binary compressed the third category is basically the park' and our style of file formats where the columnar so parquet and arc have support for arrays maps and structs the hewa schema they are column-oriented unlike Avro which is oriented and they're also binary compressed and they support a very nice compression and encoding types additionally so the main difference between parquet and arc is only in terms of how they represent complex types parquet includes the complex type hierarchy as reputation deflation levels however orc uses a separate column at every parent of the complex type to basically the prisons are now less so that apart from that difference in how they represent complex types parking hogs have similar capabilities in terms of optimizations and other compression techniques so to summarize JSON has no schema has no binary format in this columnar so it is not columnar Avro has a schema because binary format however it is not columnar and parquet and art are have a schema have a binary format and are columnar so let's see how we can query these different kinds of complex types and also the different file formats that they can be present in in how we can basically query these different variations in Vertica so in Vertica we basically have this feature called flex tables to where you can load complex data types and analyze them so flex tables use a binary format called vemma to store data as key value pairs clicks tables are schema-less they are weak typed and they trade flexibility for performance so when I mean what I mean by schema-less is basically the keys provide the field name and each row can potentially have different keys and it is weak type because there's no type information at the column level we have some we will see some examples of of this week type in the following slides but basically there's no type information so so the data is stored in text format and because of the week type and schema-less nature of flex tables you can implement some optimum use cases like if you can trivially implement needs like schema evolution or keep the complex types types fluid if that is your use case then the weak tightness and schema-less nature of flex tables will help you a lot to get give you that flexibility however because you have this weak type you you have a downside of not getting the best possible performance so if you if your use case is to get the best possible performance you can use a new feature of the strongly-typed complex types that we started to introduce in Vertica so complex types here are basically a strongly typed complex types they have a schema and then they give you the best possible performance because the optimizer now has enough information from the schema and the type to implement optimization system column selection or all the nice techniques that Vertica employs to give you the best possible color performance can now be supported even for complex types so and we'll see some of the examples of these two types in these slides now so let's use a simple data called restaurants a restaurant data - as running throughout this poll excites to basically see all the different variations of flex and complex steps so on this slide you have some sample data with four fields and essentially two rows if you sort of loaded in if you just operate them out so the four fields are named cuisine locations in menu name in cuisine or of type watch are locations is essentially an array and menu array of a row of two fields item and price so if you the data is in JSON there is no schema and there is no type information so how do we process that in Vertica so in Vertica you can simply create a flex table called restaurants you can copy the restaurant dot J's the restaurants of JSON file into Vertica and basically you can now start analyzing the data so if you do a select star from restaurants you will see that all the data is actually in one column called draw and it also you have the other column called identity which is to give you some unique row row ID but the row column base again encapsulates all the data that gives in the restaurant so JSON file this tall column is nothing but the V map format the V map format is a binary format that encodes the data as key value pairs and RAW format is basically backed by the long word binary column type in Vertica so each key essentially gives you the field name and the values the field value and it's all in its however the values are in the text text representation so see now you want to get better performance of this JSON data flex tables has these nice functions to basically analyze your data or try to extract some schema and type information from your data so if you execute compute flex table keys on the restaurants table you will see a new table called public dot restaurants underscore keys and then that will give you some information about your JSON data so it was able to automatically infer that your data has four fields namely could be name cuisine locations in menu and could also get that the name in cuisine or watch are however since locations in menu are complex types themselves one is array and one is area for row it sort of uses the same be map format as ease to process them so it has four columns to two primitive of type watch R and 2 R P map themselves so now you can materialize these columns by altering the table definitions and adding columns of that particular type it inferred and then you can get better performance from this materialized columns and yeah it's basically it's not in a single column anymore you have four columns for the fare your restaurant data and you can get some column selection and other optimizations on on the data that Whittaker provides all right so that is three flex tables are basically helpful if you don't have a schema and if you don't have any type of permission however we saw earlier that some file formats like Parker and Avro have schema and have some type information so in those cases you don't have to do the first step of inputting the type so you can directly create the type external table definition of the type and then you can target it to the park a file and you can load it in by an external table in vertical so the same restaurants dot JSON if you call if you transfer it to a translations or park' format you can basically get the fields with look however the locations and menu are still in the B map format all right so the V map format also allows you to explode the data and it has some nice functions to yeah M extract the fields from P map format so you have this map items so the same restaurant later if you want to explode and you want to apply predicate on the fields of the RS and the address of pro you can have map items to export your data and then you can apply predicates on a particular field in the complex type data so on this slide is basically showing you how you can explode the entire data the menu items as well as the locations and basically give you the elements of each of these complex types up so as I mentioned the menus so if you go back to the previous slide the locations and menu items are still the bond binary or the V map format so the question is if you want what if you want to get perform better on the V map data so for primitive types you could materialize into the primitive style however if it's an array and array of row we will need some first-class complex type constructs and that is what we will see that are added in what is right now so Vertica has started to introduce complex stereotypes with where these complex types is sort of a strongly typed complex site so on this slide you have an example of a row complex type where so we create an external table called customers and you have a row type of twit to fields name and ID so the complex type is basically inlined into the tables into the column definition and on the second example you can see the create external table items which is unlisted row type so it has an item of type row which is so fast to peals name and the properties is again another nested row type with two fixed quantities label so these are basically strongly typed complex types and then the optimizer can now give you a better performance compared to the V map using the strongly typed information in their queries so we have support for pure rows and extra draws in external tables for power K we have support for arrays and nested arrays as well for external tables in power K so you can declare an external table called contacts with a flip phone number of array of integers similarly you can have a nested array of items of type integer we can declare a column with that strongly typed complex type so the other complex type support that we are adding in the thinner liz's support for optimized one dimensional arrays and sets for both ross and as well as RK external table so you can create internal table called phone numbers with a one-dimensional array so here you have phone numbers of array of type int you can have one dimensional you can have sets as well which is also one color one dimension arrays but sets are basically optimized for fast look ups they are have unique elements and they are ordered so big so you can get fast look ups using sets if that is a use case then set will give you very quick lookups for elements and we also implemented some functions to support arrays sets as well so you have applied min apply max which are scale out that you can apply on top of an array element and you can get the minimum element and so on so you can up you have support for additional functions as well so the other feature that is coming in ten o is the explored arrays of functionality so we have a implemented EU DX that will allow you to similar similar to the example you saw in the math items case you can extract elements from these arrays and you can apply different predicates or analysis on the elements so for example if you have this restaurant table with the column name watch our locations of each an area of archer and menu again an area watch our you can insert values using the array constructor into these columns so here we inserting three values lilies feed the with location with locations cambridge pittsburgh menu items cheese and pepperoni again another row with name restaurant named bob tacos location Houston and totila salsa and Patty on the third example so now you can basically explode the both arrays into and extract the elements out from these arrays so you can explode the location array and extract the location elements which is which are basically Houston Cambridge Pittsburgh New Jersey and also you can explode the menu items and extract individual elements and now you can sort of apply other predicates on the extruded data Kollek so so so let's see what are some usage considerations of these complex data types so complex data types as we saw earlier are nice if you have sparse data so if your data has clickstream or has some page view data then maps are very nice to have to represent your data and then you can sort of efficiently represent the in the space wise fashion for sparse data use a map types and compensate that as we saw earlier for the web request count query it will help you simplify the analysis as well you don't have to have joins and it will simplify your query analysis as I just mentioned if your use cases are for fast look ups then you can use a set type so arrays are nice but they have the ordering on them however if your primary use case to just look up for certain elements then we can use the set type also you can use the B map or the Flex functionality that we have in Vertica if you want flexibility in your complex set data type schema so like I mentioned earlier you can trivially implement needs like scheme evolution or even keep the complex types fluid so if you have multiple iterations of unit analysis and each iteration we are changing the fields because you're just exploring the data then we map and flex will give you that nice ease to change the fields within the complex type or across files and we can load fluid complex you can load complexity types with bit fluids is basically different fields in different Rho into V map and flex tables easily however if you're once you basically treated over your data you figured out what are the fields and the complex types that you really need you can use the strongly typed complex data types that we started to introduce in Vertica so you can use the array type the struct type in the map type for your data analysis so that's sort of the high level use cases for complex types in vertical so it depends on a lot on where your data analysis phase is fear early then your data is usually still fluid and you might want to use V Maps and flex to explore it once you finalize your schema you can use the strongly typed complex data types and to get the best possible performance holic so so what's coming in the following releases of Vertica so antenna which is coming in sometime now so yeah so we are adding which is the next release of vertical basically we're adding support for loading Park a complex data types to the V map format so parquet is a strongly typed file format basically it has the schema it also has the type information for each of the complex type however if you are exploring your data then you might have different park' files with different schemes so you can load them to the V map format first and then you can analyze your data and then you can switch to the strongly typed complex types we're also adding one dimensional optimized arrays and sets in growth and for parquet so yeah the complex sets are not just limited to parquet you can also store them in drawers however right now you only support one dimension arrays and set in rows we're also adding the Explorer du/dx for one-dimensional arrays in the in this release so you can as you saw in the previous example you can explode the data for of arrays in arrays and you can apply predicates on individual elements for the erase data so you can in it'll apply for set so you can cause them to milli to erase and Clinics code sets as well so what are the plans paths that you know release so we are going to continue both for strongly-typed computer types right now we don't have support for the full in the tail release we won't have support for the full all the combinations of complex types so we only have support for nested arrays sorriness listed pure arrays or nested pure rows and some are only limited to park a file format so we will continue to add more support for sub queries and nested complex sites in the following in the in following releases and we're also planning to add this B map data type so you saw in the examples that the V map data format is currently backed by the long word binary data format or the other column type because of this the optimizer really cannot distinguish which is a which is which data is actually a long wall binary or which is actually data and we map format so if we the idea is to basically add a type called V map and then the optimizer can now implement our support optimizations or even syntax such as dot notation and yeah if your data is columnar such as Parque then you can implement optimizations just keep push down where you can push the keys that are actually querying in your in your in your analysis and then only those keys should be loaded from parquet and built into the V map format so that way you get sort of the column selection optimization for complex types as well and yeah that's something you can achieve if you have different types for the V map format so that's something on the roadmap as well and then unless join is basically another nice to have feature right now if you want to explode and join the array elements you have to explode in the sub query and then in the outer query you have to join the data however if you have unless join till I love you to explode as well as join the data in the same query and on the fly you can do both and finally we are also adding support for this new feature called UD vector so that's on the plan too so our work for complex types is is essentially chain the fundamental way Vertica execute in the sense of functions and expression so right now all expressions in Vertica can return only a single column out acceptance in some cases like beauty transforms and so on but the scalar functions for instance if you take aut scalar you can get only one column out of it however if you have some use cases where you want to compute multiple computation so if you also have multiple computations on the same input data say you have input data of two integers and you want to compute both addition and multiplication on those two columns this is for example but in many many machine learning example use cases have similar patterns so say you want to do both these computations on the data at the same time then in the current approach you have to have one function for addition one function for multiplication and both of them will have to load the data once basically loading data twice to get both these computations turn however with the Uni vector support you can perform both these computations in the same function and you can return two columns out so essentially saving you the loading loading these columns twice you can only do it once and get both the results out so that's sort of what we are trying to implement with all the changes that we are doing to support complex data types in Vertica and also you don't have to use these over Clause like a uni transform so PD scale just like we do scalars you can have your a vector and you can have multiple columns returned from your computations so that sort of concludes my talk so thank you for listening to my presentation now we are ready for Q&A

Published Date : Mar 30 2020

**Summary and Sentiment Analysis are not been shown because of improper transcript**

ENTITIES

Entity	Category	Confidence
America	LOCATION	0.99+
Jeff Healey	PERSON	0.99+
second row	QUANTITY	0.99+
Mary	PERSON	0.99+
two rows	QUANTITY	0.99+
two fields	QUANTITY	0.99+
first row	QUANTITY	0.99+
two rows	QUANTITY	0.99+
two types	QUANTITY	0.99+
each row	QUANTITY	0.99+
two integers	QUANTITY	0.99+
Deepak	PERSON	0.99+
one function	QUANTITY	0.99+
three fields	QUANTITY	0.99+
fourth fields	QUANTITY	0.99+
each element	QUANTITY	0.99+
each field	QUANTITY	0.99+
third	QUANTITY	0.99+
more than thousand web requests	QUANTITY	0.99+
second example	QUANTITY	0.99+
today	DATE	0.99+
each key	QUANTITY	0.99+
each table	QUANTITY	0.99+
four fields	QUANTITY	0.99+
third field	QUANTITY	0.99+
first example	QUANTITY	0.99+
Deepak Magette II	PERSON	0.99+
two columns	QUANTITY	0.99+
third category	QUANTITY	0.99+
two columns	QUANTITY	0.99+
two fields	QUANTITY	0.99+
Houston	LOCATION	0.99+
first step	QUANTITY	0.99+
twice	QUANTITY	0.99+
thousands of columns	QUANTITY	0.98+
three values	QUANTITY	0.98+
this week	DATE	0.98+
more than one thousand website events	QUANTITY	0.98+
third type	QUANTITY	0.98+
each iteration	QUANTITY	0.98+
both	QUANTITY	0.98+
greater than thousand	QUANTITY	0.98+
cambridge	LOCATION	0.98+
JSON	TITLE	0.98+
both arrays	QUANTITY	0.97+
one column	QUANTITY	0.97+
thousands of fields	QUANTITY	0.97+
second	QUANTITY	0.97+
third example	QUANTITY	0.97+
two	QUANTITY	0.97+
single column	QUANTITY	0.96+
thousand	QUANTITY	0.96+
Alex	PERSON	0.96+
first	QUANTITY	0.96+
BBC 2020	ORGANIZATION	0.96+
Vertica	TITLE	0.96+
four columns	QUANTITY	0.95+
once	QUANTITY	0.95+
one type	QUANTITY	0.95+
V Maps	TITLE	0.94+
one color	QUANTITY	0.94+
second type	QUANTITY	0.94+
one dimension	QUANTITY	0.94+
first two fields	QUANTITY	0.93+
four tables	QUANTITY	0.91+
each	QUANTITY	0.91+

Recommend Videos

Sentiment Analysis

AWS Comprehend

Search Results for V Maps: