The Shortest Path to Vertica – Best Practices for Data Warehouse Migration and ETL
Hello everybody, and thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled "The Shortest Path to Vertica – Best Practices for Data Warehouse Migration and ETL." I'm Jeff Healey, I lead Vertica marketing, and I'll be your host for this breakout session. Joining me today are Marco Gessner and Maurizio Felici, Vertica product engineers joining us from the EMEA region. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait; just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica forums at forum.vertica.com to post your questions there after the session; our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week; we'll send you a notification as soon as it's ready. Now let's get started. Over to you, Marco.

>> Hello everybody, this is Marco speaking, a sales engineer from EMEA. I'll just get going. This is the agenda: part one will be done by me, part two will be done by Maurizio. The agenda is, as you can see: big bang or piece by piece; migration of the DDL, that is the physical data model; migration of ETL (or ELT) and BI functionality; what to do with stored procedures; what to do with any existing user-defined functions; and migration of the data itself, which will be covered by Maurizio. Maurizio, do you want to say a few words?

>> Yeah, hello everybody, my name is Maurizio Felici and I'm a Vertica pre-sales engineer like Marco. I'm going to talk about how to optimize the data warehouse using some specific Vertica techniques such as table flattening and live aggregate projections. Let me start with a quick overview of the data warehouse migration process we are going to talk about today. Normally we suggest starting by migrating the current data warehouse from the old DBMS with limited or minimal changes in the overall architecture. Clearly we will have to port the DDL and redirect the data access tools to the new platform, but we should minimize the amount of changes in this initial phase in order to go live as soon as possible. In the second phase we can start optimizing the data warehouse, again with no or minimal changes in the architecture as such; during this optimization phase we can create, for example, new projections for some specific queries, optimize encoding, or adjust some of the resource pools. This is something that we normally do if and when needed. And finally, again if and when needed, we go through an architectural redesign of the operations using the full set of Vertica techniques, in order to take advantage of all the features we have in Vertica. This is normally an iterative approach, so we may go back and tune a specific feature before moving back to the architecture and design. We will go through this process in the next few slides.

>> OK. In order to encourage everyone to keep using their common sense when migrating to a new database management system (people are often afraid of it), it's often useful to use the analogy of a house move.
In your old home you might have developed solutions for your everyday life that make perfect sense there. For example, if your old Saint Bernard dog can't walk anymore, you might be using a forklift to heave him in through the window of the old home. Well, in the new home, consider the elevator, and don't complain that the window is too small to fit the dog through. It's very much the same with Vertica.

To make the transition gentle, let me stay with the analogy of the house move: picture your new house as your new holiday home. Begin to install everything you miss and everything you like from your old home, and once you have everything you need in your new house, you can shut down the old one. So move bit by bit and go for quick wins to make your audience happy. You do big bang only if you are going to retire the platform you are sitting on, or if you're really on a sinking ship. Otherwise, again: identify quick wins, implement and publish them quickly in Vertica, reap the benefits, enjoy the applause, use the gained reputation for further funding, and if you find that nobody is using the old platform anymore, you can shut it down. If you really have to migrate in one go, go big bang only if you absolutely have to; otherwise migrate by subject area and group similar areas together.

Having said that, you start off by migrating objects in the database; that's one of the very first steps. It consists of migrating first the places where you can put the other objects, that is owners and locations, which usually means schemas. Then, what do you have? You extract the tables and views, convert the object definitions, and deploy them to Vertica. Keep in mind that you shouldn't do this manually: never type what you can generate, and automate whatever you can.

Users and roles: usually there are system tables in the old database that contain all the roles. You can export those to a file, reformat them, and then you have CREATE ROLE and CREATE USER scripts that you can apply to Vertica. If LDAP or Active Directory was used for authentication in the old database, Vertica supports anything within the LDAP standard.

Catalogs and schemas should be relatively straightforward, with maybe one difference: Vertica does not restrict you by defining a schema as a collection of all objects owned by a user, but it supports it, it emulates it for old times' sake. Vertica does not need a catalog; if the old tools you use absolutely need a catalog, it is always set to the name of the database in the case of Vertica.

Having now the schemas, the catalogs, the users and the roles in place, move on to the data definition language, the DDL, of the tables. If you are allowed to, it's best to use a tool that translates the data types in the DDL it generates. You might have seen odb mentioned several times in this presentation; we are very happy to have it. It can export the old database's table definitions because it works over ODBC: it takes what the old database's ODBC driver exposes and then uses internal translation tables to map to several target DBMS flavors, the most important of which is obviously Vertica. If they force you to use something else, there are always tools like SQL*Plus in Oracle, the SHOW TABLE command in Teradata, and so on; each DBMS should have a set of tools to extract object definitions for deployment in another instance of the same DBMS.
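As a hedged illustration of the "never type what you can generate" advice: if the old database exposes users and roles in system tables, you can generate the Vertica statements instead of typing them. The legacy_sys names below are hypothetical placeholders, since every DBMS names its catalog tables differently; only the generated CREATE USER, CREATE ROLE, and GRANT statements are actual Vertica syntax.

```sql
-- Sketch only: run these against the legacy catalog, spool the output
-- to a file, then execute that file against Vertica.
SELECT 'CREATE USER ' || user_name || ';'                  FROM legacy_sys.users;
SELECT 'CREATE ROLE ' || role_name || ';'                  FROM legacy_sys.roles;
SELECT 'GRANT ' || role_name || ' TO ' || user_name || ';' FROM legacy_sys.user_roles;
```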
If I talk about views: usually you will find the view definitions in the old database catalog as well. One thing that needs a bit of special care is synonyms. A synonym is essentially an alias placed on the view or table to be referred to, and synonyms get emulated in different ways depending on the specific needs. Something that is really neat, and that other databases don't have, is the search path. It works very much like the PATH environment variable in Windows or Linux: you specify an object name without the schema name, and Vertica searches first in the first entry of the search path, then in the second, then in the third, which makes synonyms largely unneeded.

When you generate DDL, to remain in the analogy of moving house, dust and clean your stuff before placing it in the new house. If you see a table like the one here at the bottom, this is usually the corpse of a bad migration in the past: an ID is usually an integer, not a floating-point data type; a first name hardly ever has 256 characters; and if a column is called HIRE_DT, it's not necessarily needed to store the second when somebody was hired. So take good care, while you are moving, to dust off your stuff and use better data types.

The same applies especially to strings. How many bytes does a string contain? For four Euro signs it's not four; it's actually 12 bytes in UTF-8, which is the way that Vertica encodes strings: an ASCII character takes one byte, but the Euro sign takes three. That means that when you have a single-byte character set at the source, you very often have to pay attention and oversize the columns first, because otherwise data gets rejected or truncated, and then carefully check what the best size is. The most promising approach is to initially dimension strings in multiples of their original length. odb, with the command-line option you see on the slide, will double the length of what would otherwise be single-byte characters and multiply accordingly the length of columns that are wide characters in traditional databases. Then load a representative sample of your source data, profile it to find the actual longest values, and make the columns shorter.

Note the issues that too-long and too-big data types cause in projection design; we live and die with our projections. You might remember the rules for how default projections come to exist. The way we do it initially is, just like for the profiling, to load a representative sample of the data, collect a representative set of already-known queries, and run the Vertica Database Designer. You don't have to decide immediately, you can always amend things; otherwise follow the laws of physics: avoid moving data back and forth across nodes and avoid heavy I/O. If you can, design your projections initially by hand. Encoding matters: you know that the Database Designer is a very tight-fisted thing, it optimizes to use as little space as possible, but you have to consider that if you compress very well you might end up spending more time reading the data back. Here is a test run using several encoding types, and you can see that RLE, run-length encoding, is not even visible in the chart if the data is sorted, while the others are considerably slower. You can get these slides and look at the numbers in detail; I won't go into more detail now.
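A small sketch of the profiling step mentioned above: load a representative sample, measure the real lengths, and only then right-size the oversized VARCHARs. Table and column names are illustrative.

```sql
-- Longest value actually present, in characters and in bytes (UTF-8 aware),
-- for a column that was defensively oversized at first load.
SELECT MAX(CHARACTER_LENGTH(first_name)) AS max_chars,
       MAX(OCTET_LENGTH(first_name))     AS max_bytes
FROM   staging.customer_sample;

-- Then shrink the column close to the observed maximum. Depending on the
-- Vertica version this may require recreating the table instead.
ALTER TABLE staging.customer_sample
    ALTER COLUMN first_name SET DATA TYPE VARCHAR(40);
```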
Now about BI migrations. Usually you can expect 80% of everything to be lifted and shifted. You don't need most of the pre-aggregated tables, because we have live aggregate projections. Many BI tools have specialized query objects for the dimensions and the facts, and we have the possibility to use flattened tables, which will be talked about later; you might have to adjust those query objects by hand. You will be able to switch off a lot of caching, because Vertica speeds up everything with live aggregate projections. And if you have worked with MOLAP cubes before, you very probably won't need them at all.

ETL tools: if you load row by row into the old database, consider changing everything to very big transactions; and if you use insert statements with parameter markers, consider writing to named pipes and using Vertica's COPY command instead of inserts. Yeah, the COPY command, that's what I have here.

Custom functionality: you can see on this slide that Vertica has by far the biggest number of built-in functions among the databases we compare it with regularly. You might find that many of the functions you have written won't be needed on the new database, so look at the Vertica catalog instead of trying to migrate a function you don't need. Stored procedures are very often used in the old database to overcome shortcomings that Vertica doesn't have. Very rarely will you have to actually write a procedure that involves a loop; in our experience it is really very, very rare, and usually you can just switch to standard scripting. This next slide is basically repeating what Maurizio said, so in the interest of time I will skip it.

Look at this one here: most of a data warehouse migration should be automatic. You can automate DDL migration using odb, and that is crucial. Data profiling is not crucial, but game-changing. The encoding is the same: you can automate it using the Database Designer. The physical data model optimization in general is game-changing, and you have the Database Designer for it. Use the provisioning, and use the old platform's tools to generate the SQL. Having no objects without their owners is crucial. And custom functions and procedures are only crucial if they embody the company's intellectual property; otherwise you can almost always replace them with something else. That's it from me for now.
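Going back to the ETL advice above about replacing row-by-row parameterized inserts with a bulk load: a minimal sketch of the idea is to have the ETL process write to a named pipe (created beforehand with mkfifo on Linux) and let Vertica drain it with COPY. The table name, path, delimiter, and options below are illustrative assumptions rather than anything from the talk.

```sql
-- The ETL process writes delimited rows into /tmp/sales_pipe (a named pipe);
-- this single COPY statement reads them in one fast bulk load.
COPY public.sales_fact
FROM '/tmp/sales_pipe'
DELIMITER '|'
NULL ''
REJECTED DATA '/tmp/sales_rejects.txt'
DIRECT;   -- load straight into ROS; mainly relevant on older Vertica versions
```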
>> Thank you, Marco. We will now continue our presentation by talking about some of the Vertica data warehouse optimization techniques we can implement in order to improve the general efficiency of the data warehouse. Let me start with a few simple messages. The first one is that you are supposed to optimize only if and when it is needed: in most cases, just a lift and shift from the old data warehouse to Vertica will give you exactly the performance you were looking for, or even better, and in that case it is probably not really necessary to optimize anything. If you want to optimize, or you need to optimize, then keep in mind some of the Vertica peculiarities: for example, implement deletes and updates the Vertica way; use live aggregate projections in order to avoid, or better, to limit GROUP BY execution at query time; use table flattening in order to avoid or limit joins; and then you can also use some Vertica-specific extensions, for example time series analysis or machine learning, on top of your data. We will now start by reviewing the first of these bullets: optimize if and when needed.

Well, if the performance level is okay when you migrate from the old data warehouse to Vertica without any optimization, then probably you don't need to touch anything. If that is not the case, one very easy optimization technique is to ask Vertica itself to optimize the physical data model, using the Vertica Database Designer. DBD, the Vertica Database Designer, has several interfaces; here I'm going to use what we call the DBD programmatic API, which is basically SQL functions. With other databases you might need to hire experts to look at your data, your data warehouse, and your table definitions, creating indexes or whatever; in Vertica all you need is to run something like this, as simple as six single SQL statements, to get a very well optimized physical data model. You see that we start by creating a new design, then we add to the design the tables and the queries we want to optimize, then we set our target: in this case we are tuning the physical data model in order to maximize query performance, which is why we are using the QUERY objective in our statement; another possible goal would be to tune in order to reduce storage, or a mix between storage and query performance. Finally we ask Vertica to produce and deploy this optimized design. In a matter of literally minutes you can get a fully optimized physical data model. This is something very, very easy to implement.
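To make the "six single SQL statements" concrete, here is a minimal sketch of the Database Designer programmatic flow described above. The design name, table pattern, and file paths are illustrative, and the exact function signatures should be checked against the documentation for your Vertica version.

```sql
SELECT DESIGNER_CREATE_DESIGN('dwh_design');
SELECT DESIGNER_ADD_DESIGN_TABLES('dwh_design', 'public.*');
SELECT DESIGNER_ADD_DESIGN_QUERIES('dwh_design', '/home/dbadmin/queries.sql', TRUE);
SELECT DESIGNER_SET_OPTIMIZATION_OBJECTIVE('dwh_design', 'QUERY');  -- or 'LOAD' / 'BALANCED'
SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('dwh_design',
       '/home/dbadmin/dwh_design.sql', '/home/dbadmin/dwh_deploy.sql');
SELECT DESIGNER_DROP_DESIGN('dwh_design');
```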
Now, the second message: keep in mind some of the Vertica peculiarities. Vertica is very well tuned for load and query operations. Vertica writes ROS containers to disk; a ROS container is a group of files, and we never, ever change the content of those files. The fact that ROS container files are never modified is one of the Vertica peculiarities, and this approach lets Vertica use minimal locking: we can run multiple load operations in parallel against the very same table, assuming we don't have a primary or unique constraint enforced on the target table, because they will end up in different ROS containers. A SELECT in read committed isolation requires no lock at all and can run concurrently with an INSERT ... SELECT, because the SELECT works on a snapshot of the catalog taken when the transaction starts; this is what we call snapshot isolation. And recovery, because we never change the ROS files, is very simple and robust. So we get a huge number of advantages from the fact that we never change the content of the data contained in the ROS containers. On the other side, deletes and updates require a little attention.

So what about deletes? First: when you delete in Vertica, you basically create a new object called a delete vector. It will appear a bit later in the ROS, or in memory, and this vector points to the data being deleted, so that when a query is executed Vertica simply ignores the rows listed in the delete vectors. And it's not just about the delete: an update in Vertica consists of two operations, a delete and an insert, and a merge consists of either an insert or an update, which in turn is made of a delete plus an insert. So if we tune how the delete works, we will also have tuned the update and the merge. What should we do to optimize deletes? Remember what we said: every time we delete, we actually create a new object, a delete vector. So avoid committing deletes and updates too often; this reduces the work for the mergeout and cleanup activities that run afterwards. Be sure that all the interested projections contain the columns referenced in the delete predicate: this lets Vertica access that projection directly, without having to go through the super projection to create the delete vector, and the delete will be much, much faster. Finally, another very interesting optimization technique is to segregate the update and delete operations from the rest of the workload in order to reduce lock contention, and this can be done using partition operations, which is exactly what I want to talk about now.

Here you have a typical data warehouse architecture: data arrives in a landing zone, where it is loaded as-is from the data sources; then we have a transformation layer writing into a staging area, which in turn feeds partitioned blocks of data into the green data structures at the end. Those green data structures are the ones used by the data access tools when they run their queries. Sometimes we might need to change old data, for example because we have late-arriving records, or because we want to fix some errors that originated in the source feeds. What we do in this case is copy the partition we want to adjust from the green query area at the end back into the staging area, which is a very fast copy-partition operation. Then we run our updates, our adjustment procedures, whatever we need to fix the errors in the data, in the staging area, and at the very same time users continue to query the green data structures at the end, so we never have contention between the two operations. When the update in the staging area is completed, all we have to do is run a swap-partitions operation between the tables, in order to swap the data we just finished adjusting from the staging zone into the query area, the green one at the end. This swap partition is very fast, it is an atomic operation, and basically all that happens is that we exchange pointers to the data. This is a very, very effective technique, and a lot of customers use it.
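A hedged sketch of the staging pattern just described, using Vertica's partition functions. Table names and the partition key values are illustrative and assume both tables share the same definition and are partitioned on the same date key; the UPDATE is only a stand-in for whatever correction is needed.

```sql
-- 1. Copy the partition to be corrected from the query ("green") table
--    into a staging table that has the same definition.
SELECT COPY_PARTITIONS_TO_TABLE('dwh.sales_fact', '2020-03-01', '2020-03-01',
                                'staging.sales_fact_fix');

-- 2. Fix the data in staging while users keep querying dwh.sales_fact.
UPDATE staging.sales_fact_fix
   SET amount = amount * 1.1      -- illustrative correction only
 WHERE store_id = 42;
COMMIT;

-- 3. Atomically exchange the corrected partition back into the query table.
SELECT SWAP_PARTITIONS_BETWEEN_TABLES('staging.sales_fact_fix',
                                      '2020-03-01', '2020-03-01',
                                      'dwh.sales_fact');
```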
So, why flattened tables and live aggregate projections? Basically, we use flattened tables and live aggregate projections to minimize or avoid joins, which is what flattened tables are for, and GROUP BYs, which is what live aggregate projections are for. Compared to traditional data warehouses, Vertica can store, process, aggregate, and join orders of magnitude more data; it is a true columnar database, and joins and GROUP BYs are normally not a problem at all, they run faster than in any traditional data warehouse. But there are still scenarios where the data sets are so big, we are talking about petabytes of data, and growing so quickly, that we need something to boost GROUP BY and join performance. This is why you can use live aggregate projections to perform aggregations at load time and limit the need for GROUP BYs at query time, and flattened tables to combine information from different entities at load time and, again, avoid running joins at query time.

OK, so, live aggregate projections. At this point in time we can build live aggregate projections using four built-in aggregate functions: SUM, MIN, MAX, and COUNT. Let's see how this works. Suppose you have a normal table; in this case we have a table unit_sold with three columns, PID, date_time, and quantity, which has been segmented in a given way. On top of this base table, which we call the anchor table, we create a projection. You see that we create the projection using a SELECT that aggregates the data: we take PID, the date portion of date_time, and the sum of quantity from the base table, grouping on the first two columns, so PID and the date portion of date_time.

What happens when we load data into the base table? All we have to do is load data into the base table. When we do, we will of course fill the projections; assuming we are running with K-safety 1 we will have two projections, and we will load into those two projections all the detailed data we are loading into the table, so PID, date_time, and quantity. But at the very same time, without having to run any particular operation or any ETL procedure, we also automatically get, in the live aggregate projection, the data pre-aggregated by PID and the date portion of date_time, with the sum of quantity. This is something we get for free, without having to run any specific procedure, and it is very, very efficient. The key concept is that the loading operation, from a DML point of view, is executed against the base table; we do not explicitly aggregate the data and we don't have any procedure doing the aggregation. The aggregation is automatic, and Vertica feeds the live aggregate projection every time we load into the base table.

You see the two SELECTs on this slide: they produce exactly the same result, so running SELECT PID, date, SUM(quantity) against the base table, or running SELECT * from the live aggregate projection, returns exactly the same data. This is of course very useful, but what is much more useful, and we can observe this if we run an EXPLAIN, is that if we run the SELECT against the base table asking for this grouped data, what happens behind the scenes is that Vertica sees there is a live aggregate projection with data that has already been aggregated during the loading phase, and rewrites your query to use the live aggregate projection. This happens automatically: you see, this is a query that ran a GROUP BY against unit_sold, and Vertica decided to rewrite it as something executed against the live aggregate projection, because this saves a huge amount of time and effort during the ETL cycle. And it is not just limited to the information you explicitly chose to aggregate: another query, for example a COUNT DISTINCT, basically any of our GROUP BYs, will also take advantage of the live aggregate projection, and again this happens automatically; you don't have to do anything to get it.

One thing we have to keep very, very clear in mind: what we store in the live aggregate projection is partially aggregated data. In this example we have two inserts; you see that the first insert is inserting four rows and the second insert is inserting five rows. For each of these inserts we will have a partial aggregation: Vertica will never know that after the first insert there will be a second one, so Vertica calculates the aggregation of the data every time we run an insert. This is a key concept, and it also means that you can maximize the effectiveness of this technique by inserting large chunks of data.
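As a rough sketch of the live aggregate projection walked through above: column names follow the slide, while the segmentation clause and the date expression are illustrative and may need adjusting to your version's restrictions on projections with expressions.

```sql
CREATE TABLE public.unit_sold (
    pid       INT,
    date_time TIMESTAMP,
    quantity  INT
) SEGMENTED BY HASH(pid) ALL NODES;

-- Live aggregate projection: Vertica maintains the partial sums at load
-- time automatically; no ETL job is involved.
CREATE PROJECTION public.unit_sold_agg (pid, sale_date, total_qty)
AS SELECT pid, date_time::DATE, SUM(quantity)
   FROM   public.unit_sold
   GROUP BY pid, date_time::DATE;

-- Equivalent results; query rewrite lets the first one read the projection too.
SELECT pid, date_time::DATE, SUM(quantity) FROM public.unit_sold GROUP BY 1, 2;
SELECT * FROM public.unit_sold_agg;
```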
If you insert data row by row, this live aggregate projection technique is not very useful, because for every row you insert you will have an aggregation, so the live aggregate projection will end up containing the same number of rows as the base table. But if you insert a large chunk of data every time, the number of aggregated rows in the live aggregate structure is much smaller than the base data. This is a key concept. You can see how this works by counting the rows in a live aggregate projection: if you run SELECT COUNT(*) from the unit_sold live aggregate projection, the query on the left side, you will get four rows, but if you EXPLAIN this query you will see that it was reading six rows. This is because each of the two inserts we ran earlier put three rows into the live aggregate projection. So, again, the key concept: live aggregate projections keep partially aggregated data, and the final aggregation always happens at runtime.

Another technique which is very similar to the live aggregate projection is what we call the top-K projection. We do not actually aggregate anything in a top-K projection; we just keep the last rows, or limit the number of rows we keep, using a LIMIT ... OVER (PARTITION BY ... ORDER BY ...) clause. In this case we create on top of the base table two top-K projections: one to keep the last quantity that has been sold, and the other to keep the max quantity. In both cases it is just a matter of ordering the data, in the first case by the date_time column and in the second case by quantity, and in both cases we fill the projection with just the last row. Again, this is something that happens automatically when we insert data into the base table. If we now run, after the insert, our SELECT against either the max-quantity or the last-quantity projection, we get just those values; you see that we have far fewer rows in the top-K projections.

We said at the beginning that we can use four built-in functions; you might remember them, MIN, MAX, SUM, and COUNT. What if I want to create my own specific aggregation on top of the loaded data, because my customers have very specific needs in terms of live aggregate projections? In this case you can code your own live aggregate projection with user-defined functions: you can create a user-defined transform function to implement any sort of complex aggregation while loading data. After you have implemented this UDTF, you can deploy it using a pre-pass approach, which means the data is aggregated at load time during data ingestion, or a batch approach, which means the aggregation runs as a batch on top of the loaded data.

Things to remember about live aggregate projections: they are limited to the built-in functions, again SUM, MAX, MIN, and COUNT, but you can code your own UDTFs, so you can do whatever you want. They can reference only one table. For Vertica versions before 9.3 it was impossible to update or delete on the anchor table; this limit has been removed in 9.3, so you can now update and delete data in the anchor table. A live aggregate projection follows the segmentation of the GROUP BY expression, and in some cases the Vertica optimizer can decide to pick the live aggregate projection or not, depending on whether using the aggregation is convenient. And remember that if we insert and commit every single row to the anchor table, we end up with a live aggregate projection that contains exactly the same number of rows; in that case using the live aggregate projection or the base table would be the same.
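And a similar hedged sketch of the two top-K projections mentioned a moment ago, on the same illustrative unit_sold table:

```sql
-- Most recent row per product: "last quantity sold".
CREATE PROJECTION public.unit_sold_last (pid, date_time, quantity)
AS SELECT pid, date_time, quantity
   FROM   public.unit_sold
   LIMIT 1 OVER (PARTITION BY pid ORDER BY date_time DESC);

-- Row with the highest quantity per product.
CREATE PROJECTION public.unit_sold_max (pid, date_time, quantity)
AS SELECT pid, date_time, quantity
   FROM   public.unit_sold
   LIMIT 1 OVER (PARTITION BY pid ORDER BY quantity DESC);
```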
OK, so this is one of the two fantastic techniques we can implement in Vertica: the live aggregate projection, which is basically there to avoid or limit GROUP BYs. The other one, which we are going to talk about now, is the flattened table, which we use to avoid the need for joins. Remember that Vertica is very fast at running joins, but when we scale up to petabytes of data we need a boost, and this is what we have in order to get this problem fixed regardless of the amount of data we are dealing with.

So, what about flattened tables? Let me start with normalized schemas; everybody knows what a normalized schema is, and there is nothing really new in this slide. The main purpose of a normalized schema is to reduce data redundancy, and reducing redundancy is a good thing because we obtain fast writes: we only have to write small chunks of data into the right tables. The problem with normalized schemas is that when you run your queries you have to put together information that comes from different tables, and this requires running joins. Again, Vertica is normally very good at running joins, but sometimes the amount of data makes joins not easy to deal with, and joins are sometimes not easy to tune. What happens in the traditional data warehouse is that we denormalize the schemas, normally either manually or using an ETL. So we have, on the left side of this slide, the normalized schema, where we get very fast writes, and on the right side the wide table, where all the joins and pre-aggregations have been run to prepare the data for the queries. We have fast writes on the left and fast reads on the right, and the problem lies in the middle, because we push all the complexity into the middle, into the ETL that has to transform the normalized schema into the wide table. The way we normally implement this, either manually with procedures that we code or using an ETL tool, is to build an ETL layer that runs the INSERT ... SELECT reading from the normalized schema and writing into the wide table at the end, the one used by the data access tools we run our queries with. This approach is costly, because someone has to code the ETL; it is slow, because someone has to execute those batches, normally overnight after loading the data, and maybe someone has to check the following morning that everything was okay with the batch; it is resource intensive, and it is also human intensive, because of the people who have to code and check the results; it is error prone, because it can fail; and it introduces latency, because there is a gap in the time axis between the time t0, when you load the data into the normalized schema, and the time t1, when the data is finally ready to be queried.

What we do in Vertica to facilitate this process is create flattened tables. With flattened tables, first, you avoid data redundancy, because you don't need the wide table next to the normalized schema on the left side; second, it is fully automatic: you don't have to do anything, you just insert the data into the flattened table, and the INSERT ... SELECT logic that you would otherwise have coded in the ETL is executed by Vertica for you, automatically.
You don't have to do anything else; it's robust, and the latency is zero: as soon as you load the data into the flattened table, you get all the joins executed for you. So let's have a look at how it works. In this case we have the table we are going to flatten, and basically we have to focus on two different clauses. You see that there is one column here, dimension_value_1, which can be defined with DEFAULT and then a SELECT, or with SET USING. The difference between DEFAULT and SET USING is when the data is populated: if we use DEFAULT, the data is populated as soon as we load the data into the base table; if we use SET USING, we have to run a refresh. But everything is there: you don't need an ETL, you don't need to code any transformation, because everything is in the table definition itself, it comes for free, and of course the latency is zero; as soon as you load the other columns, you have the dimension value populated as well.

Let's see an example. Suppose we have a dimension table, customer_dimension, on the left side, and a fact table on the right. You see that the fact table uses columns like o_name and o_city, which are basically the result of a SELECT on top of the customer dimension. This is where the join is executed: as soon as we load data into the fact table, directly into the fact table, without of course loading the data that comes from the dimension, all the data from the dimension is populated automatically. So let's look at the example: suppose we are running this insert. As you can see, we are inserting directly into the fact table, and we are loading o_id, customer_id, and total; we are not loading name or city. Name and city are automatically populated by Vertica for you, because of the definition of the flattened table. This is all you need in order to have your wide table, your flattened table, built for you, and it means that at runtime you won't need any join between the base fact table and the customer dimension that we used to calculate name and city, because the data is already there.

That was using DEFAULT; the other option is using SET USING. The concept is absolutely the same: you see that in this case, on the right side, we have basically replaced o_name DEFAULT with o_name SET USING, and the same is true for city. The concept is the same, but in this case, with SET USING, we have to refresh: you see that we have to run this SELECT REFRESH_COLUMNS with the name of the table; in this case all columns will be refreshed, or you can specify only certain columns, and this brings in the values for name and city, reading from the customer dimension. This technique is extremely useful. The difference between DEFAULT and SET USING, just to summarize the most important point: DEFAULT populates your target when you load, SET USING when you refresh, and in some cases you might need to use them both. In this example you see that we define o_name using both DEFAULT and SET USING, which means we get the data populated either when we load the data into the base table or when we run the refresh.
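A minimal sketch of the flattened-table idea just described, combining a DEFAULT column (populated at load time) and a SET USING column (populated by REFRESH_COLUMNS). Table and column names mirror the slide but are illustrative.

```sql
CREATE TABLE public.customer_dimension (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(80),
    city        VARCHAR(80)
);

CREATE TABLE public.orders_fact (
    o_id        INT,
    customer_id INT,
    total       NUMERIC(12,2),
    -- populated automatically at INSERT time
    o_name VARCHAR(80) DEFAULT (SELECT name FROM public.customer_dimension c
                                WHERE c.customer_id = orders_fact.customer_id),
    -- populated when REFRESH_COLUMNS is run
    o_city VARCHAR(80) SET USING (SELECT city FROM public.customer_dimension c
                                  WHERE c.customer_id = orders_fact.customer_id)
);

-- Load only the fact columns; o_name is filled in by the DEFAULT subquery.
INSERT INTO public.orders_fact (o_id, customer_id, total) VALUES (1, 42, 99.90);

-- Bring the SET USING column(s) up to date.
SELECT REFRESH_COLUMNS('public.orders_fact', 'o_city', 'REBUILD');
```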
This is a summary of the techniques we can implement in Vertica to make our data warehouses even more efficient. And that is basically the end of our presentation. Thank you for listening, and now we are ready for the Q&A session.
Paola Peraza Calderon & Viraj Parekh, Astronomer | Cube Conversation
(soft electronic music) >> Hey everyone, welcome to this CUBE conversation as part of the AWS Startup Showcase, season three, episode one, featuring Astronomer. I'm your host, Lisa Martin. I'm in the CUBE's Palo Alto Studios, and today excited to be joined by a couple of guests, a couple of co-founders from Astronomer. Viraj Parekh is with us, as is Paola Peraza-Calderon. Thanks guys so much for joining us. Excited to dig into Astronomer. >> Thank you so much for having us. >> Yeah, thanks for having us. >> Yeah, and we're going to be talking about the role of data orchestration. Paola, let's go ahead and start with you. Give the audience that understanding, that context about Astronomer and what it is that you guys do. >> Mm-hmm. Yeah, absolutely. So, Astronomer is a, you know, we're a technology and software company for modern data orchestration, as you said, and we're the driving force behind Apache Airflow, the open source workflow management tool that's since been adopted by thousands and thousands of users, and we'll dig into this a little bit more. But, by data orchestration, we mean data pipelines, so generally speaking, getting data from one place to another, transforming it, running it on a schedule, and overall just building a central system that tangibly connects your entire ecosystem of data services, right? So, that's Redshift, Snowflake, dbt, et cetera. And so tangibly, we build, we at Astronomer here build products powered by Apache Airflow for data teams and for data practitioners, so that they don't have to. So, we sell to data engineers, data scientists, data admins, and we really spend our time doing three things. So, the first is that we build Astro, our flagship cloud service that we'll talk more on. But here, we're really building experiences that make it easier for data practitioners to author, run, and scale their data pipeline footprint on the cloud. And then, we also contribute to Apache Airflow as an open source project and community. So, we cultivate the community of humans, and we also put out open source developer tools that actually make it easier for individual data practitioners to be productive in their day-to-day jobs, whether or not they actually use our product and pay us money or not. And then of course, we also have professional services and education and all of these things around our commercial products that enable folks to use our products and use Airflow as effectively as possible. So yeah, super, super happy with everything we've done and hopefully that gives you an idea of where we're starting. >> Awesome, so when you're talking with those, Paola, those data engineers, those data scientists, how do you define data orchestration and what does it mean to them?
And so, we have a weekly meeting, for example, that goes through a dashboard and a dashboarding tool called Sigma where we see the number of monthly customers and how they're engaging with our product. But, to actually do that, you know, we have to use data from our application database, for example, that has behavioral data on what they're actually doing in our product. We also have data from third party API tools, like Salesforce and HubSpot, and other ways in which our customer, we actually engage with our customers and their behavior. And so, our data team internally at Astronomer uses a bunch of tools to transform and use that data, right? So, we use Fivetran, for example, to ingest. We use Snowflake as our data warehouse. We use other tools for data transformations. And even, if we at Astronomer don't do this, you can imagine a data team also using tools like, Monte Carlo for data quality, or Hightouch for reverse ETL, or things like that. And, I think the point here is that data teams, you know, that are building data-driven organizations have a plethora of tooling to both ingest the right data and come up with the right interfaces to transform and actually, interact with that data. And so, that movement and sort of synchronization of data across your ecosystem is exactly what data orchestration is responsible for. Historically, I think, and Raj will talk more about this, historically, schedulers like cron and Oozie or Control-M have taken a role here, but we think that Apache Airflow has sort of risen over the past few years as the de facto industry standard for writing data pipelines that do tasks, that do data jobs that interact with that ecosystem of tools in your organization. And so, beyond that sort of data pipeline unit, I think where we see it is that data orchestration is not only writing those data pipelines that move your data, but it's also all the things around it, right, so, CI/CD tooling and secrets management, et cetera. So, a long-winded answer here, but I think that's how we talk about it here at Astronomer and how we're building our products. >> Excellent. Great context, Paola. Thank you. Viraj, let's bring you into the conversation. Every company these days has to be a data company, right? They've got to be a software company- >> Mm-hmm. >> whether it's my bank or my grocery store. So, how are companies actually doing data orchestration today, Viraj?
And then, kind of like Paola was saying, Apache Airflow started in 2014, and it was actually started by Airbnb, and they put out this blog post that was like, "Hey here's how we use Apache Airflow to orchestrate our data across all their sources." And really since then, right, it's almost been a decade since then, Airflow emerged as the open source standard, and there's companies of all sorts using it. And, it's really used to tie all these tools together, especially as that number of tools increases, companies move to hybrid cloud, hybrid multi-cloud strategies, and so on and so forth. But you know, what we found is that if you go to any company, especially a larger one and you say like, "Hey, how are you doing data orchestration?" They'll probably say something like, "Well, I have five data teams, so I have eight different ways I do data orchestration." Right. This idea of data orchestration's been there but the right way to do it, kind of all the abstractions you need, the way your teams need to work together, and so on and so forth, hasn't really emerged just yet, right? It's such a quick moving space that companies have to combine what they were doing before with what their new business initiatives are today. So, you know, what we really believe here at Astronomer is Airflow is the core of how you solve data orchestration for any sort of use case, but it's not everything. You know, it needs a little more. And, that's really where our commercial product, Astro comes in, where we've built, not only the most tried and tested Airflow experience out there. We do employ a majority of the Airflow core committers, right? So, we're kind of really deep in the project. We've also built the right things around developer tooling, observability, and reliability for customers to really rely on Astro as the heart of the way they do data orchestration, and kind of think of it as the foundational layer that helps tie together all the different tools, practices and teams large companies have to do today.
Again, regardless of whether or not you're a customer of ours or not, we want to make sure that we continue to cultivate the Airflow project in and of itself. And, we're also building developer tooling that might not be a part of the Apache Open Source project, but is still open source. So, we have repositories in our own sort of GitHub organization, for example, with tools that individual data practitioners, again customers are not, can use to make them be more productive in their day-to-day jobs with Airflow writing Dags for the most common use cases out there. The last thing I'll say is how important I think we've found it to build sort of educational resources and documentation and best practices. Airflow can be complex. It's been around for a long time. There's a lot of really, really rich feature sets. And so, how do we enable folks to actually use those? And that comes in, you know, things like webinars, and best practices, and courses and curriculum that are free and accessible and open to the community are just some of the ways in which I think we're continuing to invest in that open source community over the next year and beyond. >> That's awesome. It sounds like open source is really core, not only to the mission, but really to the heart of the organization. Viraj, I want to go back to you and really try to understand how does Astronomer fit into the wider modern data stack and ecosystem? Like what does that look like for customers? >> Yeah, yeah. So, both in the open source and with our commercial customers, right? Folks everywhere are trying to tie together a huge variety of tools in order to start making sense of their data. And you know, I kind of think of it almost like as like a pyramid, right? At the base level, you need things like data reliability, data, sorry, data freshness, data availability, and so on and so forth, right? You just need your data to be there. (coughs) I'm sorry. You just need your data to be there, and you need to make it predictable when it's going to be there. You need to make sure it's kind of correct at the highest level, some quality checks, and so on and so forth. And oftentimes, that kind of takes the case of ELT or ETL use cases, right? Taking data from somewhere and moving it somewhere else, usually into some sort of analytics destination. And, that's really what businesses can do to just power the core parts of getting insights into how their business is going, right? How much revenue did I had? What's in my pipeline, salesforce, and so on and so forth. Once that kind of base foundation is there and people can get the data they need, how they need it, it really opens up a lot for what customers can do. You know, I think one of the trendier things out there right now is MLOps, and how do companies actually put machine learning into production? Well, when you think about it you kind of have to squint at it, right? Like, machine learning pipelines are really just any other data pipeline. They just have a certain set of needs that might not not be applicable to ELT pipelines. And, when you kind of have a common layer to tie together all the ways data can move through your organization, that's really what we're trying to make it so companies can do. And, that happens in financial services where, you know, we have some customers who take app data coming from their mobile apps, and actually run it through their fraud detection services to make sure that all the activity is not fraudulent. 
We have customers that will run sports betting models on our platform where they'll take data from a bunch of public APIs around different sporting events that are happening, transform all of that in a way their data scientist can build models with it, and then actually bet on sports based on that output. You know, one of my favorite use cases I like to talk about that we saw in the open source is there was one company whose business was to deliver blood transfusions via drone into remote parts of the world. And, it was really cool because they took all this data from all sorts of places, right? Kind of orchestrated all the aggregation and cleaning and analysis that had to happen via Airflow, and the end product would be a drone being shot out into a really remote part of the world to actually give somebody blood who needed it there. Because it turns out for certain parts of the world, the easiest way to deliver blood to them is via drone and not via some other, some other thing. So, these kind of, all the things people do with the modern data stack is absolutely incredible, right? Like you were saying, every company's trying to be a data-driven company. What really energizes me is knowing that like, for all those best, super great tools out there that power a business, we get to be the connective tissue, or the, almost like the electricity that kind of ropes them all together and makes so people can actually do what they need to do. >> Right. Phenomenal use cases that you just described, Raj. I mean, just the variety alone of what you guys are able to do and impact is so cool. So Paola, when you're with those data engineers, those data scientists, and customer conversations, what's your pitch? Why use Astro? >> Mm-hmm. Yeah, yeah, it's a good question. And honestly, to piggyback off of Viraj, there's so many. I think what keeps me so energized is how mission critical both our product and data orchestration is, and those use cases really are incredible and we work with customers of all shapes and sizes. But, to answer your question, right, so why use Astro? Why use our commercial products? There's so many people using open source, why pay for something more than that? So, you know, the baseline for our business really is that Airflow has grown exponentially over the last five years, and like we said has become an industry standard that we're confident there's a huge opportunity for us as a company and as a team. But, we also strongly believe that being great at running Airflow, you know, doesn't make you a successful company at what you do. What makes you a successful company at what you do is building great products and solving problems and solving pain points of your own customers, right? And, that differentiating value isn't being amazing at running Airflow. That should be our job. And so, we want to abstract those customers from needing to do things like manage Kubernetes infrastructure that you need to run Airflow, and then hiring someone full-time to go do that. Which can be hard, but again doesn't add differentiating value to your team, or to your product, or to your customers. So, folks want to get away from managing that infrastructure, sort of a base layer. Folks who are looking for differentiating features that make their team more productive and allows them to spend less time tweaking Airflow configurations and more time working with the data that they're getting from their business. For help staying up with Airflow releases.
There's a ton of, we've actually been pretty quick to come out with new Airflow features and releases, and actually just keeping up with that feature set and working strategically with a partner to help you make the most out of those feature sets is a key part of it. And, really it's, especially if you're an organization who currently is committed to using Airflow, you likely have a lot of Airflow environments across your organization. And, being able to see those Airflow environments in a single place and being able to enable your data practitioners to create Airflow environments with a click of a button, and then use, for example, our command line to develop your Airflow Dags locally and push them up to our product, and use all of the sort of testing and monitoring and observability that we have on top of our product is such a key. It sounds so simple, especially if you use Airflow, but really those things are, you know, baseline value props that we have for the customers that continue to be excited to work with us. And of course, I think we can go beyond that and there's, we have ambitions to add whole, a whole bunch of features and expand into different types of personas. >> Right? >> But really our main value prop is for companies who are committed to Airflow and want to abstract themselves and make use of some of the differentiating features that we now have at Astronomer. >> Got it. Awesome. >> Thank you. One thing, one thing I'll add to that, Paola, and I think you did a good job of saying is because every company's trying to be a data company, companies are at different parts of their journey along that, right? And we want to meet customers where they are, and take them through it to where they want to go. So, on one end you have folks who are like, "Hey, we're just building a data team here. We have a new initiative. We heard about Airflow. How do you help us out?" On the farther end, you know, we have some customers that have been using Airflow for five plus years and they're like, "Hey, this is awesome. We have 10 more teams we want to bring on. How can you help with this? How can we do more stuff in the open source with you? How can we tell our story together?" And, it's all about kind of taking this vast community of data users everywhere, seeing where they're at, and saying like, "Hey, Astro and Airflow can take you to the next place that you want to go." >> Which is incredibly- >> Mm-hmm. >> and you bring up a great point, Viraj, that every company is somewhere in a different place on that journey. And it's, and it's complex. But it sounds to me like a lot of what you're doing is really stripping away a lot of the complexity, really enabling folks to use their data as quickly as possible, so that it's relevant and they can serve up, you know, the right products and services to whoever wants what. Really incredibly important. We're almost out of time, but I'd love to get both of your perspectives on what's next for Astronomer. You give us a a great overview of what the company's doing, the value in it for customers. Paola, from your lens as one of the co-founders, what's next? >> Yeah, I mean, I think we'll continue to, I think cultivate in that open source community. I think we'll continue to build products that are open sourced as part of our ecosystem. I also think that we'll continue to build products that actually make Airflow, and getting started with Airflow, more accessible. 
So, sort of lowering that barrier to entry to our products, whether that's price-wise or infrastructure-requirement-wise. I think making it easier for folks to get started and get their hands on our product is super important for us this year. And really, for us, it's about focused execution this year and all of the core principles that we've been talking about, and continuing to invest in all of the things around our product that, again, enable teams to use Airflow more effectively and efficiently. >> And that efficiency piece is something everybody needs. Last question, Viraj, for you. What do you see in terms of the next year for Astronomer and for your role? >> Yeah, you know, I think Paola did a really good job of laying it out. So it's really hard to disagree with her on anything, right? I think executing is definitely the most important thing. My own personal bias on that is I think more than ever it's important to really galvanize the community around Airflow. So, we're going to be focusing on that a lot. We want to make it easier for our users to get our product into their hands, be they open source users or commercial users. And last, but certainly not least, we're also really excited about data lineage and this other open source project under our umbrella called OpenLineage, which makes it so that there's a standard way for users to get lineage out of the different systems that they use. When we think about what's in store for data lineage and needing to audit the way automated decisions are being made, you know, I think that's just such an important thing that companies are really just getting started with, and I don't think there's a solution that's emerged that ties it all together. So, we think that as we grow the role of Airflow, right, we can also make it so that we're helping customers solve their lineage problems all in Astro, which is kind of the best of both worlds for us. >> Awesome. I can definitely feel and hear the enthusiasm and the passion that you both bring to Astronomer, to your customers, to your team. I love it. We could keep talking more and more, so you're going to have to come back. (laughing) Viraj, Paola, thank you so much for joining me today on this showcase conversation. We really appreciate your insights and all the context that you provided about Astronomer. >> Thank you so much for having us. >> My pleasure. For my guests, I'm Lisa Martin. You're watching this CUBE conversation. (soft electronic music)
SUMMARY :
Lisa Martin talks with Astronomer's Viraj Parekh and Paola Peraza Calderon about what the company does, the problems its data orchestration platform solves, and how it helps companies trying to become data-driven. They cover customer use cases ranging from sports betting models to drone-delivered blood transfusions, why teams committed to Apache Airflow adopt Astro rather than managing the infrastructure themselves, and what's next for the company: lowering the barrier to entry, focused execution, galvanizing the Airflow community, and investing in data lineage through OpenLineage.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Viraj Parekh | PERSON | 0.99+ |
Lisa Martin | PERSON | 0.99+ |
Paola | PERSON | 0.99+ |
Viraj | PERSON | 0.99+ |
2014 | DATE | 0.99+ |
Astronomer | ORGANIZATION | 0.99+ |
Paola Peraza-Calderon | PERSON | 0.99+ |
Paola Peraza Calderon | PERSON | 0.99+ |
Airflow | ORGANIZATION | 0.99+ |
Airbnb | ORGANIZATION | 0.99+ |
five plus years | QUANTITY | 0.99+ |
Astro | ORGANIZATION | 0.99+ |
Raj | PERSON | 0.99+ |
Uzi | ORGANIZATION | 0.99+ |
first | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
Kron | ORGANIZATION | 0.99+ |
10 more teams | QUANTITY | 0.98+ |
Astronomers | ORGANIZATION | 0.98+ |
Astra | ORGANIZATION | 0.98+ |
one | QUANTITY | 0.98+ |
Airflow | TITLE | 0.98+ |
Informatics | ORGANIZATION | 0.98+ |
Monte Carlo | TITLE | 0.98+ |
this year | DATE | 0.98+ |
HubSpot | ORGANIZATION | 0.98+ |
one company | QUANTITY | 0.97+ |
Astronomer | TITLE | 0.97+ |
next year | DATE | 0.97+ |
Apache | ORGANIZATION | 0.97+ |
Airflow Summit | EVENT | 0.97+ |
AWS | ORGANIZATION | 0.95+ |
both worlds | QUANTITY | 0.93+ |
KRON | ORGANIZATION | 0.93+ |
CUBE | ORGANIZATION | 0.92+ |
M | ORGANIZATION | 0.92+ |
Redshift | TITLE | 0.91+ |
Snowflake | TITLE | 0.91+ |
five data teams | QUANTITY | 0.91+ |
GitHub | ORGANIZATION | 0.91+ |
Oozie | ORGANIZATION | 0.9+ |
Data Lineage | ORGANIZATION | 0.9+ |
AWS Startup Showcase S3E1
(upbeat electronic music) >> Hello everyone, welcome to this CUBE conversation here from theCUBE studios in Palo Alto, California. I'm John Furrier, your host. We're featuring a startup, Astronomer. Astronomer.io is the URL, check it out. And we're going to have a great conversation around one of the most important topics hitting the industry, and that is the future of machine learning and AI, and the data that powers it underneath. There's a lot of things that need to get done, and we're excited to have some of the co-founders of Astronomer here. Viraj Parekh, who is co-founder of Astronomer, and Paola Peraza Calderon, another co-founder, both with Astronomer. Thanks for coming on. First of all, how many co-founders do you guys have? >> You know, I think the answer's around six or seven. I forget the exact number, but there's really been a lot of people around the table who've worked very hard to get this company to the point that it's at. We have a long way to go, right? But there's been a lot of people involved that have been absolutely necessary for the path we've been on so far. >> Thanks for that, Viraj, appreciate that. The first question I want to get out on the table, and then we'll get into some of the details, is take a minute to explain what you guys are doing. How did you guys get here? Obviously, multiple co-founders, sounds like a great project. The timing couldn't have been better. ChatGPT has essentially done so much public relations for the AI industry to kind of highlight this shift that's happening. It's real, we've been chronicling it, take a minute to explain what you guys do. >> Yeah, sure, we can get started. So, yeah, when Viraj and I joined Astronomer in 2017, we really wanted to build a business around data, and we were using an open source project called Apache Airflow that we were just using sort of as customers ourselves. And over time, we realized that there was actually a market for companies who use Apache Airflow, which is a data pipeline management tool, which we'll get into, and that running Airflow is actually quite challenging, and that there's a big opportunity for us to create a set of commercial products and an opportunity to grow that open source community and actually build a company around that. So the crux of what we do is help companies run data pipelines with Apache Airflow. And certainly we've grown in our ambitions beyond that, but that's sort of the crux of what we do for folks. >> You know, data orchestration, data management has always been a big item in the old classic data infrastructure. But with AI, you're seeing a lot more emphasis on scale, tuning, training. Data orchestration is the center of the value proposition, when you're looking at coordinating resources, it's one of the most important things. Can you guys explain what data orchestration entails? What does it mean? Take us through the definition of what data orchestration entails. >> Yeah, for sure. I can take this one, and Viraj, feel free to jump in. So if you google data orchestration, here's what you're going to get. You're going to get something that says, "Data orchestration is the automated process for organizing siloed data from numerous data storage points, standardizing it, and making it accessible and prepared for data analysis." And you say, "Okay, but what does that actually mean," right? And so let's give sort of an example. So let's say you're a business and you have sort of the following basic asks of your data team, right?
Okay, give me a dashboard in Sigma, for example, for the number of customers or monthly active users, and then make sure that that gets updated on an hourly basis. And then number two, a consistent list of active customers that I have in HubSpot so that I can send them a monthly product newsletter, right? Two very basic asks for all sorts of companies and organizations. And when that data team, which has data engineers, data scientists, ML engineers, and data analysts, gets that request, they're looking at an ecosystem of data sources that can help them get there, right? And that includes application databases, for example, that actually have in-product user behavior, and third-party APIs from tools that the company uses that also have different attributes and qualities of those customers or users. And that data team needs to use tools like Fivetran to ingest data, a data warehouse like Snowflake or Databricks to actually store that data and do analysis on top of it, a tool like dbt to do transformations and make sure that data is standardized in the way that it needs to be, a tool like Hightouch for reverse ETL. I mean, we could go on and on. There's so many partners of ours in this industry that are doing really, really exciting and critical things for those data movements. And the whole point here is that data teams have this plethora of tooling that they use to both ingest the right data and come up with the right interfaces to transform and interact with that data. And data orchestration, in our view, is really the heartbeat of all of those processes, right? And tangibly the unit of data orchestration is a data pipeline, a set of tasks or jobs that each do something with data over time and eventually run that on a schedule to make sure that those things are happening continuously as time moves on and the company advances. And so, for us, we're building a business around Apache Airflow, which is a workflow management tool that allows you to author, run, and monitor data pipelines. And so when we talk about data orchestration, we talk about sort of two things. One is that crux of data pipelines that, like I said, connect that large ecosystem of data tooling in your company. But number two, it's not just that data pipeline that needs to run every day, right? And Viraj will probably touch on this as we talk more about Astronomer and our value prop on top of Airflow. But then it's all the things that you need to actually run data in production and make sure that it's trustworthy, right? So it's actually not just that you're running things on a schedule, but it's also things like CI/CD tooling, secure secrets management, user permissions, monitoring, data lineage, documentation, things that enable other personas in your data team to actually use those tools. So, long-winded way of saying that it's the heartbeat, we think, of the data ecosystem, and it certainly goes beyond scheduling, but again, data pipelines are really at the center of it.
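To make that concrete, here is a minimal sketch of what a DAG for the hourly active-users example above might look like, using Airflow 2's TaskFlow API. The task names and function bodies are illustrative placeholders standing in for the Fivetran, dbt, and reverse-ETL steps Paola describes, not anything Astronomer actually ships, and on Airflow versions before 2.4 the schedule argument is called schedule_interval.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    schedule="@hourly",            # the "update it every hour" ask from the example
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=["example"],
)
def active_users_pipeline():
    @task
    def ingest_events() -> list:
        # Placeholder for pulling raw usage events from an application
        # database or third-party API (a Fivetran sync in practice).
        return [{"user_id": 1, "event": "login"}]

    @task
    def transform(events: list) -> list:
        # Placeholder for standardizing the data, e.g. a dbt model run
        # against Snowflake or Databricks.
        return [e for e in events if e["event"] == "login"]

    @task
    def publish(active_users: list) -> None:
        # Placeholder for refreshing the Sigma dashboard and syncing the
        # active-customer list to HubSpot via a reverse-ETL tool.
        print(f"publishing {len(active_users)} active users")

    publish(transform(ingest_events()))


active_users_pipeline()
```

Each decorated function becomes a task, the call chain defines the dependency graph, and the scheduler runs the whole thing every hour, which is exactly the "set of tasks or jobs run on a schedule" described above.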
>> One of the things that jumped out, Viraj, if you can get into this, I'd like to hear more about how you guys look at all those little tools that are out there. You mentioned a variety of things. You look at the data infrastructure, it's not just one stack. You've got an analytics stack, you've got a realtime stack, you've got a data lake stack, you've got an AI stack potentially. I mean, you have these stacks now emerging in the data world that are fundamental, that were once served by either a full package of old-school software or a bunch of point solutions. You mentioned Fivetran there, I would say in the analytics stack. Then you've got S3 on the data lake stack. So all these things are kind of munged together. >> Yeah. >> How do you guys fit into that world? You make it easier, or like, what's the deal? >> Great question, right? And you know, I think that one of the biggest things we've found in working with customers over the last however many years is that if a data team is using a bunch of tools to get what they need done, and the number of tools they're using is growing exponentially, and they're kind of roping things together here and there, that's actually a sign of a productive team, not a bad thing, right? It's because that team is moving fast. They have needs that are very specific to them, and they're trying to make something that's exactly tailored to their business. So a lot of times what we find is that customers have some sort of base layer, right? It might be that they're running most of their things in AWS, right? And then on top of that, they'll be using some of the things AWS offers, things like SageMaker, Redshift, whatever, but they also might need things that their cloud can't provide, something like Fivetran, or Hightouch, those other tools. And where data orchestration really shines, and something that we've had the pleasure of helping our customers build, is how do you take all those requirements, all those different tools, and whip them together into something that fulfills a business need? So that somebody can read a dashboard and trust the number that it says, or somebody can make sure that the right emails go out to their customers. And Airflow serves as this amazing kind of glue between that data stack, right? It's to make it so that for any use case, be it ELT pipelines, or machine learning, or whatever, you need different things to do them, and Airflow helps tie them together in a way that's really specific for an individual business's needs. >> Take a step back and share the journey of what you guys went through as a company startup. So you mentioned Apache, open source. I was just having an interview with a VC, we were talking about foundational models. You've got a lot of proprietary and open source development going on. It's almost the iPhone/Android moment in this whole generative space and foundational side. This is kind of important, the open source piece of it. Can you share how you guys started? And I can imagine your customers probably have their hair on fire and are probably building stuff on their own. Are you guys helping them? Take us through, 'cause you guys are on the front end of a big, big wave, and that is to make sense of the chaos, rein it in. Take us through your journey and why this is important. >> Yeah, Paola, I can take a crack at this, then I'll kind of hand it over to you to fill in whatever I miss in details. But you know, like Paola is saying, the heart of our company is open source, because we started using Airflow as an end user and started to say, "Hey, wait a second, more and more people need this." Airflow, for background, started at Airbnb, and they were actually using it as a foundation for their whole data stack, kind of how they made it so that they could give you recommendations, and predictions, and all of the processes that needed to be orchestrated. Airbnb created Airflow, gave it away to the public, and then fast forward a couple years and we're building a company around it, and we're really excited about that. >> That's a beautiful thing.
That's exactly why open source is so great. >> Yeah, yeah. And for us, it's really been about watching the community and our customers take these problems, find a solution to those problems, standardize those solutions, and then build on top of that, right? So we're reaching a point where a lot of our earlier customers who started out just using Airflow to get the base of their BI stack down and their reporting and ELT infrastructure in place, they've solved that problem and now they're moving on to things like doing machine learning with their data, because now that they've built that foundation, all the connective tissue for their data arriving on time and being orchestrated correctly is happening, and they can build a layer on top of that. And it's just been really, really exciting kind of watching what customers do once they're empowered to pick all the tools that they need, tie them together in the way they need to, and really deliver real value to their business. >> Can you share some of the use cases of these customers? Because I think that's where you're starting to see the innovation. What are some of the companies that you're working with, what are they doing? >> Viraj, I'll let you take that one too. (group laughs) >> So you know, a lot of it is... It goes across the gamut, right? Because it doesn't matter what you are or what you're doing with data, it needs to be orchestrated. So there's a lot of customers using us for their ETL and ELT reporting, right? Just getting data from disparate sources into one place and then building on top of that, be it building dashboards, answering questions for the business, building other data products, and so on and so forth. From there, these use cases evolve a lot. You do see folks doing things like fraud detection, because Airflow's orchestrating how transactions go and transactions get analyzed. They do things like analyzing marketing spend to see where your highest ROI is. And then you kind of can't not talk about all of the machine learning that goes on, right? Where customers are taking data about their own customers, kind of analyzing and aggregating that at scale, and trying to automate decision making processes. So it goes from your most basic, what we call data plumbing, right? Just to make sure data's moving as needed, all the way to your more exciting, expansive use cases around automated decision making and machine learning.
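One way to picture that progression from data plumbing to machine learning is a DAG where a training step simply hangs off the same orchestrated foundation. The sketch below is illustrative only; the table name, task names, and training logic are hypothetical stand-ins rather than a real customer pipeline.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def churn_model_pipeline():
    @task
    def ingest_transactions() -> str:
        # The "data plumbing" layer: land raw transactions in the warehouse
        # and hand the table name to downstream tasks.
        return "analytics.daily_transactions"

    @task
    def validate(table: str) -> str:
        # A data-quality gate so the model never trains on bad data,
        # e.g. a dbt test or a Great Expectations suite in practice.
        return table

    @task
    def train_model(table: str) -> None:
        # Hypothetical training step; it could call SageMaker, a notebook
        # job, or an in-cluster script. The point is that it only runs once
        # the upstream data tasks have succeeded.
        print(f"training churn model on {table}")

    train_model(validate(ingest_transactions()))


churn_model_pipeline()
```

Because the training task sits downstream of ingestion and validation, the connective tissue Viraj describes, data arriving on time and in the right shape, is enforced by the scheduler rather than by ad hoc scripts.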
>> And I'd say, I mean, I'd say that's one of the things that I think gets me most excited about our future, is how critical Airflow is to all of those processes. And I think when you know a tool is valuable is when something goes wrong and one of those critical processes doesn't work. And we know that our system is so mission critical to answering basic questions about your business and the growth of your company for so many organizations that we work with. So I think one of the things that gets Viraj and I and the rest of our company up every single morning is knowing how important the work that we do is for all of those use cases across industries and across company sizes, and it's really quite energizing. >> It was such a big focus this year at AWS re:Invent, the role of data. And I think one of the things that's exciting about the open AI movement and all the movement towards large language models is that you can integrate data into these models from outside. So you're starting to see the integration getting easier to deal with. Still a lot of plumbing issues. So a lot of things happening. So I have to ask you guys, what is the state of the data orchestration area? Is it ready for disruption? Has it already been disrupted? Would you categorize it as a new, first inning kind of opportunity, or what's the state of the data orchestration area right now? Both technically and from a business model standpoint. How would you guys describe that state of the market? >> Yeah, I mean, I think in a lot of ways, in some ways I think we're category creating. Schedulers have been around for a long time. I recently did a presentation sort of on the evolution of going from something like cron, which I think was built in like the 1970s out of Carnegie Mellon. And that's a long time ago, that's 50 years ago. So sort of like the basic need to schedule and do something with your data on a schedule is not a new concept. But to our point earlier, I think everything that you need around your ecosystem, first of all, the number of data tools and developer tooling that has come out of the industry has 5X'd over the last 10 years. And so obviously as that ecosystem grows, and grows, and grows, and grows, the need for orchestration only increases. And I think, as Astronomer, we work with so many different types of companies, companies that have been around for 50 years, and companies that got started not even 12 months ago. And so I think for us it's trying to, in a way, category create and adjust sort of what we sell and the value that we can provide for companies all across that journey. There are folks who are just getting started with orchestration, and then there's folks who have such advanced use cases, 'cause they're hitting sort of a ceiling and only want to go up from there. And so I think we, as a company, care about both ends of that spectrum, and certainly want to build and continue building products for companies of all sorts, regardless of where they are on the maturity curve of data orchestration.
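The cron comparison is a useful one, because it shows what orchestration adds beyond "run this command on a schedule." Below is a hedged sketch of the same hourly job both ways; the script paths, email address, and retry settings are placeholder values, and on Airflow versions before 2.4 the schedule argument is called schedule_interval.

```python
# The cron version is a schedule and nothing else:
#   0 * * * *  /opt/jobs/refresh_dashboard.sh --target sigma
#
# The orchestrated version of the same hourly job:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="refresh_dashboard",
    schedule="@hourly",
    start_date=datetime(2023, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,                        # rerun automatically on failure
        "retry_delay": timedelta(minutes=5),
        "email_on_failure": True,            # alert someone if it still fails
        "email": ["data-team@example.com"],  # placeholder address
    },
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="/opt/jobs/extract.sh --window hourly",  # placeholder script
    )
    refresh = BashOperator(
        task_id="refresh",
        bash_command="/opt/jobs/refresh_dashboard.sh --target sigma",  # placeholder script
    )

    # The dependency cron cannot express: don't refresh until extract succeeds.
    extract >> refresh
```

Same one-line schedule, but a failed extract now blocks the refresh instead of silently publishing stale numbers, failed runs retry on their own, and the run history and logs show up in the Airflow UI.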
>> That's a really good point, Paola. And I think the other thing to really take into account is it's not just the companies themselves, but also the individuals who have to do their jobs. If you rewind the clock like 5 or 10 years ago, data engineers would be the ones responsible for orchestrating data through their org. But when we look at our customers today, it's not just data engineers anymore. There's data analysts who sit a lot closer to the business, and the data scientists who want to automate things around their models. So this idea that orchestration is this new category is right on the money. And what we're finding is the need for it is spreading to all parts of the data team, naturally, where Airflow's emerged as an open source standard, and we're hoping to take things to the next level. >> That's awesome. We've been saying that the data market's kind of like the SRE with servers, right? You're going to need one person to deal with a lot of data, and that's data engineering, and then you've got to have the practitioners, the democratization. Clearly that's coming in what you're seeing. So I have to ask, how do you guys fit in from a value proposition standpoint? What's the pitch that you have to customers, or is it more inbound coming into you guys? Are you guys doing a lot of outreach, customer engagements? I'm sure you're getting a lot of great requirements from customers. What's the current value proposition? How do you guys engage? >> Yeah, I mean, there's so many... Sorry, Viraj, you can jump in. So there's so many companies using Airflow, right? So the baseline is that the open source project that is Airflow, which came out of Airbnb over five years ago at this point, has grown exponentially in users and continues to grow. And so the folks that we sell to primarily are folks who are already committed to using Apache Airflow, need data orchestration in their organization, and just want to do it better, want to do it more efficiently, want to do it without managing that infrastructure. And so our baseline proposition is for those organizations. Now to Viraj's point, obviously I think our ambitions go beyond that, both in terms of the personas that we address and going beyond that data engineer, but really it's to start at the baseline. As we continue to grow our company, it's really making sure that we're adding value to folks using Airflow and helping them do so in a better way, in a larger way, in a more efficient way, and that's really the crux of who we sell to. And so to answer your question, we get a lot of inbound because there are so many... >> You have a built-in audience. (laughs) >> ...people in the world that use it. Those are the folks who we talk to and come to our website and chat with us and get value from our content. I mean, the power of the open source community is really just so, so big, and I think that's also one of the things that makes this job fun. >> And you guys are in a great position. Viraj, you can comment a little, get your reaction. There's been a big successful business model in starting a company around these big projects for a lot of reasons. One is open source is continuing to be great, but there's also supply chain challenges in there. There's also the fact that we want to continue more innovation and more code and keep it free and flowing. And then there's the commercialization of productizing it, operationalizing it. This is a huge new dynamic, I mean, in the past 5 or so years, 10 years, it's been happening all across CNCF and other areas like Apache and the Linux Foundation, they're all implementing this. This is a huge opportunity for entrepreneurs to do this. >> Yeah, yeah. Open source is always going to be core to what we do, because we wouldn't exist without the open source community around us. They are huge in numbers. Oftentimes they're nameless people who are working on making something better in a way that everybody benefits from. But open source is really hard, especially if you're a company whose core competency is running a business, right? Maybe you're running an e-commerce business, or maybe you're running, I don't know, any sort of business. Especially if you're a company running a business, you don't really want to spend your time figuring out how to run open source software. You just want to use it, you want to use the best of it, you want to use the community around it, you want to be able to google something and get answers for it, you want the benefits of open source. You don't have the time or the resources to invest in becoming an expert in open source, right? And I think that dynamic is really what's given companies like us an ability to kind of form businesses around that, in the sense that we'll make it so people get the best of both worlds. You'll get this vast open ecosystem that you can build on top of, that you can benefit from, that you can learn from. But you won't have to spend your time doing undifferentiated heavy lifting. You can do things that are just specific to your business. >> It's always been great to see that business model evolve.
We used to debate 10 years ago, can there be another Red Hat? And we said, not really the same, but there'll be a lot of little ones that'll grow up to be big soon. Great stuff. Final question, can you guys share the history of the company? The milestones of Astronomer's journey in data orchestration? >> Yeah, we could. So yeah, I mean, I think, so Viraj and I have obviously been at Astronomer along with our other founding team and leadership folks for over five years now. And it's been such an incredible journey of learning, of hiring really amazing people, solving, again, mission critical problems for so many types of organizations. We've had some funding that has allowed us to invest in the team that we have and in the software that we have, and that's been really phenomenal. And so that investment, I think, keeps us confident, even despite these sort of macroeconomic conditions that we're finding ourselves in. And so honestly, the milestones for us are focusing on our product, focusing on our customers over the next year, focusing on that market that we know can get value out of what we do, and making developers' lives better, and growing the open source community and making sure that everything that we're doing makes it easier for folks to get started, to contribute to the project, and to feel a part of the community that we're cultivating here. >> You guys raised a little bit of money. How much have you guys raised? >> I don't know what the total is, but it's in the ballpark of over $200 million. It feels good to... >> A little bit of capital. Got a little bit of cap to work with there. Great success. I know the Series C financing is done, so you're up and running. What's next? What are you guys looking to do? What does the big horizon look like for you from a vision standpoint? More hiring, more product, what are some of the key things you're looking at doing? >> Yeah, it's really a little of all of the above, right? Kind of one of the best and worst things about working at earlier stage startups is there's always so much to do and you often have to just kind of figure out a way to get everything done. But really, it's investing in our product over the next year and over the course of our company's lifetime. And there's a lot of ways we want to make it more accessible to users, easier to get started with, easier to use, kind of on all areas there. And really, we really want to do more for the community, right? Like I was saying, we wouldn't be anything without the large open source community around us. And we want to figure out ways to give back more in more creative ways, in more code-driven ways, in more kinds of events, and everything else that we can do to keep those folks galvanized and just keep them happy using Airflow. >> Paola, any final words as we close out? >> No, I mean, I'm super excited. I think we'll keep growing the team this year. We've got a couple of offices in the US, which we're excited about, and a fully global team that will only continue to grow. So Viraj and I are both here in New York, and we're excited to be engaging with our coworkers in person finally, after years of not doing so. We've got a bustling office in San Francisco as well. So growing those teams and continuing to hire all over the world, and really focusing on our product and the open source community is where our heads are at this year. So, excited. >> Congratulations. 200 million in funding, plus. Good runway, put that money in the bank, squirrel it away.
It's a good time to kind of get some good interest on it, but still grow. Congratulations on all the work you guys do. We appreciate you, and the open source community does too, and good luck with the venture. Continue to be successful, and we'll see you at the Startup Showcase. >> Thank you. >> Yeah, thanks so much, John. Appreciate it. >> Okay, that's the CUBE Conversation featuring astronomer.io, that's the website. Astronomer is doing well. Multiple rounds of funding, over 200 million in funding. Open source continues to lead the way in innovation. Great business model, good solution for the next-gen, cloud-scale data operations and the data stacks that are emerging. I'm John Furrier, your host, thanks for watching. (soft upbeat music)
SUMMARY :
John Furrier talks with Astronomer co-founders Viraj Parekh and Paola Peraza Calderon about data orchestration and Apache Airflow. They explain how Astronomer grew out of using Airflow themselves, define data orchestration as the data pipelines that tie together tools like Fivetran, Snowflake, dbt, and Hightouch, and describe use cases from ETL and reporting to fraud detection and machine learning. They discuss the state of the market, the spread of orchestration beyond data engineers, Astronomer's value proposition for teams already committed to Airflow, the open source business model, and the company's funding of over $200 million and plans to keep investing in its product and community.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Viraj Parekh | PERSON | 0.99+ |
Paola | PERSON | 0.99+ |
Viraj | PERSON | 0.99+ |
John | PERSON | 0.99+ |
John Furrier | PERSON | 0.99+ |
Airbnb | ORGANIZATION | 0.99+ |
2017 | DATE | 0.99+ |
San Francisco | LOCATION | 0.99+ |
New York | LOCATION | 0.99+ |
Apache | ORGANIZATION | 0.99+ |
US | LOCATION | 0.99+ |
Two | QUANTITY | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Paola Peraza Calderon | PERSON | 0.99+ |
1970s | DATE | 0.99+ |
first question | QUANTITY | 0.99+ |
Palo Alto, California | LOCATION | 0.99+ |
iPhone | COMMERCIAL_ITEM | 0.99+ |
Airflow | TITLE | 0.99+ |
both | QUANTITY | 0.99+ |
Linux Foundation | ORGANIZATION | 0.99+ |
200 million | QUANTITY | 0.99+ |
Astronomer | ORGANIZATION | 0.99+ |
One | QUANTITY | 0.99+ |
over 200 million | QUANTITY | 0.99+ |
over $200 million | QUANTITY | 0.99+ |
this year | DATE | 0.99+ |
10 years ago | DATE | 0.99+ |
HubSpot | ORGANIZATION | 0.98+ |
Fivetran | ORGANIZATION | 0.98+ |
50 years ago | DATE | 0.98+ |
over five years | QUANTITY | 0.98+ |
one stack | QUANTITY | 0.98+ |
12 months ago | DATE | 0.98+ |
10 years | QUANTITY | 0.97+ |
Both | QUANTITY | 0.97+ |
Apache Airflow | TITLE | 0.97+ |
both worlds | QUANTITY | 0.97+ |
CNCF | ORGANIZATION | 0.97+ |
one | QUANTITY | 0.97+ |
ChatGPT | ORGANIZATION | 0.97+ |
5 | DATE | 0.97+ |
next year | DATE | 0.96+ |
Astromer | ORGANIZATION | 0.96+ |
today | DATE | 0.95+ |
5X | QUANTITY | 0.95+ |
over five years ago | DATE | 0.95+ |
CUBE | ORGANIZATION | 0.94+ |
two things | QUANTITY | 0.94+ |
each | QUANTITY | 0.93+ |
one person | QUANTITY | 0.93+ |
First | QUANTITY | 0.92+ |
S3 | TITLE | 0.91+ |
Carnegie Mellon | ORGANIZATION | 0.91+ |
Startup Showcase | EVENT | 0.91+ |
AWS Startup Showcase S3E1
(soft music) >> Hello everyone, welcome to this Cube conversation here from the studios of theCube in Palo Alto, California. John Furrier, your host. We're featuring a startup, Astronomer, astronomer.io is the url. Check it out. And we're going to have a great conversation around one of the most important topics hitting the industry, and that is the future of machine learning and AI and the data that powers it underneath it. There's a lot of things that need to get done, and we're excited to have some of the co-founders of Astronomer here. Viraj Parekh, who is co-founder and Paola Peraza Calderon, another co-founder, both with Astronomer. Thanks for coming on. First of all, how many co-founders do you guys have? >> You know, I think the answer's around six or seven. I forget the exact, but there's really been a lot of people around the table, who've worked very hard to get this company to the point that it's at. And we have long ways to go, right? But there's been a lot of people involved that are, have been absolutely necessary for the path we've been on so far. >> Thanks for that, Viraj, appreciate that. The first question I want to get out on the table, and then we'll get into some of the details, is take a minute to explain what you guys are doing. How did you guys get here? Obviously, multiple co-founders sounds like a great project. The timing couldn't have been better. ChatGPT has essentially done so much public relations for the AI industry. Kind of highlight this shift that's happening. It's real. We've been chronologicalizing, take a minute to explain what you guys do. >> Yeah, sure. We can get started. So yeah, when Astronomer, when Viraj and I joined Astronomer in 2017, we really wanted to build a business around data and we were using an open source project called Apache Airflow, that we were just using sort of as customers ourselves. And over time, we realized that there was actually a market for companies who use Apache Airflow, which is a data pipeline management tool, which we'll get into. And that running Airflow is actually quite challenging and that there's a lot of, a big opportunity for us to create a set of commercial products and opportunity to grow that open source community and actually build a company around that. So the crux of what we do is help companies run data pipelines with Apache Airflow. And certainly we've grown in our ambitions beyond that, but that's sort of the crux of what we do for folks. >> You know, data orchestration, data management has always been a big item, you know, in the old classic data infrastructure. But with AI you're seeing a lot more emphasis on scale, tuning, training. You know, data orchestration is the center of the value proposition when you're looking at coordinating resources, it's one of the most important things. Could you guys explain what data orchestration entails? What does it mean? Take us through the definition of what data orchestration entails. >> Yeah, for sure. I can take this one and Viraj feel free to jump in. So if you google data orchestration, you know, here's what you're going to get. You're going to get something that says, data orchestration is the automated process for organizing silo data from numerous data storage points to organizing it and making it accessible and prepared for data analysis. And you say, okay, but what does that actually mean, right? And so let's give sort of an example. So let's say you're a business and you have sort of the following basic asks of your data team, right? 
Hey, give me a dashboard in Sigma, for example, for the number of customers or monthly active users and then make sure that that gets updated on an hourly basis. And then number two, a consistent list of active customers that I have in HubSpot so that I can send them a monthly product newsletter, right? Two very basic asks for all sorts of companies and organizations. And when that data team, which has data engineers, data scientists, ML engineers, data analysts get that request, they're looking at an ecosystem of data sources that can help them get there, right? And that includes application databases, for example, that actually have end product user behavior and third party APIs from tools that the company uses that also has different attributes and qualities of those customers or users. And that data team needs to use tools like Fivetran, to ingest data, a data warehouse like Snowflake or Databricks to actually store that data and do analysis on top of it, a tool like DBT to do transformations and make sure that that data is standardized in the way that it needs to be, a tool like Hightouch for reverse ETL. I mean, we could go on and on. There's so many partners of ours in this industry that are doing really, really exciting and critical things for those data movements. And the whole point here is that, you know, data teams have this plethora of tooling that they use to both ingest the right data and come up with the right interfaces to transform and interact with that data. And data orchestration in our view is really the heartbeat of all of those processes, right? And tangibly the unit of data orchestration, you know, is a data pipeline, a set of tasks or jobs that each do something with data over time and eventually run that on a schedule to make sure that those things are happening continuously as time moves on. And, you know, the company advances. And so, you know, for us, we're building a business around Apache Airflow, which is a workflow management tool that allows you to author, run and monitor data pipelines. And so when we talk about data orchestration, we talk about sort of two things. One is that crux of data pipelines that, like I said, connect that large ecosystem of data tooling in your company. But number two, it's not just that data pipeline that needs to run every day, right? And Viraj will probably touch on this as we talk more about Astronomer and our value prop on top of Airflow. But then it's all the things that you need to actually run data and production and make sure that it's trustworthy, right? So it's actually not just that you're running things on a schedule, but it's also things like CI/CD tooling, right? Secure secrets management, user permissions, monitoring, data lineage, documentation, things that enable other personas in your data team to actually use those tools. So long-winded way of saying that, it's the heartbeat that we think of the data ecosystem and certainly goes beyond scheduling, but again, data pipelines are really at the center of it. >> You know, one of the things that jumped out Viraj, if you can get into this, I'd like to hear more about how you guys look at all those little tools that are out there. You mentioned a variety of things. You know, if you look at the data infrastructure, it's not just one stack. You've got an analytic stack, you've got a realtime stack, you've got a data lake stack, you got an AI stack potentially. I mean you have these stacks now emerging in the data world that are >> Yeah. 
- >> fundamental, but we're once served by either a full package, old school software, and then a bunch of point solution. You mentioned Fivetran there, I would say in the analytics stack. Then you got, you know, S3, they're on the data lake stack. So all these things are kind of munged together. >> Yeah. >> How do you guys fit into that world? You make it easier or like, what's the deal? >> Great question, right? And you know, I think that one of the biggest things we've found in working with customers over, you know, the last however many years, is that like if a data team is using a bunch of tools to get what they need done and the number of tools they're using is growing exponentially and they're kind of roping things together here and there, that's actually a sign of a productive team, not a bad thing, right? It's because that team is moving fast. They have needs that are very specific to them and they're trying to make something that's exactly tailored to their business. So a lot of times what we find is that customers have like some sort of base layer, right? That's kind of like, you know, it might be they're running most of the things in AWS, right? And then on top of that, they'll be using some of the things AWS offers, you know, things like SageMaker, Redshift, whatever. But they also might need things that their Cloud can't provide, you know, something like Fivetran or Hightouch or anything of those other tools and where data orchestration really shines, right? And something that we've had the pleasure of helping our customers build, is how do you take all those requirements, all those different tools and whip them together into something that fulfills a business need, right? Something that makes it so that somebody can read a dashboard and trust the number that it says or somebody can make sure that the right emails go out to their customers. And Airflow serves as this amazing kind of glue between that data stack, right? It's to make it so that for any use case, be it ELT pipelines or machine learning or whatever, you need different things to do them and Airflow helps tie them together in a way that's really specific for a individual business's needs. >> Take a step back and share the journey of what your guys went through as a company startup. So you mentioned Apache open source, you know, we were just, I was just having an interview with the VC, we were talking about foundational models. You got a lot of proprietary and open source development going on. It's almost the iPhone, Android moment in this whole generative space and foundational side. This is kind of important, the open source piece of it. Can you share how you guys started? And I can imagine your customers probably have their hair on fire and are probably building stuff on their own. How do you guys, are you guys helping them? Take us through, 'cuz you guys are on the front end of a big, big wave and that is to make sense of the chaos, reigning it in. Take us through your journey and why this is important. >> Yeah Paola, I can take a crack at this and then I'll kind of hand it over to you to fill in whatever I miss in details. But you know, like Paola is saying, the heart of our company is open source because we started using Airflow as an end user and started to say like, "Hey wait a second". Like more and more people need this. Airflow, for background, started at Airbnb and they were actually using that as the foundation for their whole data stack. 
Kind of how they made it so that they could give you recommendations and predictions and all of the processes that need to be or needed to be orchestrated. Airbnb created Airflow, gave it away to the public and then, you know, fast forward a couple years and you know, we're building a company around it and we're really excited about that. >> That's a beautiful thing. That's exactly why open source is so great. >> Yeah, yeah. And for us it's really been about like watching the community and our customers take these problems, find solution to those problems, build standardized solutions, and then building on top of that, right? So we're reaching to a point where a lot of our earlier customers who started to just using Airflow to get the base of their BI stack down and their reporting and their ELP infrastructure, you know, they've solved that problem and now they're moving onto things like doing machine learning with their data, right? Because now that they've built that foundation, all the connective tissue for their data arriving on time and being orchestrated correctly is happening, they can build the layer on top of that. And it's just been really, really exciting kind of watching what customers do once they're empowered to pick all the tools that they need, tie them together in the way they need to, and really deliver real value to their business. >> Can you share some of the use cases of these customers? Because I think that's where you're starting to see the innovation. What are some of the companies that you're working with, what are they doing? >> Raj, I'll let you take that one too. (all laughing) >> Yeah. (all laughing) So you know, a lot of it is, it goes across the gamut, right? Because all doesn't matter what you are, what you're doing with data, it needs to be orchestrated. So there's a lot of customers using us for their ETL and ELT reporting, right? Just getting data from all the disparate sources into one place and then building on top of that, be it building dashboards, answering questions for the business, building other data products and so on and so forth. From there, these use cases evolve a lot. You do see folks doing things like fraud detection because Airflow's orchestrating how transactions go. Transactions get analyzed, they do things like analyzing marketing spend to see where your highest ROI is. And then, you know, you kind of can't not talk about all of the machine learning that goes on, right? Where customers are taking data about their own customers kind of analyze and aggregating that at scale and trying to automate decision making processes. So it goes from your most basic, what we call like data plumbing, right? Just to make sure data's moving as needed. All the ways to your more exciting and sexy use cases around like automated decision making and machine learning. >> And I'd say, I mean, I'd say that's one of the things that I think gets me most excited about our future is how critical Airflow is to all of those processes, you know? And I think when, you know, you know a tool is valuable is when something goes wrong and one of those critical processes doesn't work. And we know that our system is so mission critical to answering basic, you know, questions about your business and the growth of your company for so many organizations that we work with. 
So it's, I think one of the things that gets Viraj and I, and the rest of our company up every single morning, is knowing how important the work that we do for all of those use cases across industries, across company sizes. And it's really quite energizing. >> It was such a big focus this year at AWS re:Invent, the role of data. And I think one of the things that's exciting about the open AI and all the movement towards large language models, is that you can integrate data into these models, right? From outside, right? So you're starting to see the integration easier to deal with, still a lot of plumbing issues. So a lot of things happening. So I have to ask you guys, what is the state of the data orchestration area? Is it ready for disruption? Is it already been disrupted? Would you categorize it as a new first inning kind of opportunity or what's the state of the data orchestration area right now? Both, you know, technically and from a business model standpoint, how would you guys describe that state of the market? >> Yeah, I mean I think, I think in a lot of ways we're, in some ways I think we're categoric rating, you know, schedulers have been around for a long time. I recently did a presentation sort of on the evolution of going from, you know, something like KRON, which I think was built in like the 1970s out of Carnegie Mellon. And you know, that's a long time ago. That's 50 years ago. So it's sort of like the basic need to schedule and do something with your data on a schedule is not a new concept. But to our point earlier, I think everything that you need around your ecosystem, first of all, the number of data tools and developer tooling that has come out the industry has, you know, has some 5X over the last 10 years. And so obviously as that ecosystem grows and grows and grows and grows, the need for orchestration only increases. And I think, you know, as Astronomer, I think we, and there's, we work with so many different types of companies, companies that have been around for 50 years and companies that got started, you know, not even 12 months ago. And so I think for us, it's trying to always category create and adjust sort of what we sell and the value that we can provide for companies all across that journey. There are folks who are just getting started with orchestration and then there's folks who have such advanced use case 'cuz they're hitting sort of a ceiling and only want to go up from there. And so I think we as a company, care about both ends of that spectrum and certainly have want to build and continue building products for companies of all sorts, regardless of where they are on the maturity curve of data orchestration. >> That's a really good point Paola. And I think the other thing to really take into account is it's the companies themselves, but also individuals who have to do their jobs. You know, if you rewind the clock like five or 10 years ago, data engineers would be the ones responsible for orchestrating data through their org. But when we look at our customers today, it's not just data engineers anymore. There's data analysts who sit a lot closer to the business and the data scientists who want to automate things around their models. So this idea that orchestration is this new category is spot on, is right on the money. And what we're finding is it's spreading, the need for it, is spreading to all parts of the data team naturally where Airflows have emerged as an open source standard and we're hoping to take things to the next level. >> That's awesome. 
You know, we've been up saying that the data market's kind of like the SRE with servers, right? You're going to need one person to deal with a lot of data and that's data engineering and then you're going to have the practitioners, the democratization. Clearly that's coming in what you're seeing. So I got to ask, how do you guys fit in from a value proposition standpoint? What's the pitch that you have to customers or is it more inbound coming into you guys? Are you guys doing a lot of outreach, customer engagements? I'm sure they're getting a lot of great requirements from customers. What's the current value proposition? How do you guys engage? >> Yeah, I mean we've, there's so many, there's so many. Sorry Raj, you can jump in. - >> It's okay. So there's so many companies using Airflow, right? So our, the baseline is that the open source project that is Airflow that was, that came out of Airbnb, you know, over five years ago at this point, has grown exponentially in users and continues to grow. And so the folks that we sell to primarily are folks who are already committed to using Apache Airflow, need data orchestration in the organization and just want to do it better, want to do it more efficiently, want to do it without managing that infrastructure. And so our baseline proposition is for those organizations. Now to Raj's point, obviously I think our ambitions go beyond that, both in terms of the personas that we addressed and going beyond that data engineer, but really it's for, to start at the baseline. You know, as we continue to grow our company, it's really making sure that we're adding value to folks using Airflow and help them do so in a better way, in a larger way and a more efficient way. And that's really the crux of who we sell to. And so to answer your question on, we actually, we get a lot of inbound because they're are so many - >> A built-in audience. >> In the world that use it, that those are the folks who we talk to and come to our website and chat with us and get value from our content. I mean the power of the open source community is really just so, so big. And I think that's also one of the things that makes this job fun, so. >> And you guys are in a great position, Viraj, you can comment, to get your reaction. There's been a big successful business model to starting a company around these big projects for a lot of reasons. One is open source is continuing to be great, but there's also supply chain challenges in there. There's also, you know, we want to continue more innovation and more code and keeping it free and and flowing. And then there's the commercialization of product-izing it, operationalizing it. This is a huge new dynamic. I mean, in the past, you know, five or so years, 10 years, it's been happening all on CNCF from other areas like Apache, Linux Foundation, they're all implementing this. This is a huge opportunity for entrepreneurs to do this. >> Yeah, yeah. Open source is always going to be core to what we do because, you know, we wouldn't exist without the open source community around us. They are huge in numbers. Oftentimes they're nameless people who are working on making something better in a way that everybody benefits from it. But open source is really hard, especially if you're a company whose core competency is running a business, right? 
Maybe you're running e-commerce business or maybe you're running, I don't know, some sort of like any sort of business, especially if you're a company running a business, you don't really want to spend your time figuring out how to run open source software. You just want to use it, you want to use the best of it, you want to use the community around it. You want to take, you want to be able to google something and get answers for it. You want the benefits of open source. You don't want to have, you don't have the time or the resources to invest in becoming an expert in open source, right? And I think that dynamic is really what's given companies like us an ability to kind of form businesses around that, in the sense that we'll make it so people get the best of both worlds. You'll get this vast open ecosystem that you can build on top of, you can benefit from, that you can learn from, but you won't have to spend your time doing undifferentiated heavy lifting. You can do things that are just specific to your business. >> It's always been great to see that business model evolved. We used to debate 10 years ago, can there be another red hat? And we said, not really the same, but there'll be a lot of little ones that'll grow up to be big soon. Great stuff. Final question, can you guys share the history of the company, the milestones of the Astronomer's journey in data orchestration? >> Yeah, we could. So yeah, I mean, I think, so Raj and I have obviously been at astronomer along with our other founding team and leadership folks, for over five years now. And it's been such an incredible journey of learning, of hiring really amazing people. Solving again, mission critical problems for so many types of organizations. You know, we've had some funding that has allowed us to invest in the team that we have and in the software that we have. And that's been really phenomenal. And so that investment, I think, keeps us confident even despite these sort of macroeconomic conditions that we're finding ourselves in. And so honestly, the milestones for us are focusing on our product, focusing on our customers over the next year, focusing on that market for us, that we know can get value out of what we do. And making developers' lives better and growing the open source community, you know, and making sure that everything that we're doing makes it easier for folks to get started to contribute to the project and to feel a part of the community that we're cultivating here. >> You guys raised a little bit of money. How much have you guys raised? >> I forget what the total is, but it's in the ballpark of 200, over $200 million. So it feels good - >> A little bit of capital. Got a little bit of cash to work with there. Great success. I know it's a Series C financing, you guys been down, so you're up and running. What's next? What are you guys looking to do? What's the big horizon look like for you? And from a vision standpoint, more hiring, more product, what is some of the key things you're looking at doing? >> Yeah, it's really a little of all of the above, right? Like, kind of one of the best and worst things about working at earlier stage startups is there's always so much to do and you often have to just kind of figure out a way to get everything done, but really invest in our product over the next, at least the next, over the course of our company lifetime. And there's a lot of ways we wanting to just make it more accessible to users, easier to get started with, easier to use all kind of on all areas there. 
And really, we really want to do more for the community, right? Like I was saying, we wouldn't be anything without the large open source community around us. And we want to figure out ways to give back more in more creative ways, in more code driven ways and more kind of events and everything else that we can do to keep those folks galvanized and just keeping them happy using Airflow. >> Paola, any final words as we close out? >> No, I mean, I'm super excited. You know, I think we'll keep growing the team this year. We've got a couple of offices in the US which we're excited about, and a fully global team that will only continue to grow. So Viraj and I are both here in New York and we're excited to be engaging with our coworkers in person. Finally, after years of not doing so, we've got a bustling office in San Francisco as well. So growing those teams and continuing to hire all over the world and really focusing on our product and the open source community is where our heads are at this year, so. >> Congratulations. - >> Excited. 200 million in funding plus good runway. Put that money in the bank, squirrel it away. You know, it's good to kind of get some good interest on it, but still grow. Congratulations on all the work you guys do. We appreciate you and the open sourced community does and good luck with the venture. Continue to be successful and we'll see you at the Startup Showcase. >> Thank you. - >> Yeah, thanks so much, John. Appreciate it. - >> It's theCube conversation, featuring astronomer.io, that's the website. Astronomer is doing well. Multiple rounds of funding, over 200 million in funding. Open source continues to lead the way in innovation. Great business model. Good solution for the next gen, Cloud, scale, data operations, data stacks that are emerging. I'm John Furrier, your host. Thanks for watching. (soft music)
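For readers who want a concrete picture of the data orchestration work discussed in this segment, a minimal Apache Airflow pipeline might look like the sketch below. This is an illustration only, assuming Airflow 2.x; the DAG name, schedule, and task bodies are made-up placeholders rather than anything Astronomer ships or described above.

```python
# A minimal Airflow 2.x DAG: extract raw data, transform it, then load it.
# Illustrative sketch only -- dag_id, schedule, and task logic are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Pull raw records from a source system (stubbed out here).
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]


def transform(ti):
    # Read the upstream task's output via XCom and reshape it.
    rows = ti.xcom_pull(task_ids="extract")
    return [{**row, "value_doubled": row["value"] * 2} for row in rows]


def load(ti):
    # A real pipeline would write to a warehouse or lake; print for the sketch.
    print(ti.xcom_pull(task_ids="transform"))


with DAG(
    dag_id="example_data_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```

Managed services in this space generally take a DAG like this and handle the scheduling, scaling, and monitoring infrastructure around it, which is the undifferentiated heavy lifting referred to in the conversation.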
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Viraj Parekh | PERSON | 0.99+ |
Paola | PERSON | 0.99+ |
Viraj | PERSON | 0.99+ |
John Furrier | PERSON | 0.99+ |
John | PERSON | 0.99+ |
Raj | PERSON | 0.99+ |
Airbnb | ORGANIZATION | 0.99+ |
US | LOCATION | 0.99+ |
2017 | DATE | 0.99+ |
New York | LOCATION | 0.99+ |
Paola Peraza Calderon | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Apache | ORGANIZATION | 0.99+ |
San Francisco | LOCATION | 0.99+ |
Palo Alto, California | LOCATION | 0.99+ |
1970s | DATE | 0.99+ |
10 years | QUANTITY | 0.99+ |
five | QUANTITY | 0.99+ |
Two | QUANTITY | 0.99+ |
first question | QUANTITY | 0.99+ |
over 200 million | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
Both | QUANTITY | 0.99+ |
over $200 million | QUANTITY | 0.99+ |
Linux Foundation | ORGANIZATION | 0.99+ |
50 years ago | DATE | 0.99+ |
one | QUANTITY | 0.99+ |
five | DATE | 0.99+ |
iPhone | COMMERCIAL_ITEM | 0.99+ |
this year | DATE | 0.98+ |
One | QUANTITY | 0.98+ |
Airflow | TITLE | 0.98+ |
10 years ago | DATE | 0.98+ |
Carnegie Mellon | ORGANIZATION | 0.98+ |
over five years | QUANTITY | 0.98+ |
200 | QUANTITY | 0.98+ |
12 months ago | DATE | 0.98+ |
both worlds | QUANTITY | 0.98+ |
5X | QUANTITY | 0.98+ |
ChatGPT | ORGANIZATION | 0.98+ |
first | QUANTITY | 0.98+ |
one stack | QUANTITY | 0.97+ |
one person | QUANTITY | 0.97+ |
two things | QUANTITY | 0.97+ |
Fivetran | ORGANIZATION | 0.96+ |
seven | QUANTITY | 0.96+ |
next year | DATE | 0.96+ |
today | DATE | 0.95+ |
50 years | QUANTITY | 0.95+ |
each | QUANTITY | 0.95+ |
theCube | ORGANIZATION | 0.94+ |
HubSpot | ORGANIZATION | 0.93+ |
Sigma | ORGANIZATION | 0.92+ |
Series C | OTHER | 0.92+ |
Astronomer | ORGANIZATION | 0.91+ |
astronomer.io | OTHER | 0.91+ |
Hightouch | TITLE | 0.9+ |
one place | QUANTITY | 0.9+ |
Android | TITLE | 0.88+ |
Startup Showcase | EVENT | 0.88+ |
Apache Airflow | TITLE | 0.86+ |
CNCF | ORGANIZATION | 0.86+ |
Muddu Sudhakar, Aisera | AWS re:Invent 2022
(upbeat music) >> Hey, welcome back everyone, live coverage here. Re:invent 2022. I'm John Furrier, host of theCUBE. Two sets here. We got amazing content flowing. A third set upstairs in the executive briefing area. It's kind of a final review, day three. We got a special guest for do a re:Invent review. Muddu Sudhakar CEO founder of Aisera. Former multi-exit entrepreneur. Kind of a CUBE analyst who's always watching the floor, comes in, reports on our behalf. Thank you, you're seasoned veteran. Good to see you. Thanks for coming. >> Thank you John >> We've only got five minutes. Let's get into it. What's your report? What are you seeing here at re:Invent? What's the most important story? What's happening? What should people pay attention to? >> No, a lot of things. First all, thank you for having me John. But, most important thing what Amazon has announced is AIML. How they're doubling down on AIML. Amazon Connect for Wise. Watch out all the contact center vendors. Third, is in the area of workflow, low-code, no-code, workflow automation. I see these three are three big pillars. And, the fourth is ETL and ELTs. They're offering ETL as included as a part of S3 Redshift. I see those four areas are the big buckets. >> Well, it's not no ETL to S3. It's ETL into S3 or migration. >> That's right. >> Then the other one was Zero ETL Promise. >> Muddu: That's right. >> Which there's a skeptical group out there that think that's not possible. I do. I think ultimately that'll happen, but what's your take? >> I think it's going to happen. So, it's going to happen both within that data store as well as outside the data store, data coming in. I think that area, Amazon is going to slowly encroach into the whole thing will be part offered as a part of Redshift and S3. >> Got it. What else are you seeing? Security. >> Amazon Connect Amazon Connect is a big thing. >> John: Why is that so important? It seems like they already have that. >> They have it, but what they're doing now is to automate AI bots. They want to use AI bot to automate both agent assist, AI assist, and also WiseBot automation. So, all the contact center Wise to text they're doubling down. I think it's a good competition to Microsoft with the Nuance acquisition and what Zoom is doing today. So, I think within Microsoft, Zoom, and Amazon, it's a nice competition there. >> Okay, so we had Adam's keynote, a lot of security and data, that was big. Today, we had Swami, all ML, 13 announcements. Adam did telegraph to me that he was going to to share the love. Jassy would've probably taken most of those announcements, we know that. Adam shared the love. So, Adam, props to you for sharing the love with Swami and some of those announcements. We had 13. So, good for him. >> Yes. >> And then, we had Aruba with the partners. What's your take on the partner network? A revamp? >> No, I think Aruba did a very good job in terms of partners. Look at these, one of the best stores that Amazon does. Even the companies like me, I'm a startup company. They know how to include the partners, drive more revenue with partners, sell through it, more expansion. So, Amazon is still one of the best for startup to mid-market companies to go into enterprise. So, I love their partnership angle. >> One of the things I like that she said that resonated with me 'cause, I've been working with those teams, is it's unified, clear roles, but together. But, scaling the support for partners and making money for partners. >> That's right. >> That is a huge deal. Big road ahead. 
She's focused on it. She says, no problem, we want to scale up the business model of the channel. >> Muddu: That's right. >> The resources, so that the ecosystem can make money and serve customers, or serve customers and make money. >> Muddu: That's right. And I think one thing they're always good at is Marketplace. Now what they're doing is outside of Marketplace with ISVs, co-sell, selling through. I think Amazon really understood that adding the value so that we make money as partners and they make money, incrementally. So, I think Aruba is doing a very good job. I really like it. >> Okay, final question. What's going on with Werner? What do you expect to hear tomorrow on the developer front? Not a lot of developer productivity conversations at this re:Invent. Not a lot of people talking about software supply chain, although Snyk was on theCUBE earlier. Developer productivity. Werner's going to speak to that tomorrow, we think. Or, I don't know. What do you think? >> I think he's going to talk about something called generative AI. The rumor is people are talking about code being written by algorithms now. I think if I'm Werner, I'm going to talk about where the technology is going, where humans will not be writing code. So, I think AI is going to double down with Amazon, more on the generative AI. He's going to talk a lot about that. >> Generative AI is hot. We could have a generative CUBE, no hosts. >> Muddu: Yes, that would be good. >> No code, no host. >> Muddu: I have an answer, John: software. (both laugh) >> We're going to automate everything. Muddu, great to hear from you. Thanks for reporting. Anything else on the ecosystem? Any observations on the ecosystem and their opportunity? >> So, coming from my side, if I had to provide an answer, today we have close to a thousand leads that are good. Most of them are financial, healthcare. Healthcare is still one of the largest ones I saw at this conference. Financials, and then I've started seeing a lot more on the manufacturing side. So, I think supply chain, which was not as big before. I think Amazon is doing a fantastic job with financial, healthcare, and supply chain. >> Where is their blind spot, if you had to point to one? >> I think media and entertainment. Media and entertainment is not that big on Amazon. So, I think we should see a lot more of those. >> Yeah, I think they need to look at that. Any other observations? Hallway conversations that are notable that you would like to share with folks watching? >> I think what needs to happen is with VMware, and Citrix desktop, and endpoint management. That's their blind spot. So far, nobody's really talking about the endpoints. Your workstation, laptop, desktop. Remember, that was big with VMware. Nope, that's not a topic of conversation at all right now. So, I think that area is left behind by Amazon. Somebody needs to go after that white space. >> John: And, the audience here is over 50,000. Big numbers. >> Huge. One of the best shows, right? I mean, after COVID, it's by far the best show I've seen this year. >> All right, if you did a sizzle reel, what would it be? >> Sizzle reel. I think it's going to be a lot more on, as I said, generative AI is the key word to watch. And, more than that, low-code, no-code workflow automation. How do you automate the workflows? Which is where ServiceNow is fairly strong. I think you'll see Amazon and ServiceNow playing in the workflow automation. >> Muddu, thank you so much for coming on theCUBE and sharing. That's a wrap-up for day three here in theCUBE. 
I'm John Furrier, with Dave Vellante, for Lisa Martin, Savannah Peterson, Paul Gillan, John Walls and the whole team. Thanks for all your support. Wrapping it up at the end of the day. Pulling the plug. We'll see you tomorrow. (upbeat music)
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Adam | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
John | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Lisa Martin | PERSON | 0.99+ |
John Walls | PERSON | 0.99+ |
Muddu | PERSON | 0.99+ |
Savannah Peterson | PERSON | 0.99+ |
Jassy | PERSON | 0.99+ |
John Furrier | PERSON | 0.99+ |
Werner | PERSON | 0.99+ |
Paul Gillan | PERSON | 0.99+ |
five minutes | QUANTITY | 0.99+ |
Zoom | ORGANIZATION | 0.99+ |
Swami | PERSON | 0.99+ |
Muddu Sudhakar | PERSON | 0.99+ |
tomorrow | DATE | 0.99+ |
Today | DATE | 0.99+ |
Aisera | ORGANIZATION | 0.99+ |
13 | QUANTITY | 0.99+ |
Third | QUANTITY | 0.99+ |
First | QUANTITY | 0.99+ |
One | QUANTITY | 0.99+ |
three | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
fourth | QUANTITY | 0.99+ |
over 50,000 | QUANTITY | 0.99+ |
13 announcements | QUANTITY | 0.98+ |
AWS | ORGANIZATION | 0.98+ |
one | QUANTITY | 0.98+ |
today | DATE | 0.98+ |
Aisera | PERSON | 0.98+ |
Two sets | QUANTITY | 0.98+ |
John Software | PERSON | 0.97+ |
Nuance | ORGANIZATION | 0.97+ |
this year | DATE | 0.96+ |
Aruba | ORGANIZATION | 0.96+ |
day three | QUANTITY | 0.96+ |
S3 | TITLE | 0.94+ |
four areas | QUANTITY | 0.92+ |
day three | QUANTITY | 0.92+ |
one thing | QUANTITY | 0.91+ |
AIML | TITLE | 0.9+ |
Wise | ORGANIZATION | 0.88+ |
VMware | ORGANIZATION | 0.88+ |
S3 Redshift | TITLE | 0.85+ |
third set | QUANTITY | 0.84+ |
three big pillars | QUANTITY | 0.82+ |
Redshift | TITLE | 0.8+ |
thousand leads | QUANTITY | 0.78+ |
ServiceNow | TITLE | 0.77+ |
theCUBE | ORGANIZATION | 0.76+ |
CUBE | ORGANIZATION | 0.76+ |
Citrix | ORGANIZATION | 0.75+ |
WiseBot | TITLE | 0.75+ |
Hoshang Chenoy, Meraki & Matthew Scullion, Matillion | AWS re:Invent 2022
(upbeat music) >> Welcome back to Vegas. It's theCUBE live at AWS re:Invent 2022. We're hearing up to 50,000 people here. It feels like if the energy at this show is palpable. I love that. Lisa Martin here with Dave Vellante. Dave, we had the keynote this morning that Adam Selipsky delivered lots of momentum in his first year. One of the things that you said that you were looking in your breaking analysis that was released a few days ago, four trends and one of them, he said under Selipsky's rule in the 2020s, there's going to be a rush of data that will dwarf anything we have ever seen. >> Yeah, it was at least a quarter, maybe a third of his keynote this morning was all about data and the theme is simplifying data and doing better data integration, integrating across different data platforms. And we're excited to talk about that. Always want to simplify data. It's like the rush of data is so fast. It's hard for us to keep up. >> It is hard to keep that up. We're going to be talking with an alumni next about how his company is helping organizations like Cisco Meraki keep up with that data explosion. Please welcome back to the program, Matthew Scullion, the CEO of Matillion and how Hoshang Chenoy joins us, data scientist at Cisco Meraki. Guys, great to have you on the program. >> Thank you. >> Thank you for having us. >> So Matthew, we last saw you just a few months ago in Vegas at Snowflake Summits. >> Matthew: We only meet in Vegas. >> I guess we do, that's okay. Talk to us about some of the things, I know that Matillion is a data transformation solution that was originally introduced for AWS for Redshift. But talk to us about Matillion. What's gone on since we've seen you last? >> Well, I mean it's not that long ago but actually quite a lot. And it's all to do with exactly what you guys were just talking about there. This almost hard to comprehend way the world is changing with the amounts of data that we now can and need to put to work. And our worldview is there's no shortage of data but the choke points certainly one of the choke points. Maybe the choke point is our ability to make that data useful, to make it business ready. And we always talk about the end use cases. We talk about the dashboard or the AI model or the data science algorithm. But until before we can do any of that fun stuff, we have to refine raw data into business ready, usable data. And that's what Matillion is all about. And so since we last met, we've made a couple of really important announcements and possibly at the top of the list is what we call the data productivity cloud. And it's really squarely addressed this problem. It's the results of many years of work, really the apex of many years of the outsize engineering investment, Matillion loves to make. And the Data Productivity Cloud is all about helping organizations like Cisco Meraki and hundreds of others enterprise organizations around the world, get their data business ready, faster. >> Hoshang talk to us a little bit about what's going on at Cisco Meraki, how you're leveraging Matillion from a productivity standpoint. >> I've really been a Matillion fan for a while, actually even before Cisco Meraki at my previous company, LiveRamp. And you know, we brought Matillion to LiveRamp because you know, to Matthew's point, there is a stage in every data growth as I want to call it, where you have different companies at different stages. But to get data, data ready, you really need a platform like Matillion because it makes it really easy. 
So you have to understand Matillion, I think it's designed for someone that uses a lot of code but also someone that uses no code because the UI is so good. Someone like a marketer who doesn't really understand what's going on with that data but wants to be a data driven marketer when they look at the UI they immediately get it. They're just like, oh, I get what's happening with my data. And so that's the brilliance of Matillion and to get data to that data ready part, Matillion does a really, really good job because what we've been able to do is blend so many different data sources. So there is an abundance of data. Data is siloed though. And the connectivity between different data is getting harder and harder. And so here comes the Matillion with it's really simple solution, easy to use platform, powerful and we get to use all of that. So to really change the way we've thought about our analytics, the way we've progressed our division, yeah. >> You're always asking about superpowers and that is a superpower of Matillion 'cause you know, low-code, no-code sounds great but it only gets you a quarter of the way there, maybe 50% of the way there. You're kind of an "and" not an "or." >> That's a hundred percent right. And so I mentioned the Data Productivity Cloud earlier which is the name of this platform of technology we provide. That's all to do with making data business ready. And so I think one of the things we've seen in this industry over the past few years is a kind of extreme decomposition in terms of vendors of making data business ready. You've got vendors that just do loading, you've got vendors that just do a bit of data transformation, you've got vendors that do data ops and orchestration, you've got vendors that do reverse ETL. And so with the data productivity platform, you've got all of that. And particularly in this kind of, macroeconomic heavy weather that we're now starting to face, I think companies are looking for that. It's like, I don't want to buy five things, five sets of skills, five expensive licenses. I want one platform that can do it. But to your point David, it's the and not the or. We talk about the Data Productivity Cloud, the DPC, as being everyone ready. And what we mean by that is if you are the tech savvy marketer who wants to get a particular insight and you understand what a Rowan economy is, but you're not necessarily a hardcore super geeky data engineer then you can visual low-code, no-code, your data to a point where it's business ready. You can do that really quick. It's easy to understand, it's faster to ramp people onto those projects cause it like explains itself, faster to hand it over cause it's self-documenting. But, they'll always be individuals, teams, "and", "or" use cases that want to high-code as well. Maybe you want to code in SQL or Python, increasingly of course in DBT and you can do that on top of the Data Productivity Cloud as well. So you're not having to make a choice, but is that right? >> So one of the things that Matillion really delivers is speed to insight. I've always said that, you know, when you want to be business ready you want to make fast decisions, you want to act on data quickly, Matillion allows you to, this feed to insight is just unbelievably fast because you blend all of these different data sources, you can find the deficiencies in your process, you fix that and you can quickly turn things around and I don't think there's any other platform that I've ever used that has that ability. 
So the speed to insight is so tremendous with Matillion. >> The thing I always assume going on in our customers teams, like you run Hoshang is that the visual metaphor, be it around the orchestration and data ops jobs, be it around the transformation. I hope it makes it easier for teams not only to build it in the first place, but to live with it, right? To hand it over to other people and all that good stuff. Is that true? >> Let me highlight that a little bit more and better for you. So, say for example, if you don't have a platform like Matillion, you don't really have a central repository. >> Yeah. >> Where all of your codes meet, you could have a get repository, you could do all of those things. But, for example, for definitions, business definitions, any of those kind of things, you don't want it to live in just a spreadsheet. You want it to have a central platform where everybody can go in, there's detailed notes, copious notes that you can make on Matillion and people know exactly which flow to go to and be part of, and so I kind of think that that's really, really important because that's really helped us in a big, big way. 'Cause when I first got there, you know, you were pulling code from different scripts and things and you were trying to piece everything together. But when you have a platform like Matillion and you actually see it seamlessly across, it's just so phenomenal. >> So, I want to pick up on something Matthew said about, consolidating platforms and vendors because we have some data from PTR, one of our survey partners and they went out, every quarter they do surveys and they asked the customers that were going to decrease their spending in the quarter, "How are you going to do it?" And number one, by far, like, over a third said, "We're going to consolidate redundant vendors." Way ahead of cloud, we going to optimize cloud resource that was next at like 15%. So, confirms what you were saying and you're hearing that a lot. Will you wait? And I think we never get rid of stuff, we talk about it all the time. We call it GRS, get rid of stuff. Were you able to consolidate or at least minimize your expense around? >> Hoshang: Yeah, absolutely. >> What we were able to do is identify different parts of our tech stack that were just either deficient or duplicate, you know, so they're just like, we don't want any duplicate efforts, we just want to be able to have like, a single platform that does things, does things well and Matillion helped us identify all of those different and how do we choose the right tech stack. It's also about like Matillion is so easy to integrate with any tech stack, you know, it's just they have a generic API tool that you can log into anything besides all of the components that are already there. So it's a great platform to help you do that. >> And the three things we always say about the Data Productivity Cloud, everyone ready, we spoke about this is whether low-code, no-code, quasi-technical, quasi-business person using it, through to a high-end data engineer. You're going to feel at home on the DPC. The second one, which Hoshang was just alluding to there is stack ready, right? So it is built for AWS, built for Snowflake, built for Redshift, pure tight integration, push down ELT better than you could write yourself by hand. And then the final one is future ready, which is this idea that you can start now super easy. And we buy software quickly nowadays, right? We spin it up, we try it out and before we know it, the whole organization is using it. 
And so the future ready talks about that continuum of being able to launch in five minutes, learn it in five hours, deliver your first project in five days and yet still be happy that it's an enterprise scalable platform, five years down track including integrating with all the different things. So Matillion's job holding up the end of the bargain that Hoshang was just talking about there is to ensure we keep putting the features integrations and support into the Data Productivity Cloud to make sure that Hoshang's team can continue to live inside it and do all the things they need to do. >> Hoshang, you talked about the speed to insight being tremendously fast, but if I'm looking at Cisco Meraki from a high level business outcome perspective, what are some of those outcomes that a Matillion is helping Cisco Meraki to achieve. >> So I can just talk in general, not giving you like any specific numbers or anything, but for example, we were trying to understand how well our small and medium business campaigns were doing and we had to actually pull in data from multiple different sources. So not just, our instances of Marketo and Salesforce, we had to look at our internal databases. So Matillion helped us blend all of that together. Once I had all of that data blended, it was then ready to be analyzed. And once we had that analysis done, we were able to confirm that our SMB campaigns were doing well but these the things that we need to do to improve them. When we did that and all of that happened so quickly because they were like, well you need to get data from here, you need to get data from there. And we're like, great, we'll just plug, plug, plug. We put it all together, build transformations and you know we produced this insight and then we were able to reform, refine, and keep getting better and better at it. And you know, we had a 40X return on SMB campaigns. It's unbelievable. >> And there's the revenue tie in right there. >> Hoshang: Yeah. >> Matthew, I know you've been super busy, tons of meetings, you didn't get to see the whole keynote, but one of the themes of Adam Selipsky's keynote was, you know, the three letter word of ETL, they laid out a vision of zero ETL and then they announced zero ETL for Aurora and Redshift. And you think about ETL, I remember the days they said, "Okay, we're going to do ELT." Which is like, raising the debt ceiling, we're just going to kick the can down the road. So, what do you think about that vision? You know, how does it relate to what you guys are doing? >> So there was a, I don't know if this only works in the UK or it works globally. It was a good line many years ago. Rumors of my death are premature or so I think it was an obituary had gone out in the times by accident and that's how the guy responded to it. Something like that. It's a little bit like that. The announcement earlier within the AWS space of zero ETL between platforms like Aurora and Redshift and perhaps more over time is really about data movement, right? So it's about do I need to do a load of high cost in terms of coding and compute, movement of data between one platform, another. At Matillion, we've always seen data movement as an enabling technology, which gets you to the value add of transformation. My favorite metaphor to bring this to life is one of iron. So the world's made of iron, right? The world is literally made of iron ore but iron ore isn't useful until you turn it to steel. Loading data is digging out iron ore from the ground and moving it to the refinery. 
Transformation of data is turning iron ore into steel and what the announcements you saw earlier from AWS are more about the quarry to the factory bit than they are about the iron ore to the steel bit. And so, I think it's great that platforms are making it easier to move data between them, but it doesn't change the need for Hoshang's business professionals to refine that data into something useful to drive their marketing campaigns. >> Exactly, it's quarry to the factory and a very Snowflake like in a way, right? You make it easy to get in. >> It's like, don't get me wrong, I'm great to see investment going into the Redshift business and the AWS data analytics stack. We do a lot of business there. But yes, this stuff is also there on Snowflake, already. >> I mean come on, we've seen this for years. You know, I know there's a big love fest between Snowflake and AWS 'cause they're selling so much business in the field. But look that we saw it separating computing from storage, then AWS does it and now, you know, why not? It's good sense. That's what customers want. The customer obsessed data sharing is another thing. >> And if you take data sharing as an example from our friends at Snowflake, when that was announced a few people possibly, yourselves, said, "Oh, Matthew what do you think about this? You're in the data movement business." And I was like, "Ah, I'm not really actually, some of my competitors are in the data movement business. I have data movement as part of my platform. We don't charge directly for it. It's just part of the platform." And really what it's to do is to get the data into a place where you can do the fun stuff with it of refining into steel. And so if Snowflake or now AWS and the Redshift group are making that easier that's just faster to fun for me really. >> Yeah, sure. >> Last question, a question for both of you. If you had, you have a brand new shiny car, you got a bumper sticker that you want to put on that car to tell everyone about Matillion, everyone about Cisco Meraki, what does that bumper sticker say? >> So for Matillion, it says Matillion is the Data Productivity Cloud. We help you make your data business ready, faster. And then for a joke I'd write, "Which you are going to need in the face of this tsunami of data." So that's what mine would say. >> Love it. Hoshang, what would you say? >> I would say that Cisco makes some of the best products for IT professionals. And I don't think you can, really do the things you do in IT without any Cisco product. Really phenomenal products. And, we've gone so much beyond just the IT realm. So you know, it's been phenomenal. >> Awesome. Guys, it's been a pleasure having you back on the program. Congrats to you now Hoshang, an alumni of theCUBE. >> Thank you. >> But thank you for talking to us, Matthew, about what's going on with Matillion so much since we've seen you last. I can imagine how much worse going to go on until we see you again. But we appreciate, especially having the Cisco Meraki customer example that really articulates the value of data for everyone. We appreciate your insights and we appreciate your time. >> Thank you. >> Privilege to be here. Thanks for having us. >> Thank you. >> Pleasure. For our guests and Dave Vellante, I'm Lisa Martin. You're watching theCUBE, the leader in live enterprise and emerging tech coverage.
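To make the loading-versus-transformation distinction from this conversation concrete: in a push-down ELT pattern, raw data is landed in the cloud warehouse first, and the transformation is expressed as SQL that the warehouse itself executes on its own elastic compute. The sketch below shows that pattern with Snowflake's Python connector; it is not Matillion's API, and the credentials, table names, and SQL are placeholder assumptions.

```python
# Illustrative push-down ELT: the transform runs inside the warehouse as SQL,
# rather than pulling rows out to an external ETL server.
# Not Matillion's API -- account, schema, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder credentials
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
    database="MARKETING",
    schema="RAW",
)

try:
    cur = conn.cursor()

    # "E" and "L": assume a loader job has already landed raw campaign and
    # touchpoint data in the MARKETING.RAW schema.

    # "T": push the transformation down to the warehouse. The join and
    # aggregation run on Snowflake compute; this client only ships SQL.
    cur.execute("""
        CREATE OR REPLACE TABLE MARKETING.READY.SMB_CAMPAIGN_SUMMARY AS
        SELECT c.campaign_id,
               c.region,
               SUM(t.revenue)               AS total_revenue,
               COUNT(DISTINCT t.account_id) AS accounts_touched
        FROM MARKETING.RAW.SMB_CAMPAIGNS c
        JOIN MARKETING.RAW.TOUCHPOINTS t
          ON t.campaign_id = c.campaign_id
        GROUP BY c.campaign_id, c.region
    """)
finally:
    conn.close()
```

In the iron-ore framing used above, the load step is the trip from the quarry to the factory; the SQL is the refining into steel.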
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave Vellante | PERSON | 0.99+ |
Matthew | PERSON | 0.99+ |
Lisa Martin | PERSON | 0.99+ |
David | PERSON | 0.99+ |
Matthew Scullion | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Adam Selipsky | PERSON | 0.99+ |
Vegas | LOCATION | 0.99+ |
Cisco | ORGANIZATION | 0.99+ |
Hoshang | PERSON | 0.99+ |
50% | QUANTITY | 0.99+ |
five days | QUANTITY | 0.99+ |
UK | LOCATION | 0.99+ |
five hours | QUANTITY | 0.99+ |
five minutes | QUANTITY | 0.99+ |
Selipsky | PERSON | 0.99+ |
Matillion | ORGANIZATION | 0.99+ |
2020s | DATE | 0.99+ |
Hoshang Chenoy | PERSON | 0.99+ |
40X | QUANTITY | 0.99+ |
15% | QUANTITY | 0.99+ |
first project | QUANTITY | 0.99+ |
Cisco Meraki | ORGANIZATION | 0.99+ |
Aurora | ORGANIZATION | 0.99+ |
five sets | QUANTITY | 0.99+ |
Python | TITLE | 0.99+ |
one | QUANTITY | 0.99+ |
Meraki | PERSON | 0.99+ |
one platform | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
One | QUANTITY | 0.99+ |
SQL | TITLE | 0.99+ |
second one | QUANTITY | 0.98+ |
five years | QUANTITY | 0.98+ |
five expensive licenses | QUANTITY | 0.98+ |
first year | QUANTITY | 0.98+ |
PTR | ORGANIZATION | 0.98+ |
LiveRamp | ORGANIZATION | 0.97+ |
Snowflake | TITLE | 0.97+ |
three things | QUANTITY | 0.97+ |
hundred percent | QUANTITY | 0.96+ |
Matillion | PERSON | 0.96+ |
zero | QUANTITY | 0.95+ |
Redshift | TITLE | 0.95+ |
over a third | QUANTITY | 0.94+ |
Accelerating Automated Analytics in the Cloud with Alteryx
>>Alteryx is a company with a long history that goes all the way back to the late 1990s. Now, the one consistent theme over 20-plus years has been that Alteryx has always been a data company. Early in the big data and Hadoop cycle, it saw the need to combine and prep different data types so that organizations could analyze data and take action. Alteryx and similar companies played a critical role in helping companies become data-driven. The problem was the decade of big data brought a lot of complexities and required immense skills just to get the technology to work as advertised. This in turn limited the pace of adoption and the number of companies that could really lean in and take advantage. The cloud began to change all that and set the foundation for today's theme, the era of digital transformation. We hear that phrase a ton, digital transformation. >>People used to think it was a buzzword, but of course we learned from the pandemic that if you're not a digital business, you're out of business, and a key tenet of digital transformation is democratizing data, meaning enabling not just hyper-specialized experts but anyone, business users included, to put data to work. Now back to Alteryx. The company has embarked on a major transformation of its own. Over the past couple of years it brought in new management, changed the way in which it engages with customers with a new subscription model, and top-graded its talent pool. 2021 was even more significant because of two acquisitions that Alteryx made: Hyper Anna and Trifacta. Why are these acquisitions important? Well, traditionally Alteryx sold to business analysts that were part of the data pipeline. These were fairly technical people who had certain skills and were trained in things like writing Python code. With Hyper Anna, Alteryx has added a new persona, the business user: anyone in the business who wants to gain insights from data or, let's say, use AI without having to be a deep technical expert. >>And then Trifacta, a company started in the early days of big data by CUBE alum Joe Hellerstein and his colleagues at Berkeley, knocks down the data engineering persona and gives Alteryx a complementary extension into IT, where things like governance and security are paramount. So as we enter 2022, the post-isolation economy is here, and we do so with a digital foundation built on the confluence of cloud native technologies, data democratization and machine intelligence, or AI if you prefer. And Alteryx is entering that new era with an expanded portfolio, new go-to-market vectors, a recurring revenue business model, and a brand new outlook on how to solve customer problems and scale a company. My name is Dave Vellante with theCUBE and I'll be your host today. Over the next hour, we're going to explore the opportunities in this new data market. And we have three segments where we dig into these trends and themes. First we'll talk to Jay Henderson, vice president of product management at Alteryx, about cloud acceleration and simplifying complex data operations. Then we'll bring in Suresh Vittal, who's the chief product officer at Alteryx, and Adam Wilson, the CEO of Trifacta, which of course is now part of Alteryx. And finally, we'll hear about how Alteryx is partnering with Snowflake and the ecosystem, how they're integrating with data platforms like Snowflake, and what this means for customers. And we may have a few surprises sprinkled into the conversation as well. Let's get started. 
>>We're kicking off the program with our first segment. Jay Henderson is the vice president of product management Altryx and we're going to talk about the trends and data, where we came from, how we got here, where we're going. We get some launch news. Well, Jay, welcome to the cube. >>Great to be here, really excited to share some of the things we're working on. >>Yeah. Thank you. So look, you have a deep product background, product management, product marketing, you've done strategy work. You've been around software and data, your entire career, and we're seeing the collision of software data cloud machine intelligence. Let's start with the customer and maybe we can work back from there. So if you're an analytics or data executive in an organization, w J what's your north star, where are you trying to take your company from a data and analytics point of view? >>Yeah, I mean, you know, look, I think all organizations are really struggling to get insights out of their data. I think one of the things that we see is you've got digital exhaust, creating large volumes of data storage is really cheap, so it doesn't cost them much to keep it. And that results in a situation where the organization's, you know, drowning in data, but somehow still starving for insights. And so I think, uh, you know, when I talk to customers, they're really excited to figure out how they can put analytics in the hands of every single person in their organization, and really start to democratize the analytics, um, and, you know, let the, the business users and the whole organization get value out of all that data they have. >>And we're going to dig into that throughout this program data, I like to say is plentiful insights, not always so much. Tell us about your launch today, Jay, and thinking about the trends that you just highlighted, the direction that your customers want to go and the problems that you're solving, what role does the cloud play in? What is what you're launching? How does that fit in? >>Yeah, we're, we're really excited today. We're launching the Altryx analytics cloud. That's really a portfolio of cloud-based solutions that have all been built from the ground up to be cloud native, um, and to take advantage of things like based access. So that it's really easy to give anyone access, including folks on a Mac. Um, it, you know, it also lets you take advantage of elastic compute so that you can do, you know, in database processing and cloud native, um, solutions that are gonna scale to solve the most complex problems. So we've got a portfolio of solutions, things like designer cloud, which is our flagship designer product in a browser and on the cloud, but we've got ultra to machine learning, which helps up-skill regular old analysts with advanced machine learning capabilities. We've got auto insights, which brings a business users into the fold and automatically unearths insights using AI and machine learning. And we've got our latest edition, which is Trifacta that helps data engineers do data pipelining and really, um, you know, create a lot of the underlying data sets that are used in some of this, uh, downstream analytics. >>Let's dig into some of those roles if we could a little bit, I mean, you've traditionally Altryx has served the business analysts and that's what designer cloud is fit for, I believe. And you've explained, you know, kind of the scope, sorry, you've expanded that scope into the, to the business user with hyper Anna. 
And we're in a moment we're going to talk to Adam Wilson and Suresh, uh, about Trifacta and that recent acquisition takes you, as you said, into the data engineering space in it. But in thinking about the business analyst role, what's unique about designer cloud cloud, and how does it help these individuals? >>Yeah, I mean, you know, really, I go back to some of the feedback we've had from our customers, which is, um, you know, they oftentimes have dozens or hundreds of seats of our designer desktop product, you know, really, as they look to take the next step, they're trying to figure out how do I give access to that? Those types of analytics to thousands of people within the organization and designer cloud is, is really great for that. You've got the browser-based interface. So if folks are on a Mac, they can really easily just pop, open the browser and get access to all of those, uh, prep and blend capabilities to a lot of the analysis we're doing. Um, it's a great way to scale up access to the analytics and then start to put it in the hands of really anyone in the organization, not just those highly skilled power users. >>Okay, great. So now then you add in the hyper Anna acquisition. So now you're targeting the business user Trifacta comes into the mix that deeper it angle that we talked about, how does this all fit together? How should we be thinking about the new Altryx portfolio? >>Yeah, I mean, I think it's pretty exciting. Um, you know, when you think about democratizing analytics and providing access to all these different groups of people, um, you've not been able to do it through one platform before. Um, you know, it's not going to be one interface that meets the, of all these different groups within the organization. You really do need purpose built specialized capabilities for each group. And finally, today with the announcement of the alternates analytics cloud, we brought together all of those different capabilities, all of those different interfaces into a single in the end application. So really finally delivering on the promise of providing analytics to all, >>How much of this you've been able to share with your customers and maybe your partners. I mean, I know OD is fairly new, but if you've been able to get any feedback from them, what are they saying about it? >>Uh, I mean, it's, it's pretty amazing. Um, we ran a early access, limited availability program that led us put a lot of this technology in the hands of over 600 customers, um, over the last few months. So we have gotten a lot of feedback. I tell you, um, it's been overwhelmingly positive. I think organizations are really excited to unlock the insights that have been hidden in all this data. They've got, they're excited to be able to use analytics in every decision that they're making so that the decisions they have or more informed and produce better business outcomes. Um, and, and this idea that they're going to move from, you know, dozens to hundreds or thousands of people who have access to these kinds of capabilities, I think has been a really exciting thing that is going to accelerate the transformation that these customers are on. >>Yeah, those are good. Good, good numbers for, for preview mode. Let's, let's talk a little bit about vision. So it's democratizing data is the ultimate goal, which frankly has been elusive for most organizations over time. How's your cloud going to address the challenges of putting data to work across the entire enterprise? 
>>Yeah, I mean, I tend to think about the future and some of the investments we're making in our products and our roadmap across four big themes, you know, in the, and these are really kind of enduring themes that you're going to see us making investments in over the next few years, the first is having cloud centricity. You know, the data gravity has been moving to the cloud. We need to be able to provide access, to be able to ingest and manipulate that data, to be able to write back to it, to provide cloud solution. So the first one is really around cloud centricity. The second is around big data fluency. Once you have all of the data, you need to be able to manipulate it in a performant manner. So having the elastic cloud infrastructure and in database processing is so important, the third is around making AI a strategic advantage. >>So, uh, you know, getting everyone involved and accessing AI and machine learning to unlock those insights, getting it out of the hands of the small group of data scientists, putting it in the hands of analysts and business users. Um, and then the fourth thing is really providing access across the entire organization. You know, it and data engineers, uh, as well as business owners and analysts. So, um, cloud centricity, big data fluency, um, AI is a strategic advantage and, uh, personas across the organization are really the four big themes you're going to see us, uh, working on over the next few months and, uh, coming coming year. >>That's good. Thank you for that. So, so on a related question, how do you see the data organizations evolving? I mean, traditionally you've had, you know, monolithic organizations, uh, very specialized or I might even say hyper specialized roles and, and your, your mission of course is the customer. You, you, you, you and your customers, they want to democratize the data. And so it seems logical that domain leaders are going to take more responsibility for data, life cycles, data ownerships, low code becomes more important. And perhaps this kind of challenges, the historically highly centralized and really specialized roles that I just talked about. How do you see that evolving and, and, and what role will Altryx play? >>Yeah. Um, you know, I think we'll see sort of a more federated systems start to emerge. Those centralized groups are going to continue to exist. Um, but they're going to start to empower, you know, in a much more de-centralized way, the people who are closer to the business problems and have better business understanding. I think that's going to let the centralized highly skilled teams work on, uh, problems that are of higher value to the organization. The kinds of problems where one or 2% lift in the model results in millions of dollars a day for the business. And then by pushing some of the analytics out to, uh, closer to the edge and closer to the business, you'll be able to apply those analytics in every single decision. So I think you're going to see, you know, both the decentralized and centralized models start to work in harmony and a little bit more about almost a federated sort of a way. And I think, you know, the exciting thing for us at Altryx is, you know, we want to facilitate that. We want to give analytic capabilities and solutions to both groups and types of people. We want to help them collaborate better, um, and drive business outcomes with the analytics they're using. >>Yeah. 
I mean, I think my take on another one, if you could comment is to me, the technology should be an operational detail and it has been the, the, the dog that wags the tail, or maybe the other way around, you mentioned digital exhaust before. I mean, essentially it's digital exhaust coming out of operationals systems that then somehow, eventually end up in the hand of the domain users. And I wonder if increasingly we're going to see those domain users, users, those, those line of business experts get more access. That's your goal. And then even go beyond analytics, start to build data products that could be monetized, and that maybe it's going to take a decade to play out, but that is sort of a new era of data. Do you see it that way? >>Absolutely. We're actually making big investments in our products and capabilities to be able to create analytic applications and to enable somebody who's an analyst or business user to create an application on top of the data and analytics layers that they have, um, really to help democratize the analytics, to help prepackage some of the analytics that can drive more insights. So I think that's definitely a trend we're going to see more. >>Yeah. And to your point, if you can federate the governance and automate that, then that can happen. I mean, that's a key part of it, obviously. So, all right, Jay, we have to leave it there up next. We take a deep dive into the Altryx recent acquisition of Trifacta with Adam Wilson who led Trifacta for more than seven years. It's the recipe. Tyler is the chief product officer at Altryx to explain the rationale behind the acquisition and how it's going to impact customers. Keep it right there. You're watching the cube. You're a leader in enterprise tech coverage. >>It's go time, get ready to accelerate your data analytics journey with a unified cloud native platform. That's accessible for everyone on the go from home to office and everywhere in between effortless analytics to help you go from ideas to outcomes and no time. It's your time to shine. It's Altryx analytics cloud time. >>Okay. We're here with. Who's the chief product officer at Altryx and Adam Wilson, the CEO of Trifacta. Now of course, part of Altryx just closed this quarter. Gentlemen. Welcome. >>Great to be here. >>Okay. So let me start with you. In my opening remarks, I talked about Altrix is traditional position serving business analysts and how the hyper Anna acquisition brought you deeper into the business user space. What does Trifacta bring to your portfolio? Why'd you buy the company? >>Yeah. Thank you. Thank you for the question. Um, you know, we see, uh, we see a massive opportunity of helping, um, brands, um, democratize the use of analytics across their business. Um, every knowledge worker, every individual in the company should have access to analytics. It's no longer optional, um, as they navigate their businesses with that in mind, you know, we know designer and are the products that Altrix has been selling the past decade or so do a really great job, um, addressing the business analysts, uh, with, um, hyper Rana now kind of renamed, um, Altrix auto. We even speak with the business owner and the line of business owner. Who's looking for insights that aren't real in traditional dashboards and so on. Um, but we see this opportunity of really helping the data engineering teams and it organizations, um, to also make better use of analytics. Um, and that's where the drive factor comes in for us. 
Um, drive factor has the best data engineering cloud in the planet. Um, they have an established track record of working across multiple cloud platforms and helping data engineers, um, do better data pipelining and work better with, uh, this massive kind of cloud transformation that's happening in every business. Um, and so fact made so much sense for us. >>Yeah. Thank you for that. I mean, you, look, you could have built it yourself would have taken, you know, who knows how long, you know, but, uh, so definitely a great time to market move, Adam. I wonder if we could dig into Trifacta some more, I mean, I remember interviewing Joe Hellerstein in the early days. You've talked about this as well, uh, on the cube coming at the problem of taking data from raw refined to an experience point of view. And Joe in the early days, talked about flipping the model and starting with data visualization, something Jeff, her was expert at. So maybe explain how we got here. We used to have this cumbersome process of ETL and you may be in some others changed that model with ELL and then T explain how Trifacta really changed the data engineering game. >>Yeah, that's exactly right. Uh, David, it's been a really interesting journey for us because I think the original hypothesis coming out of the campus research, uh, at Berkeley and Stanford that really birth Trifacta was, you know, why is it that the people who know the data best can't do the work? You know, why is this become the exclusive purview of the highly technical? And, you know, can we rethink this and make this a user experience, problem powered by machine learning that will take some of the more complicated things that people want to do with data and really help to automate those. So, so a broader set of, of users can, um, can really see for themselves and help themselves. And, and I think that, um, there was a lot of pent up frustration out there because people have been told for, you know, for a decade now to be more data-driven and then the whole time they're saying, well, then give me the data, you know, in the shape that I could use it with the right level of quality and I'm happy to be, but don't tell me to be more data-driven and then, and, and not empower me, um, to, to get in there and to actually start to work with the data in meaningful ways. >>And so, um, that was really, you know, what, you know, the origin story of the company and I think is, as we, um, saw over the course of the last 5, 6, 7 years that, um, you know, uh, real, uh, excitement to embrace this idea of, of trying to think about data engineering differently, trying to democratize the, the ETL process and to also leverage all these exciting new, uh, engines and platforms that are out there that allow for processing, you know, ever more diverse data sets, ever larger data sets and new and interesting ways. And that's where a lot of the push-down or the ELT approaches that, you know, I think it could really won the day. Um, and that, and that for us was a hallmark of the solution from the very beginning. >>Yeah, this is a huge point that you're making is, is first of all, there's a large business, it's probably about a hundred billion dollar Tam. Uh, and the, the point you're making, because we've looked, we've contextualized most of our operational systems, but the big data pipeline is hasn't gotten there. But, and maybe we could talk about that a little bit because democratizing data is Nirvana, but it's been historically very difficult. 
You've got a number of companies it's very fragmented and they're all trying to attack their little piece of the problem to achieve an outcome, but it's been hard. And so what's going to be different about Altryx as you bring these puzzle pieces together, how is this going to impact your customers who would like to take that one? >>Yeah, maybe, maybe I'll take a crack at it. And Adam will, um, add on, um, you know, there hasn't been a single platform for analytics, automation in the enterprise, right? People have relied on, uh, different products, um, to solve kind of, uh, smaller problems, um, across this analytics, automation, data transformation domain. Um, and, um, I think uniquely Alcon's has that opportunity. Uh, we've got 7,000 plus customers who rely on analytics for, um, data management, for analytics, for AI and ML, uh, for transformations, uh, for reporting and visualization for automated insights and so on. Um, and so by bringing drive factor, we have the opportunity to scale this even further and solve for more use cases, expand the scenarios where it's applied and so multiple personas. Um, and we just talked about the data engineers. They are really a growing stakeholder in this transformation of data and analytics. >>Yeah, good. Maybe we can stay on this for a minute cause you, you you're right. You bring it together. Now at least three personas the business analyst, the end user slash business user. And now the data engineer, which is really out of an it role in a lot of companies, and you've used this term, the data engineering cloud, what is that? How is it going to integrate in with, or support these other personas? And, and how's it going to integrate into the broader ecosystem of clouds and cloud data warehouses or any other data stores? >>Yeah, no, that's great. Uh, yeah, I think for us, we really looked at this and said, you know, we want to build an open and interactive cloud platform for data engineers, you know, to collaboratively profile pipeline, um, and prepare data for analysis. And that really meant collaborating with the analysts that were in the line of business. And so this is why a big reason why this combination is so magic because ultimately if we can get the data engineers that are creating the data products together with the analysts that are in the line of business that are driving a lot of the decision making and allow for that, what I would describe as collaborative curation of the data together, so that you're starting to see, um, uh, you know, increasing returns to scale as this, uh, as this rolls out. I just think that is an incredibly powerful combination and, and frankly, something that the market is not crack the code on yet. And so, um, I think when we, when I sat down with Suresh and with mark and the team at Ultrix, that was really part of the, the, the big idea, the big vision that was painted and got us really energized about the acquisition and about the potential of the combination. >>And you're really, you're obviously writing the cloud and the cloud native wave. Um, and, but specifically we're seeing, you know, I almost don't even want to call it a data warehouse anyway, because when you look at what's, for instance, Snowflake's doing, of course their marketing is around the data cloud, but I actually think there's real justification for that because it's not like the traditional data warehouse, right. It's, it's simplified get there fast, don't necessarily have to go through the central organization to share data. 
Uh, and, and, and, but it's really all about simplification, right? Isn't that really what the democratization comes down to. >>Yeah. It's simplification and collaboration. Right. I don't want to, I want to kind of just what Adam said resonates with me deeply. Um, analytics is one of those, um, massive disciplines inside an enterprise that's really had the weakest of tools. Um, and we just have interfaces to collaborate with, and I think truly this was all drinks and a superpower was helping the analysts get more out of their data, get more out of the analytics, like imagine a world where these people are collaborating and sharing insights in real time and sharing workflows and getting access to new data sources, um, understanding data models better, I think, um, uh, curating those insights. I boring Adam's phrase again. Um, I think that creates a real value inside the organization because frankly in scaling analytics and democratizing analytics and data, we're still in such early phases of this journey. >>So how should we think about designer cloud, which is from Altrix it's really been the on-prem and the server desktop offering. And of course Trifacta is with cloud cloud data warehouses. Right. Uh, how, how should we think about those two products? Yeah, >>I think, I think you should think about them. And, uh, um, as, as very complimentary right designer cloud really shares a lot of DNA and heritage with, uh, designer desktop, um, the low code tooling and that interface, uh, the really appeals to the business analysts, um, and gets a lot of the things that they do well, we've also built it with interoperability in mind, right. So if you started building your workflows in designer desktop, you want to share that with design and cloud, we want to make it super easy for you to do that. Um, and I think over time now we're only a week into, um, this Alliance with, um, with, um, Trifacta, um, I think we have to get deeper inside to think about what does the data engineer really need? What's the business analysts really need and how to design a cloud, and Trifacta really support both of those requirements, uh, while kind of continue to build on the trifecta on the amazing Trifacta cloud platform. >>You know, >>I think we're just going to say, I think that's one of the things that, um, you know, creates a lot of, uh, opportunity as we go forward, because ultimately, you know, Trifacta took a platform, uh, first mentality to everything that we built. So thinking about openness and extensibility and, um, and how over time people could build things on top of factor that are a variety of analytic tool chain, or analytic applications. And so, uh, when you think about, um, Ultrix now starting to, uh, to move some of its capabilities or to provide additional capabilities, uh, in the cloud, um, you know, Trifacta becomes a platform that can accelerate, you know, all of that work and create, uh, uh, a cohesive set of, of cloud-based services that, um, share a common platform. And that maintains independence because both companies, um, have been, uh, you know, fiercely independent, uh, and, and really giving people choice. >>Um, so making sure that whether you're, uh, you know, picking one cloud platform and other, whether you're running things on the desktop, uh, whether you're running in hybrid environments, that, um, no matter what your decision, um, you're always in a position to be able to get out your data. 
You're always in a position to be able to cleanse transform shape structure, that data, and ultimately to deliver, uh, the analytics that you need. And so I think in that sense, um, uh, you know, this, this again is another reason why the combination, you know, fits so well together, giving people, um, the choice. Um, and as they, as they think about their analytics strategy and their platform strategy going forward, >>Yeah. I make a chuckle, but one of the reasons I always liked Altrix is cause you kinda did the little end run on it. It can be a blocker sometimes, but that created problems, right? Because the organization said, wow, this big data stuff has taken off, but we need security. We need governance. And it's interesting because you've got, you know, ETL has been complex, whereas the visualization tools, they really, you know, really weren't great at governance and security. It took some time there. So that's not, not their heritage. You're bringing those worlds together. And I'm interested, you guys just had your sales kickoff, you know, what was their reaction like? Uh, maybe Suresh, you could start off and maybe Adam, you could bring us home. >>Um, thanks for asking about our sales kickoff. So we met for the first time and you've got a two years, right. For, as, as it is for many of us, um, in person, uh, um, which I think was a, was a real breakthrough as Qualtrics has been on its transformation journey. Uh, we added a Trifacta to, um, the, the potty such as the tour, um, and getting all of our sales teams and product organizations, um, to meet in person in one location. I thought that was very powerful for other the company. Uh, but then I tell you, um, um, the reception for Trifacta was beyond anything I could have imagined. Uh, we were working out him and I will, when he's so hot on, on the deal and the core hypotheses and so on. And then you step back and you're going to share the vision with the field organization, and it blows you away, the energy that it creates among our sellers out of partners. >>And I'm sure Madam will and his team were mocked, um, every single day, uh, with questions and opportunities to bring them in. But Adam, maybe you should share. Yeah, no, it was, uh, it was through the roof. I mean, uh, uh, the, uh, the amount of energy, the, uh, certainly how welcoming everybody was, uh, uh, you know, just, I think the story makes so much sense together. I think culturally, the company is, are very aligned. Um, and, uh, it was a real, uh, real capstone moment, uh, to be able to complete the acquisition and to, and to close and announced, you know, at the kickoff event. And, um, I think, you know, for us, when we really thought about it, you know, when we ended, the story that we told was just, you have this opportunity to really cater to what the end users care about, which is a lot about interactivity and self-service, and at the same time. >>And that's, and that's a lot of the goodness that, um, that Altryx is, has brought, you know, through, you know, you know, years and years of, of building a very vibrant community of, you know, thousands, hundreds of thousands of users. And on the other side, you know, Trifacta bringing in this data engineering focus, that's really about, uh, the governance things that you mentioned and the openness, um, that, that it cares deeply about. And all of a sudden, now you have a chance to put that together into a complete story where the data engineering cloud and analytics, automation, you know, coming together. 
And, um, and I just think, you know, the lights went on, um, you know, for people instantaneously and, you know, this is a story that, um, that I think the market is really hungry for. And certainly the reception we got from, uh, from the broader team at kickoff was, uh, was a great indication. >>Well, I think the story hangs together really well, you know, one of the better ones I've seen in, in this space, um, and, and you guys coming off a really, really strong quarter. So congratulations on that jets. We have to leave it there. I really appreciate your time today. Yeah. Take a look at this short video. And when we come back, we're going to dig into the ecosystem and the integration into cloud data warehouses and how leading organizations are creating modern data teams and accelerating their digital businesses. You're watching the cube you're leader in enterprise tech coverage. >>This is your data housed neatly insecurely in the snowflake data cloud. And all of it has potential the potential to solve complex business problems, deliver personalized financial offerings, protect supply chains from disruption, cut costs, forecast, grow and innovate. All you need to do is put your data in the hands of the right people and give it an opportunity. Luckily for you. That's the easy part because snowflake works with Alteryx and Alteryx turns data into breakthroughs with just a click. Your organization can automate analytics with drag and drop building blocks, easily access snowflake data with both sequel and no SQL options, share insights, powered by Alteryx data science and push processing to snowflake for lightning, fast performance, you get answers you can put to work in your teams, get repeatable processes they can share in that's exciting because not only is your data no longer sitting around in silos, it's also mobilized for the next opportunity. Turn your data into a breakthrough Alteryx and snowflake >>Okay. We're back here in the queue, focusing on the business promise of the cloud democratizing data, making it accessible and enabling everyone to get value from analytics, insights, and data. We're now moving into the eco systems segment the power of many versus the resources of one. And we're pleased to welcome. Barb Hills camp was the senior vice president partners and alliances at Ultrix and a special guest Terek do week head of technology alliances at snowflake folks. Welcome. Good to see you. >>Thank you. Thanks for having me. Good to see >>Dave. Great to see you guys. So cloud migration, it's one of the hottest topics. It's the top one of the top initiatives of senior technology leaders. We have survey data with our partner ETR it's number two behind security, and just ahead of analytics. So we're hovering around all the hot topics here. Barb, what are you seeing with respect to customer, you know, cloud migration momentum, and how does the Ultrix partner strategy fit? >>Yeah, sure. Partners are central company's strategy. They always have been. We recognize that our partners have deep customer relationships. And when you connect that with their domain expertise, they're really helping customers on their cloud and business transformation journey. We've been helping customers achieve their desired outcomes with our partner community for quite some time. 
And our partner base has been growing an average of 30% year over year. That partner community and strategy now addresses several kinds of partners, spanning solution providers to global SIs and technology partners such as Snowflake, and together we help our customers realize the business promise of their journey to the cloud. Snowflake provides a scalable storage system; Alteryx provides the business-user-friendly front end. So for example, IT departments depend on Snowflake to consolidate data across systems into one data cloud, and with Alteryx, business users can easily unlock that data in Snowflake, solving real business outcomes. Our GSI and solution provider partners are instrumental in providing that end-to-end benefit of a modern analytic stack in the cloud, providing platform guidance, deployment, support, and other professional services. >>Great. Let's get a little bit more into the relationship between Alteryx and Snowflake, the partnership, maybe a little bit about the history. What are the critical aspects that we should really focus on? Barb, maybe you could start, and Tarik can weigh in as well. >>Yeah, so the relationship started in 2020, and Alteryx made a big bet with Snowflake, co-innovating and optimizing cloud use cases together. We are supporting customers who are looking for that modern analytic stack to replace an old one or to implement their first analytic strategy. And our joint customers want to self-serve with data-driven analytics, leveraging all the benefits of the cloud: scalability, accessibility, governance, and optimizing their costs. Alteryx proudly achieved Snowflake's highest Elite tier in their partner program last year, and to do that, we completed a rigorous third-party testing process, which also helped us make some recommended improvements to our joint stack. We wanted customers to have confidence that they would benefit from high quality and performance in their investment with us. Then, to help customers get the most value out of the joint solution, we developed two great assets. One is the Alteryx starter kit for Snowflake, and we coauthored a joint best practices guide. The starter kit contains documentation, business workflows, and videos, helping customers get going more easily with an Alteryx and Snowflake solution. And the best practices guide is more of a technical document, bringing together experiences and guidance on how Alteryx and Snowflake can be deployed together. Internally, we also built a full catalog of enablement resources, right? We wanted to give our account executives more about the value of the Snowflake relationship, how we engage, and some best practices. And now we have hundreds of joint customers, such as Juniper and Sainsbury's, who are actively using our joint solution, solving big business problems much faster. >>Cool. Tarik, can you give us your perspective on the partnership? >>Yeah, definitely, Dave. So as Barb mentioned, we've got this long-standing, very successful partnership going back years, with hundreds of happy joint customers. And when I look at the beginning, Alteryx helped pioneer the concept of self-service analytics, especially with use cases that we worked on together for data prep for BI users, with tools like Tableau. And as Alteryx has evolved from data prep to now becoming a full end-to-end data science platform, it's really opened up a lot more opportunities for our partnership.
Altryx has invested heavily over the last two years in areas of deep integration for customers to fully be able to expand their investment, both technologies. And those investments include things like in database pushed down, right? So customers can, can leverage that elastic platform, that being the snowflake data cloud, uh, with Alteryx orchestrating the end to end machine learning workflows Alteryx also invested heavily in snow park, a feature we released last year around this concept of data programmability. So all users were regardless of their business analysts, regardless of their data, scientists can use their tools of choice in order to consume and get at data. And now with Altryx cloud, we think it's going to open up even more opportunities. It's going to be a big year for the partnership. >>Yeah. So, you know, Terike, we we've covered snowflake pretty extensively and you initially solve what I used to call the, I still call the snake swallowing the basketball problem and cloud data warehouse changed all that because you had virtually infinite resources, but so that's obviously one of the problems that you guys solved early on, but what are some of the common challenges or patterns or trends that you see with snowflake customers and where does Altryx come in? >>Sure. Dave there's there's handful, um, that I can come up with today, the big challenges or trends for us, and Altrix really helps us across all of them. Um, there are three particular ones I'm going to talk about the first one being self-service analytics. If we think about it, every organization is trying to democratize data. Every organization wants to empower all their users, business users, um, you know, the, the technology users, but the business users, right? I think every organization has realized that if everyone has access to data and everyone can do something with data, it's going to make them competitively, give them a competitive advantage with Altrix is something we share that vision of putting that power in the hands of everyday users, regardless of the skillsets. So, um, with self-service analytics, with Ultrix designer they've they started out with self-service analytics as the forefront, and we're just scratching the surface. >>I think there was an analyst, um, report that shows that less than 20% of organizations are truly getting self-service analytics to their end users. Now, with Altryx going to Ultrix cloud, we think that's going to be a huge opportunity for us. Um, and then that opens up the second challenge, which is machine learning and AI, every organization is trying to get predictive analytics into every application that they have in order to be competitive in order to be competitive. Um, and with Altryx creating this platform so they can cater to both the everyday business user, the quote unquote, citizen data scientists, and making a code friendly for data scientists to be able to get at their notebooks and all the different tools that they want to use. Um, they fully integrated in our snow park platform, which I talked about before, so that now we get an end to end solution caring to all, all lines of business. >>And then finally this concept of data marketplaces, right? We, we created snowflake from the ground up to be able to solve the data sharing problem, the big data problem, the data sharing problem. 
And Altryx um, if we look at mobilizing your data, getting access to third-party datasets, to enrich with your own data sets, to enrich with, um, with your suppliers and with your partners, data sets, that's what all customers are trying to do in order to get a more comprehensive 360 view, um, within their, their data applications. And so with Altryx alterations, we're working on third-party data sets and marketplaces for quite some time. Now we're working on how do we integrate what Altrix is providing with the snowflake data marketplace so that we can enrich these workflows, these great, great workflows that Altrix writing provides. Now we can add third party data into that workflow. So that opens up a ton of opportunities, Dave. So those are three I see, uh, easily that we're going to be able to solve a lot of customer challenges with. >>So thank you for that. Terrick so let's stay on cloud a little bit. I mean, Altrix is undergoing a major transformation, big focus on the cloud. How does this cloud launch impact the partnership Terike from snowflakes perspective and then Barb, maybe, please add some color. >>Yeah, sure. Dave snowflake started as a cloud data platform. We saw our founders really saw the challenges that customers are having with becoming data-driven. And the biggest challenge was the complexity of having imagine infrastructure to even be able to do it, to get applications off the ground. And so we created something to be cloud-native. We created to be a SAS managed service. So now that that Altrix is moving to the same model, right? A cloud platform, a SAS managed service, we're just, we're just removing more of the friction. So we're going to be able to start to package these end to end solutions that are SAS based that are fully managed. So customers can, can go faster and they don't have to worry about all of the underlying complexities of, of, of stitching things together. Right? So, um, so that's, what's exciting from my viewpoint >>And I'll follow up. So as you said, we're investing heavily in the cloud a year ago, we had two pre desktop products, and today we have four cloud products with cloud. We can provide our users with more flexibility. We want to make it easier for the users to leverage their snowflake data in the Alteryx platform, whether they're using our beloved on-premise solution or the new cloud products were committed to that continued investment in the cloud, enabling our joint partner solutions to meet customer requirements, wherever they store their data. And we're working with snowflake, we're doing just that. So as customers look for a modern analytic stack, they expect that data to be easily accessible, right within a fast, secure and scalable platform. And the launch of our cloud strategy is a huge leap forward in making Altrix more widely accessible to all users in all types of roles, our GSI and our solution provider partners have asked for these cloud capabilities at scale, and they're excited to better support our customers, cloud and analytic >>Are. How about you go to market strategy? How would you describe your joint go to market strategy with snowflake? >>Sure. It's simple. We've got to work backwards from our customer's challenges, right? Driving transformation to solve problems, gain efficiencies, or help them save money. 
So whether it's with Snowflake or other GSIs or other partner types, we've outlined a joint journey together, from recruitment to solution development, activation, enablement, and then strengthening our go-to-market strategies to optimize our results together. We launched an updated partner program, and within that framework we've created new benefits for our partners around opportunity registration and new role-based enablement and training, basically extending everything we do internally for our own go-to-market teams to our partners. We're offering partner marketing resources and funding to reach new customers together. And as a matter of fact, we recently launched a fantastic video with Snowflake. I love this video; it very simply describes the path to insights starting with your Snowflake data. We do joint customer webinars, we're working on joint hands-on labs, and we have a wonderful landing page with a lot of assets for our customers. Once we have an interested customer, we engage our respective account managers, collaborating through discovery questions and proofs of concept, really showcasing the desired outcome. And when you combine that with our partners' technology or domain expertise, it's quite powerful. >>Tarik, how do you see it, your go-to-market strategy? >>Yeah, Dave, we initially sold Snowflake as technology, right? Positioning the architectural differentiators and the scale and concurrency. And we noticed, as we got up into the larger enterprise customers, we were starting to see how they solve their business problems using the technology, as well as them coming to us and saying, look, we also want to know how you continue to map back to the specific, prescriptive business problems we're having. And so we shifted to an industry focus last year, and this is an area where Alteryx has been mature probably since their inception, selling to the line of business, right? Having prescriptive use cases that are particular to an industry, like financial services, like retail, like healthcare and life sciences. And so Barb talked about these starter kits, where it's prescriptive: you've got a demo and a way that customers can get off the ground and running, right? Because we want to be able to shrink that time to market, the time to value, so customers can launch these applications, and we want to be able to tell them specifically how we can map back to their business initiatives. So I see a huge opportunity to align on these industry solutions. As Barb mentioned, we're already doing that: we've released a few around financial services, and we're working in healthcare and retail as well. So that is going to be a way for us to allow customers to go even faster and start to map to lines of business with Alteryx. >>Great, thanks, Tarik. Barb, what can we expect if we're observing this relationship? What should we look for in the coming year? >>A lot. Specifically with Snowflake, we'll continue to invest in the partnership. We're co-innovators in this journey, including Snowpark extensibility efforts, which Tarik will tell you more about shortly. We're also launching these great new strategic solution blueprints and extending those at no charge to our partners. With Snowflake, we're already collaborating with their retail and CPG team on industry blueprints, and we're working with their data marketplace team to highlight solutions working with the data in their marketplace.
More broadly, as I mentioned, we're relaunching the Alteryx partner program, designed to better support the unique partner types in our global ecosystem, introducing new benefits so that with every partner achievement or investment with Alteryx, we're providing our partners with earlier access to benefits. I could talk about our program for 30 minutes; I know we don't have time. The key message here: Alteryx is investing in our partner community across the business, recognizing the incredible value that they bring to our customers every day. >>Tarik, we'll give you the last word. What should we be looking for from you? >>Yeah, thanks, Dave. As Barb mentioned, Alteryx has been at the forefront of innovating with us. They've been integrating to make sure, again, that customers get the full investment out of Snowflake, things like the in-database pushdown that I talked about before. That extensibility is really what we're excited about: the ability for Alteryx to plug into this extensibility framework that we call Snowpark, and to be able to extend out the ways that end users can consume Snowflake, through SQL, which has traditionally been the way you consume Snowflake, as well as Java and Scala, and now Python. So we're excited about those capabilities. And then we're also excited about the ability to plug into the data marketplace to provide third-party data sets, whether that's third-party data sets in financial services or in retail. So now customers can build their data applications from end to end using Alteryx and Snowflake, with a comprehensive 360 view of their customers, of their partners, even of their employees. I think it's exciting to see what we're going to be able to do together with these upcoming innovations. >>Great. Barb, Tarik, thanks so much for coming on the program. We've got to leave it right there. In a moment, I'll be back with some closing thoughts and a summary. Don't go away. >>1200 hours of wind tunnel testing, 30 million race simulations, 2.4-second pit stops, make that 2.3. Sector times out the wazoo. Velocities, pressures, temperatures, 80,000 components generating 11.8 billion data points, and one analytics platform to make sense of it all. When McLaren needs to turn complex data into insights, they turn to Alteryx. Alteryx: analytics automation. >>Okay, let's summarize and wrap up the session. We can pretty much agree that data is plentiful, but organizations continue to struggle to get maximum value out of their data investments. The ROI has been elusive. There are many reasons for that: complexity, data trust, silos, lack of talent, and the like. But the opportunity to transform data operations and drive tangible value is immense. Collaboration across various roles and disciplines is part of the answer, as is democratizing data. This means putting data in the hands of those domain experts that are closest to the customer and really understand where the opportunities exist and how to best address them. We heard from Jay Henderson that we have all this data exhaust, and cheap storage allows us to keep it for a long time. It's true, but as he pointed out, that doesn't solve the fundamental problem. Data is spewing out from our operational systems, but much of it lacks business context for the data teams chartered with analyzing that data. >>So we heard about the trend toward low-code development and federating data access.
The reason this is important is because the business lines have the context and the more responsibility they take for data, the more quickly and effectively organizations are going to be able to put data to work. We also talked about the harmonization between centralized teams and enabling decentralized data flows. I mean, after all data by its very nature is distributed. And importantly, as we heard from Adam Wilson and Suresh Vittol to support this model, you have to have strong governance and service the needs of it and engineering teams. And that's where the trifecta acquisition fits into the equation. Finally, we heard about a key partnership between Altrix and snowflake and how the migration to cloud data warehouses is evolving into a global data cloud. This enables data sharing across teams and ecosystems and vertical markets at massive scale all while maintaining the governance required to protect the organizations and individuals alike. >>This is a new and emerging business model that is very exciting and points the way to the next generation of data innovation in the coming decade. We're decentralized domain teams get more facile access to data. Self-service take more responsibility for quality value and data innovation. While at the same time, the governance security and privacy edicts of an organization are centralized in programmatically enforced throughout an enterprise and an external ecosystem. This is Dave Volante. All these videos are available on demand@theqm.net altrix.com. Thanks for watching accelerating automated analytics in the cloud made possible by Altryx. And thanks for watching the queue, your leader in enterprise tech coverage. We'll see you next time.
SUMMARY :
It saw the need to combine and prep different data types so that organizations anyone in the business who wanted to gain insights from data and, or let's say use AI without the post isolation economy is here and we do so with a digital We're kicking off the program with our first segment. So look, you have a deep product background, product management, product marketing, And that results in a situation where the organization's, you know, the direction that your customers want to go and the problems that you're solving, what role does the cloud and really, um, you know, create a lot of the underlying data sets that are used in some of this, into the, to the business user with hyper Anna. of our designer desktop product, you know, really, as they look to take the next step, comes into the mix that deeper it angle that we talked about, how does this all fit together? analytics and providing access to all these different groups of people, um, How much of this you've been able to share with your customers and maybe your partners. Um, and, and this idea that they're going to move from, you know, So it's democratizing data is the ultimate goal, which frankly has been elusive for most You know, the data gravity has been moving to the cloud. So, uh, you know, getting everyone involved and accessing AI and machine learning to unlock seems logical that domain leaders are going to take more responsibility for data, And I think, you know, the exciting thing for us at Altryx is, you know, we want to facilitate that. the tail, or maybe the other way around, you mentioned digital exhaust before. the data and analytics layers that they have, um, really to help democratize the We take a deep dive into the Altryx recent acquisition of Trifacta with Adam Wilson It's go time, get ready to accelerate your data analytics journey the CEO of Trifacta. serving business analysts and how the hyper Anna acquisition brought you deeper into the with that in mind, you know, we know designer and are the products And Joe in the early days, talked about flipping the model that really birth Trifacta was, you know, why is it that the people who know the data best can't And so, um, that was really, you know, what, you know, the origin story of the company but the big data pipeline is hasn't gotten there. um, you know, there hasn't been a single platform for And now the data engineer, which is really And so, um, I think when we, when I sat down with Suresh and with mark and the team and, but specifically we're seeing, you know, I almost don't even want to call it a data warehouse anyway, Um, and we just have interfaces to collaborate And of course Trifacta is with cloud cloud data warehouses. What's the business analysts really need and how to design a cloud, and Trifacta really support both in the cloud, um, you know, Trifacta becomes a platform that can You're always in a position to be able to cleanse transform shape structure, that data, and ultimately to deliver, And I'm interested, you guys just had your sales kickoff, you know, what was their reaction like? And then you step back and you're going to share the vision with the field organization, and to close and announced, you know, at the kickoff event. 
And certainly the reception we got from, Well, I think the story hangs together really well, you know, one of the better ones I've seen in, in this space, And all of it has potential the potential to solve complex business problems, We're now moving into the eco systems segment the power of many Good to see So cloud migration, it's one of the hottest topics. on snowflake to consolidate data across systems into one data cloud with Altryx business the partnership, maybe a little bit about the history, you know, what are the critical aspects that we should really focus Yeah, so the relationship started in 2020 and all shirts made a big bag deep with snowflake And the best practices guide is more of a technical document, bringing together experiences and guidance So customers can, can leverage that elastic platform, that being the snowflake data cloud, one of the problems that you guys solved early on, but what are some of the common challenges or patterns or trends everyone has access to data and everyone can do something with data, it's going to make them competitively, application that they have in order to be competitive in order to be competitive. to enrich with your own data sets, to enrich with, um, with your suppliers and with your partners, So thank you for that. So now that that Altrix is moving to the same model, And the launch of our cloud strategy How would you describe your joint go to market strategy the path to insights starting with your snowflake data. You'll go to market strategy. And so we shifted to an industry focus So that is going to be a way for us to allow What should we look for in the coming year? blueprints, and extending that at no charge to our partners with snowflake, we're already collaborating with Tarik will give you the last word. Um, the ability for Ultrix to plug into this extensibility framework that we call Barb Tara, thanks so much for coming on the program, got to leave it right there in a moment, I'll be back with 11.8 billion data points and one analytics platform to make sense of it all. This means putting data in the hands of those domain experts that are closest to the customer are going to be able to put data to work. While at the same time, the governance security and privacy edicts
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Derek | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Suresh Vetol | PERSON | 0.99+ |
Altryx | ORGANIZATION | 0.99+ |
Jay | PERSON | 0.99+ |
Joe Hellerstein | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
Dave Volante | PERSON | 0.99+ |
Altrix | ORGANIZATION | 0.99+ |
Jay Henderson | PERSON | 0.99+ |
David | PERSON | 0.99+ |
Adam | PERSON | 0.99+ |
Barb | PERSON | 0.99+ |
Jeff | PERSON | 0.99+ |
2020 | DATE | 0.99+ |
Bob | PERSON | 0.99+ |
Trifacta | ORGANIZATION | 0.99+ |
Suresh Vittol | PERSON | 0.99+ |
Tyler | PERSON | 0.99+ |
Juniper | ORGANIZATION | 0.99+ |
Alteryx | ORGANIZATION | 0.99+ |
Ultrix | ORGANIZATION | 0.99+ |
30 minutes | QUANTITY | 0.99+ |
Terike | PERSON | 0.99+ |
Adam Wilson | PERSON | 0.99+ |
Joe | PERSON | 0.99+ |
Suresh | PERSON | 0.99+ |
Terrick | PERSON | 0.99+ |
demand@theqm.net | OTHER | 0.99+ |
thousands | QUANTITY | 0.99+ |
Alcon | ORGANIZATION | 0.99+ |
Kara | PERSON | 0.99+ |
last year | DATE | 0.99+ |
three | QUANTITY | 0.99+ |
Qualtrics | ORGANIZATION | 0.99+ |
less than 20% | QUANTITY | 0.99+ |
hundreds | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
One | QUANTITY | 0.99+ |
Java | TITLE | 0.99+ |
more than seven years | QUANTITY | 0.99+ |
two acquisitions | QUANTITY | 0.99+ |
Rik Tamm-Daniels, Informatica | AWS re:Invent 2021
>>Hey everyone. Welcome back to the cube. Live in Las Vegas, Lisa Martin, with Dave Nicholson, we are covering AWS reinvent 2021. This was probably one of the most important and largest hybrid tech events this year with AWS and its enormous ecosystem of partners. We're going to be talking with a hundred guests in the next couple of days. We started a couple of days ago and about really the innovation that's going to be going on in the cloud and tech in the next decade. We're pleased to welcome Rick Tam Daniel's as our next guest VP of strategic ecosystems at Informatica. Rick. Welcome to >>The program. Thank you for having me. It's a, it's a pleasure to be back. >>Isn't it nice to be back in person? Oh, it's amazing. All these conversations you just can't replicate by video conferencing. Absolutely >>Great to reconnect with folks haven't seen in a few years as well. >>Absolutely. That's been the sentiment. I think one of the, one of the sentiments that we've heard the last three days, so one of the things thematically that we've also been hearing about in, in between all of the plethora of AWS announcements, typical reinvent is that every company has to become a data company, public sector, private sector, small business, large business. Talk to us about how Informatica and AWS are helping companies become data companies so that they don't get left behind. >>But one of the biggest things that we're hearing at reinvent is that customers are really concerned with data, fragmentation, data silos, access to trusted data, and how do they, how do they get that information to really affect data led transformation? In fact, we did a survey earlier in the year of chief, the chief data officers were found that up to 80, almost 80% of organizations had 50% or more of their data in hybrid or multi-cloud environments. And also a 79% are looking to leverage more than 100 data sources. And 30% are looking to leverage more than 1000 data sources. So Informatica we, with our intelligent data management cloud, we're really focused on enabling customers to bring together the data assets, no matter where they live, what format they're in, on-premise cloud, multi-cloud bringing that all together. >>Well, we sold this massive scatter 22 months ago now, right? Of everyone just, and the edge exploded and data exploded and volumes and data sources exploded hard for organizations to get their head around that, to go or that the data is going to be living in all these different places. You talked about a lot of customers and every industry being hybrid multi-cloud because based on strategy, based on acquisition, but to get their arms around that data and to be able to actually extract value from it fast is going to be the difference between those businesses that succeed and those that don't >>Absolutely. And our partnership with AWS, that's a long standing partnership and we're very much focused on addressing the challenges you're talking about. Uh, and in fact, earlier this year we announced our cloud first, our cloud native, uh, data governance and data catalog on AWS, which is really focused on creating that central point of trusted data access and visibility for the organization. And just today, we had an announcement about how we're bringing data democratization and really accelerating data democratization for AWS lake formation. >>What is, when you, when you, we talk about data democratization often, what does that mean to you? What does that mean to Informatica? 
How do you deliver that to customers so that they can extract as much value as they can? >>Yeah, great question. And really, that whole data management journey is a big piece of this. So it starts with data discovery: how do I even begin to find my data assets? How do I get them from where they are to where they need to go in the cloud? How do I make sure they're clean, they're ready to use, I trust them, and I understand where they came from? And so the solution that we announced today is really focused on how we provide business users with a self-service way of getting access to data lake data sitting in Amazon S3, with Lake Formation governance, but doing it in a way that doesn't create an undue burden on those business users around data compliance and data policies. And so what we've done is we've brought our business-user-friendly, self-service experience, our Axon data marketplace, together with AWS Lake Formation. >>So Informatica has had a long history in the data world. I think of terms like MDM and ETL. Where does Informatica's history dovetail with the present day in terms of cloud? Does the concept of extract, transform, load, I think that's what ETL stood for, too many TLAs running around, where does that play in today's world? Are you focused separately on cloud from on-premise data centers, or do you link the two? >>Yeah, so we focus on addressing data management no matter where the data lives, so on-premise, cloud, multi-cloud. Our intelligent data management cloud platform is the industry's first end-to-end, cloud-native, as-a-service data management platform that delivers all those capabilities I mentioned before to customers. So we can manage all those workloads that are distributed, from a single cloud-based, as-a-service data management platform. >>So the platform is as a service in the cloud, but you could be managing data assets that are in traditional on-premises data centers and the like. Absolutely. >>Okay. >>So congratulations on the IPO. Of course, we can't not talk to Informatica without that. I imagine the momentum is probably pretty great right about now. When I think of AWS, I always think of momentum, I mean, the volume of announcements, but also when I think about AWS, I think about their absolute focus on the customer, that working-backwards approach. From a partnership perspective, is there alignment there? I imagine, like I said, with the IPO, a lot of momentum right now, probably a lot of excitement. Is Informatica as focused and customer-obsessed as AWS is? >>Yeah. So, first of all, thank you so much for the congratulations. We had a very successful IPO last month, and in fact, just yesterday our CEO, Amit Walia, presented our Q3 results, which showcased the continued growth of our subscription revenue and cloud revenue. In fact, our cloud revenue grew 44% year over year, which is really reflective of our big shift to being a cloud-first company, and also the success of our intelligent data management cloud platform. And that platform, again, as I mentioned, spans all those aspects of data management, and we're delivering that for more than 5,000 customers globally. And just from an adoption perspective, we process about 23 trillion transactions a month for customers in our cloud platform, and that's doubling every six to 12 months.
So it's incredible amount of adoption. Some of the biggest enterprises in the world like Unilever, Sanofi folks like that are using the cloud is their preferred data management platform of choice in the cloud. >>Well, you know, of course, congratulations is in order for the IPO, but also really on what you just mentioned, the trajectory of where Informatica is going, because Informatica wasn't born yesterday. Right. And, uh, we shouldn't overlook the fact that there are challenges associated with moving from the world as it exists on premises for still 80% of it spend at least navigating that transition, going from private to public, getting the right kind of investment where people realize that cloud is a significant barrier to entry, uh, for, for a lot of companies. I think it's, it's, you know, you have a lot of folks cheering for you as you navigate this transition. >>Well, one thing I do I say is, yes, we have it in the business of data for a long time, but we also then the business of cloud quite a long time. So this is true. This is the 10th reinvent. This is also the ten-year anniversary of the Informatica AWS partnership, right? So we've been working in the cloud with AWS for, for that long innovating all of these different, different core services. So, um, and from that perspective, you know, I think we're doing a tremendous amount of innovation together, you know, solutions like when we talked about for lake formation, but we also announced today a couple of key programs that we partnered with AWS around, around modernization and migration, right? So that's a big area of focus as well is how do we help customers modernize and take advantage of all the great services that AWS offers? So that's how we announced our membership and what's called the workload migration program and also the data lead migrations program, which is part of the public sector focus at AWS as well. >>The station perspective that was talked a lot about by Adam yesterday. And we've talked about it a lot today, every organization needs to monitorize, even some of those younger ones that you think, oh, aren't, they already, you know, fairly modern, but where, where are your customer conversations happening from a modernization perspective is that elevated up the, the C stat that we've got to modernize our or our organization get better handle of our data, be able to use it more protected, secure it so that we can be competitive and deliver outstanding customer experiences. >>What happens is the pain points that the legacy infrastructure has from the business perspective really do elevate the conversation to the C-suite. They're looking at saying, Hey, especially with the pandemic, right? We have to transform our business. We have to have data. We have to have trust in data. How do we do that? And we're not going to get there >>On rigid on-premise infrastructure. We need to be in a cloud native footprint. And so we've been focused on helping customers get to those cloud native end points, but also to a truly cloud native data management, we talked about earlier can manage all those different workloads, right? From a single that SAS serverless type experience. Right? What have been some of the interesting conversations that you've had here? Again, we are in person yep. Fresh off the IPO, lots of announcements coming out. You guys made announcements today. What's been the sentiment from the, those customers and partners that you've talked about. 
>>Well, I'll give you guys actually a little sneak preview of another announcement we have coming tomorrow, uh, with our friends at Databricks. So we, uh, we are announcing a data, data democratization solution with Databricks accelerating some of the same, you know, addressing some of the same challenges we were talking about here, but in the data breaks in the Lakehouse environment. Um, so, so, but around that, and I had a great conversation with some partners here, some of the global system integrators, and they're just so happy to see that, right, because a lot of the infrastructure that's around data lakes are lake formation. It's pretty technical it's for a technical audience. And, and Informatica has really been focused on how do we expand the base of users that are able to tap into data and that's through no code experiences, right? It's through visual experiences. And we bring that tightly coupled together with the performance and the power and scale of platforms like Databricks and the AWS Redshift and S3, it's really transformative for customers. >>What are some of the things that here we are wrapping up the 10th, re-invent almost as tomorrow, but also wrapping up the end of 2021. What are some of the things that th th that there's obviously a lot of momentum with Informatica right now that from a partnership perspective, anything that you, you just gave us some breaking news. Thank you. We always love that. What are some of the things that you're looking forward to in 2022 that you think are really going to help Informatica customers just be incredibly competitive and utilizing data in the cloud on prem to their maximum? >>Well, I think as we go into the next year data complexity data fragmentation, it's gonna continue to grow. It's, it's, it's exploding out there. Uh, and one of the key components of our platform or the IDMC platform is we call it Clare, which is the industry first kind of metadata driven AI engine. And what we've done is we've taken the intelligence of machine learning and AI, and brought that to the business of data management. And we truly believe that the way customers are going to tame that data, they're going to address those problems and continue to scale and keep up is leveraging the power of AI in a cloud native cloud, first data management platform. >>Excellent. Rick, thank you so much for joining us today. Again, congratulations on last month, Informatica IPO, great solid, strong, deep partnership with AWS. We thank you for your insights and best of luck next year. >>Awesome. Thank you so much. Pleasure being here. Our >>Pleasure to have you for my co-host David Nicholson, I'm Martin. You're watching the cube, the global leader in live tech coverage.
SUMMARY :
We started a couple of days ago and about really the innovation that's going to be It's a, it's a pleasure to be back. Isn't it nice to be back in person? that every company has to become a data company, public sector, private sector, But one of the biggest things that we're hearing at reinvent is that customers are really concerned with data, fast is going to be the difference between those businesses that succeed and those And just today, we had an announcement about how we're bringing data democratization And so the solution that we announced today So Informatica has had a long history in the data world. So we focus on, uh, addressing data management, uh, when, no matter where the data lives. The platform is, is as a service in the cloud, but you could be managing data assets that are So congratulations on the IPO. And that's doubling every six to 12 months. that cloud is a significant barrier to entry, uh, but we also announced today a couple of key programs that we partnered with AWS around, our organization get better handle of our data, be able to use it more protected, secure it so that we can really do elevate the conversation to the C-suite. What have been some of the interesting conversations that you've had here? some of the same, you know, addressing some of the same challenges we were talking about here, but in the data breaks in the Lakehouse environment. What are some of the things that here we are wrapping up the 10th, and brought that to the business of data management. We thank you for your insights and best of luck next year. Thank you so much. Pleasure to have you for my co-host David Nicholson, I'm Martin.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
David Nicholson | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Informatica | ORGANIZATION | 0.99+ |
Dave Nicholson | PERSON | 0.99+ |
Rick | PERSON | 0.99+ |
Unilever | ORGANIZATION | 0.99+ |
44% | QUANTITY | 0.99+ |
Sanofi | ORGANIZATION | 0.99+ |
Databricks | ORGANIZATION | 0.99+ |
80% | QUANTITY | 0.99+ |
Lisa Martin | PERSON | 0.99+ |
2022 | DATE | 0.99+ |
yesterday | DATE | 0.99+ |
Las Vegas | LOCATION | 0.99+ |
50% | QUANTITY | 0.99+ |
Martin | PERSON | 0.99+ |
next year | DATE | 0.99+ |
tomorrow | DATE | 0.99+ |
two | QUANTITY | 0.99+ |
Adam | PERSON | 0.99+ |
first | QUANTITY | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
more than 1000 data sources | QUANTITY | 0.99+ |
more than 100 data sources | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
79% | QUANTITY | 0.99+ |
last month | DATE | 0.99+ |
last month | DATE | 0.99+ |
more than 5,000 customers | QUANTITY | 0.99+ |
Rick Tam Daniel | PERSON | 0.99+ |
Rik Tamm-Daniels | PERSON | 0.99+ |
Wailea | ORGANIZATION | 0.99+ |
ten-year | QUANTITY | 0.99+ |
22 months ago | DATE | 0.98+ |
12 months | QUANTITY | 0.98+ |
30% | QUANTITY | 0.98+ |
first company | QUANTITY | 0.98+ |
this year | DATE | 0.97+ |
earlier this year | DATE | 0.97+ |
2021 | DATE | 0.97+ |
one | QUANTITY | 0.97+ |
Informatica AWS | ORGANIZATION | 0.96+ |
next decade | DATE | 0.96+ |
end of 2021 | DATE | 0.95+ |
up to 80 | QUANTITY | 0.95+ |
almost 80% | QUANTITY | 0.94+ |
about 23 trillion transactions a month | QUANTITY | 0.91+ |
next couple of days | DATE | 0.88+ |
single | QUANTITY | 0.88+ |
Ed Walsh and Thomas Hazel, ChaosSearch | JSON
>>Hi everybody, this is Dave Vellante. Welcome to this Cube conversation with Thomas Hazel, who is the founder and CTO of ChaosSearch. I'm also joined by Ed Walsh, who's the CEO. Thomas, good to see you. >>Great to be here. >>Explain JSON, first of all. What is it? >>JSON is a powerful data representation, a data source, but let's just say that when we try to drive value out of it, it gets complicated. At ChaosSearch, we activate customers' data lakes. So, you know, customers stream their JSON data to these cloud stores that we activate. Now, the trick is the complexity of a JSON data structure; you can build all this complexity into the representation. And here's the problem: putting that representation into an Elasticsearch database or a relational database is very problematic. So what people choose to do is pick and choose what they want, and/or they just store it as a blob. And so I said, what if we create a new index technology that could store it as a full representation, but dynamically, in what we call our data refinery, publish access to all the permutations that you may want? Because if you do a full flatten of that JSON, one row theoretically could be put into a million rows, and relational databases sort of explode. >>But then it gets really expensive. But everybody says they have JSON support; every database vendor that I talk to, it's a big announcement: we now support JSON. What's the deal? >>Exactly. So you take your relational database, with all those relational constructs, and you have a proprietary JSON API to pick and choose. So instead of picking and choosing up front, now you're picking and choosing in the back end, where you really want the power of relational analysis of that JSON data. And that's where Chaos comes in: we expand those data streams, and we do it in a relational way. So all that tooling you've come to know and love, now you have access to it. Whereas if you're doing proprietary APIs on JSON data, you're not using Looker, you're not using Tableau; you're doing some type of proprietary access on the back end. >>Okay. So you're saying all the tools that you've trained everybody on, you can't really use them; you've got to build some custom stuff. Okay, so maybe bring that home then in terms of what's the money: why do the suits care about this stuff? >>The reason this is so important is, think about anything cloud native: Kubernetes, your different applications, what you're doing in Mongo. It's all JSON. It's very powerful but painful, and if you're not keeping the data, what people are doing, the data scientists, is they're just leveling it down; they're saying, I'm going to only keep the first four things. So think about Kubernetes, think about your app logs. They're trying to figure out, for Black Friday, what happens. It's literally saying, hey, every minute they'll cut a new log, and you're able to say, listen, these are the users that were in that system for an hour, and here are the different things they did. The fact of the matter is, if you cut it off, you lose all that fidelity, all that data. So it's really important to have it, whether you're trying to figure out what happened for security, what happened for performance, or, if you're the VP of product or growth, how do I cross-sell things? >>You need to know what everyone's doing. If you're not handling JSON natively, like we're doing, either it keeps on expanding, on Black Friday all of a sudden the logs get huge,
and the next day they're not. But it's really powerful data that you need to harness for business value. It's what's going to drive growth; it's what's going to drive the digital transformation. So without the technology, you're kind of blind, and to be honest, you don't know, because the data scientist has kind of deleted the data on you. So this is big for the business and digital transformation, but also it was such a pain that the data scientists and DBAs were forced to just basically make it simple, so it didn't blow up their system. We allow them to keep it simple, but yes. >>Both powerful and painful. It reminds me of when you go on vacation and you've got your video camera; somebody breaks into your house, you go back to look and see who did it, and the data's gone. The video's gone, because you weren't able to save it, because it's too >>Expensive. Well, it's funny, this is the first data source that's driving the design of the database, because of all the value. We should be designing the database around the information it stores, not the structure and how it's been organized. And so our viewpoint is, you get to choose your structure, yet contain all that content. >>So if a vendor says, I'm a customer, and then says, hey, we've got JSON support, what questions should I ask to really peel the onion? >>Well, particularly relational: is it relational access to that data? Now, you could say, oh, I can ETL that JSON into it, but chances are, with the explosion of JSON permutations of one row into a million, they're probably not doing the full representation. So from our viewpoint, either you're doing blob-type access through proprietary JSON APIs, or you're picking and choosing; those are the choices the market offers. However, what if you could take the full representation and design your schema based on how you want to consume it, versus how you could store it? And that's a big difference with us. >>So I should be asking: how do I consume this data? Are you ETLing it in? How much data explosion is going to occur once I do this? And you're saying, for ChaosSearch, the answer to those questions is... >>The answer is, again, our philosophy: simply stream your data into your cloud object storage, your data lake, and with our index technology and our data refinery you get to create views dynamically, in an instant, whether it's a terabyte or a petabyte, and describe how you want your data to be consumed, in a relational way or an Elasticsearch way; both are consumable through our data refinery. >>For us, the refinery gives you the view. So what happens if someone wants a different view, if I want to actually unpack different columns or different metrics? You're able to do that in a virtual view; it's available immediately over petabytes of data. You don't have that episode where you come back, look at the video camera, and there's no data left. So that's... >>We do appreciate the time and the explanation on really understanding JSON. Thank you. All right, and thank you for watching this Cube conversation. This is Dave Vellante. We'll see you next time.
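To make the flattening explosion Thomas describes concrete, here is a minimal, self-contained Python sketch. It is not ChaosSearch's index technology or API, and the record fields are invented for illustration; it only shows how one JSON record with nested sibling arrays multiplies into many rows when it is fully flattened into a relational shape.

```python
# Illustration only: why fully flattening nested JSON multiplies rows.
# This is not ChaosSearch code; the record fields below are invented.
import itertools
import json


def flatten(record, prefix=""):
    """Flatten one JSON object into a list of flat relational rows.

    Nested objects become dotted column names. Each array contributes one
    partial row per element, and sibling arrays combine as a Cartesian
    product, which is what blows one record up into many rows.
    """
    scalars = {}
    expansions = []  # one list of partial rows per nested object or array

    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            expansions.append(flatten(value, prefix=f"{column}."))
        elif isinstance(value, list):
            rows = []
            for item in value:
                if isinstance(item, dict):
                    rows.extend(flatten(item, prefix=f"{column}."))
                else:
                    rows.append({column: item})
            expansions.append(rows or [{}])  # keep the record if the list is empty
        else:
            scalars[column] = value

    # Merge the scalar columns with every combination of the expanded lists.
    flattened = []
    for combo in itertools.product(*expansions):
        row = dict(scalars)
        for partial in combo:
            row.update(partial)
        flattened.append(row)
    return flattened


if __name__ == "__main__":
    # One log event: two sibling arrays (3 pages x 2 tags) -> 6 flat rows.
    event = {
        "user": {"id": 42, "plan": "pro"},
        "pages": [{"url": "/home"}, {"url": "/cart"}, {"url": "/checkout"}],
        "tags": ["black-friday", "mobile"],
    }
    rows = flatten(event)
    print(f"1 JSON record -> {len(rows)} rows")
    for row in rows:
        print(json.dumps(row))
```

With three sibling arrays of, say, 100 elements each, the same record would flatten to 1,000,000 rows, which is the "one row to a million" explosion mentioned in the conversation, and why the usual choices are picking and choosing fields up front, storing the record as a blob, or, as described above, indexing the full representation without materializing the flatten.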
SUMMARY :
Good to see you. First of all, what where if you do a full-on flattening of that JSON, one row theoretically What's the deal? So you take your relational database with all those relational constructs and you have a proprietary You got to build some custom The fact of the matter is if you cut it off, you lose all that And to be honest, you don't know. It reminds me if you like, go on vacation, you got your video camera. And so our viewpoint is you It says to kind of, I'm a customer then says, Hey, we got JSON support. However, what if you could take all the permutations and design your schema based on how you want to Bring it in how much data explosion is going to occur. whether it's a terabyte or petabyte, and describe how you want your data to be consumed in a relational way You don't have that episode where you come back, look at the video camera. And thank you for watching this cube conversation.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave Volante | PERSON | 0.99+ |
Brian | PERSON | 0.99+ |
Jason | PERSON | 0.99+ |
Thomas Hazel | PERSON | 0.99+ |
Lilly | PERSON | 0.99+ |
Ed Walsh | PERSON | 0.99+ |
JSON | PERSON | 0.99+ |
Thomas | PERSON | 0.99+ |
first day | QUANTITY | 0.99+ |
black Friday | EVENT | 0.99+ |
an hour | QUANTITY | 0.98+ |
both | QUANTITY | 0.97+ |
Both | QUANTITY | 0.97+ |
ed Walsh | PERSON | 0.97+ |
Tableau | TITLE | 0.95+ |
first four things | QUANTITY | 0.94+ |
Kubernetes | TITLE | 0.93+ |
one row | QUANTITY | 0.92+ |
Mongo | ORGANIZATION | 0.9+ |
Jason | ORGANIZATION | 0.89+ |
ChaosSearch | ORGANIZATION | 0.89+ |
a million | QUANTITY | 0.88+ |
next day | DATE | 0.86+ |
Jason | TITLE | 0.81+ |
First | QUANTITY | 0.74+ |
million rows | QUANTITY | 0.73+ |
ETL | ORGANIZATION | 0.7+ |
petabytes | QUANTITY | 0.69+ |
Looker | ORGANIZATION | 0.66+ |
DBS | ORGANIZATION | 0.58+ |
Jaison | PERSON | 0.52+ |
Lucas | PERSON | 0.49+ |
UNLIST TILL 4/2 - Sizing and Configuring Vertica in Eon Mode for Different Use Cases
>> Jeff: Hello everybody, and thank you for joining us today, in the virtual Vertica BDC 2020. Today's Breakout session is entitled, "Sizing and Configuring Vertica in Eon Mode for Different Use Cases". I'm Jeff Healey, and I lead Vertica Marketing. I'll be your host for this Breakout session. Joining me are Sumeet Keswani, and Shirang Kamat, Vertica Product Technology Engineers, and key leads on the Vertica customer success needs. But before we begin, I encourage you to submit questions or comments during the virtual session, you don't have to wait, just type your question or comment in the question box below the slides, and click submit. There will be a Q&A session at the end of the presentation, we will answer as many questions as we're able to during that time, any questions we don't address, we'll do our best to answer them off-line. Alternatively, visit Vertica Forums, at forum.vertica.com, post your question there after the session. Our Engineering Team is planning to join the forums to keep the conversation going. Also as reminder, that you can maximize your screen by clicking the double arrow button in the lower-right corner of the slides, and yes, this virtual session is being recorded, and will be available to view on-demand this week. We'll send you a notification as soon as it's ready. Now let's get started! Over to you, Shirang. >> Shirang: Thanks Jeff. So, for today's presentation, we have picked Eon Mode concepts, we are going to go over sizing guidelines for Eon Mode, some of the use cases that you can benefit from using Eon Mode. And at last, we are going to talk about, some tips and tricks that can help you configure and manage your cluster. Okay. So, as you know, Vertica has two modes of operation, Eon Mode and Enterprise Mode. So the question that you may have is, which mode should I implement? So let's look at what's there in the Enterprise Mode. Enterprise Mode, you have a cluster, with general purpose compute nodes, that have locally at their storage. Because of this tight integration of compute and storage, you get fast and reliable performance all the time. Now, amount of data that you can store in Enterprise Mode cluster, depends on the total disk capacity of the cluster. Again, Enterprise Mode is more suitable for on premise and cloud deployments. Now, let's look at Eon Mode. To take advantage of cloud economics, Vertica implemented Eon Mode, which is getting very popular among our customers. In Eon Mode, we have compute and storage, that are separated by introducing S3 Bucket, or, S3 compliant storage. Now because of this separation of compute and storage, you can take advantages like mapping all dynamic scale-out and scale-in. Isolation of your workload, as well as you can load data in your cluster, without having to worry about the total disk capacity of your local nodes. Obviously, you know, it's obvious from what they accept, Eon Mode is suitable for cloud deployment. Some of our customers who take advantage of the features of Eon Mode, are also deploying it on premise, by introducing S3 compliant slash web storage. Okay? So, let's look at some of the terminologies used in Eon Mode. The four things that I want to talk about are, communal storage. It's a shared storage, or S3 compliant shared storage, a bucket that is accessible from all the nodes in your cluster. Shard, is a segment of data, stored on the communal storage. Subscription, is the binding with nodes and shards. And last, depot. 
The depot is a local copy, or a local cache, that can help improve query performance. So, a shard is a segment of data stored in communal storage. When you create an Eon Mode cluster, you have to specify the shard count. The shard count decides the maximum number of nodes that will participate in your query. Vertica also will introduce a shard, called the replica shard, that will hold the data for replicated projections. Subscriptions, as I said before, are the binding between nodes and shards. Each node subscribes to one or more shards, and a shard has at least two nodes that subscribe to it, for K-safety. Subscribing nodes are responsible for writing and reading the shard data. Also, a subscriber node holds up-to-date metadata for the catalog of files that are present in the shard. So, when you connect to a Vertica node, Vertica will automatically assign you a set of nodes and subscriptions that will process your query. There are two important system tables, node subscriptions and session subscriptions, that can help you understand this a little bit more. So let's look at what's on the local disk of your Eon Mode cluster. On the local disk, you have the depot. The depot is a local file system cache that can hold a subset of the data, or a copy of the data, in communal storage. Other things that are there are temp storage, which is used for storing data belonging to temporary tables, and the data that spills to disk when you are processing queries. And last is the catalog. The catalog is a persistent copy of the Vertica catalog that is written to disk. The writes happen at every commit. You only need the persistent copy at node startup. There is also a copy of the Vertica catalog stored in communal storage, for durability. The local copy is synced to the copy in communal storage via a service, at an interval of five minutes. So, let's look at the depot. Now, as I said before, the depot is your file system cache. It helps to reduce network traffic and improve the performance of your queries. So, we make the assumption that when we load data into Vertica, that's the data that you may most frequently query. So, all data that is loaded into Vertica first enters the depot, and then, as part of the same transaction, is also synced to communal storage for durability. So, when you run a query against Vertica, your queries are also going to look for the files in the depot first, to be used, and if the files are not found, the queries will access files from communal storage. Now, the behavior of, you know, whether new files should first enter the depot or skip the depot can be changed by configuration parameters that can help you skip the depot when writing. When the files are not found in the depot, we make the assumption that you may need those files for future runs of your query. Which means we will fetch them asynchronously into the depot, so that you have those files for future runs. If that's not the behavior that you intend, you can change a configuration parameter to tell Vertica to not fetch them when you run your query, and this configuration parameter can be set at the database level, session level, and query level, and we are also introducing a user-level parameter where you can change this behavior. Because the depot is going to be limited in size compared to the amount of data that you may store in your Eon cluster, at some point in time your depot will be full, or hit its capacity. To make space for new data that is coming in, Vertica will evict some of the files that are least frequently used.
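As a concrete illustration of the depot knobs just described, here is a short sketch. The parameter names and syntax below are assumptions drawn from recent Vertica documentation rather than anything stated in the session, so treat them as illustrative and verify them against your own release.

```sql
-- Session-level: load without writing new files into the depot
-- (data goes straight to communal storage instead of being cached on write).
ALTER SESSION SET UseDepotForWrites = 0;

-- Session-level: stop the background fetch of files that a query had to
-- read from communal storage, if you don't want them cached for reruns.
ALTER SESSION SET DepotOperationsForQuery = 'NONE';

-- The same behavior can be set database-wide; older releases expose it
-- through the SET_CONFIG_PARAMETER meta-function instead.
SELECT SET_CONFIG_PARAMETER('UseDepotForWrites', 0);
```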
So the depot is going to be your query performance enhancer. You want to shape the extent of your depot. And so what you want to do is decide what shall be in your depot. Now Vertica provides some policies, called pinning policies, that can help you pin a specific table, or a partition of a table, into the depot, at the subcluster level or at the database level. And Sumeet will talk about this a bit more in his upcoming slides. Now, look at some of the system tables that can help you understand the size of the depot, what's in your depot, what files were evicted, and what files were recently fetched into the depot. One of the important system tables that I have listed here is DC_FILE_READS. DC_FILE_READS can be used to figure out if your transaction or query fetched its data from the depot, from communal storage, or both. One of the important features of Eon Mode is subclusters. Vertica lets you divide your cluster into smaller execution groups. Now, each of the execution groups has a set of nodes that together subscribe to all the shards, and can process your query independently. So when you connect to one node in a subcluster, that node, along with the other nodes in that subcluster, will be the only ones to process your query. And because of that, we can achieve isolation, as well as, you know, fast scale-out and scale-in without impacting what's happening on the rest of the cluster. The good thing about subclusters is that all the subclusters have access to the communal storage. And because of this, if you load data in one subcluster, it's accessible to the queries that are running in other subclusters. When we introduced subclusters, we knew that our customers would really love these features, and some of the things that we were considering were: we knew that our customers would dynamically scale out and in, they would add and remove lots of subclusters on demand, and we had to provide the ability to add and remove subclusters in a fast and reliable way. We knew that during off-peak hours, our customers would shut down many of their subclusters; that means more than half of the nodes could be down. And we had to make adjustments to our quorum policy, which requires at least half of the nodes to be up for the database to stay up. We also were aware that customers would add hundreds of nodes to the cluster, which means we had to make adjustments to the catalog and commit policy. To take care of all these three requirements we introduced two types of subclusters: primary subclusters and secondary subclusters. The primary subcluster is the one that you get by default when you create your first Eon cluster. The nodes in the primary subcluster are always up; that means they stay up and participate in the quorum. The nodes in the primary subcluster are responsible for processing commits, and also maintain a persistent copy of the catalog on disk. This is the subcluster that you would use to process all your ETL jobs, because the Tuple Mover also runs on the nodes in the primary subcluster. If at this point you want another subcluster, where you would like to run queries, and also scale it up and down depending on the demand, or depending on the workload, you would create a new subcluster. And this subcluster will be, of course, secondary in nature. Now secondary subclusters have nodes that don't participate in quorums, so if these nodes are down, Vertica sees no impact.
These nodes are also not responsible for processing commits, though they maintain up-to-date copies of the catalog in memory. They don't store the catalog on disk. And these are subclusters that you can add and remove very quickly, without impacting what is running on the other subclusters. We have customers running hundreds of nodes, subclusters with hundreds of nodes, and subclusters of a size like 64 nodes, and they can bring such a subcluster up and down, or add and remove it, within a few minutes. So before I go into the sizing of Eon Mode, I just want to say one more thing here. We are working very closely with some of our customers who are running Eon Mode and getting their feedback on a regular basis. And based on the feedback, we are making lots of improvements and fixes in every hot-fix that we put out. So if you are running Eon Mode, and want to be part of this group, I suggest that you keep your cluster current with the latest hot-fixes and work with us to give us feedback, and get the improvements that you need to be successful. So let's look at what we need to size Eon clusters. Sizing Eon clusters is very different from sizing an Enterprise Mode cluster. When you're sizing a Vertica cluster running Enterprise Mode, you need to take into account the amount of data that you want to store, and the configuration of your nodes. Depending on that, you decide how many nodes you will need, and then start the cluster. In Eon Mode, to size a cluster, you need a few things, like what your shard count should be. Now, the shard count decides the maximum number of nodes that will participate in your query. And we'll talk about this a little bit more in the next slide. You will decide on the number of nodes that you will need within a subcluster, the instance type you will pick for running this subcluster, how many subclusters you will need, how many of them should be running all the time, and how many should be running in a dynamic mode. When it comes to shard count, you have to pick the shard count up front, and you can't change it once your database is up and running. So, you need to pick the shard count based on the number of nodes, that is, the number of nodes that you will need to process a query. Now one thing that we want to remember here is that this is not the amount of data that you have in the database, but the amount of data your queries will process. So, you may have data for six years, but if your queries process the last month of data on most occasions, or if your dashboards are processing up to six weeks, or ten minutes, based on whatever your needs are, you will pick the number of shards, the shard count and nodes, based on how much data your queries process. Looking at most of our customers, we think that 12 is a good number that should work for most of our customers. And that means the maximum number of nodes in a subcluster that will process queries is going to be 12. If you feel that you need more than 12 nodes to process your query, you can pick other numbers like 24 or 48. If you pick a higher number, like 48, and you go with three nodes in your subcluster, that means each node subscribes to 16 primary and 16 secondary shard subscriptions, which totals 32 subscriptions per node. That will leave your catalog in a broken state.
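A quick way to sanity-check the subscription arithmetic just described is to look at the distribution directly. The query below uses the node subscriptions system table mentioned earlier in the talk; the exact column set varies by Vertica version, so treat this as a sketch rather than a definitive query.

```sql
-- Count shard subscriptions per node; in a healthy layout every node in a
-- subcluster carries roughly the same, small number.
-- Example: 12 shards across 12 nodes -> a handful of subscriptions each.
--          48 shards across  3 nodes -> 16 primary + 16 secondary = 32 per
--          node, the overloaded case described above.
SELECT node_name, COUNT(*) AS shard_subscriptions
FROM node_subscriptions
GROUP BY node_name
ORDER BY node_name;
```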
So, pick the shard count appropriately, and don't pick prime numbers; we suggest 12 should work for most of our customers. If you think you process more than, you know, the regular amount, or you think your queries process terabytes of data, then pick a number like 24. Don't pick a prime number. Okay? We are also coming up with features in Vertica, like crunch scaling, that will help you run queries on more nodes than the number of shards that you pick. And that feature will be coming out soon. So if you have picked a smaller shard count, it's not the end of the story. Now, the next thing is, you need to pick how many nodes you need within your subclusters to process your query. The ideal number would be a node count equal to the shard count, or, if you want to pick a number that is less, pick a node count such that each of the nodes has a balanced distribution of subscriptions. So over here, you have the option where you can have 12 nodes and 12 shards, or you can have two subclusters with 6 nodes and 12 shards. Depending on your workload, you can pick either of the two options. The first option, where you have 12 nodes and 12 shards, is more suitable for batch applications, whereas two subclusters, with six nodes each, is more suitable for dashboard-type applications. Picking subclusters depends on your workload; you can add or remove nodes for workload isolation, or for elastic throughput scaling. Your subclusters can have nodes of different sizes, but you need to make sure that the nodes within a given subcluster are homogeneous. So this is my last slide before I hand over to Sumeet. And this, I think, is a very important slide that I want you to pay attention to. When you pick an instance, you are going to pick it based on workload and query budget. I want to make it clear here that we want you to pay attention to the local disk, because you have the depot on your local disk, which is going to be your query performance enhancer for all kinds of deployments, in the cloud as well as on premise. So irrespective of what you read, or what you heard, depots still play a very important role in every Eon deployment, and they act like performance enhancers. Most of our customers choose Vertica because they love the performance we offer, and we don't want you to compromise on the performance. So pick nodes with some amount of local disk; at least two terabytes is what we suggest. i3 instances in Amazon, you know, come with a good local disk that is very helpful, and that some of our customers are benefiting from. With that said, I want to pass it over to Sumeet. >> Sumeet: So, hi everyone, my name is Sumeet Keswani, and I'm a Product Technology Engineer at Vertica. I will be discussing the various use cases that customers deploy in Eon Mode. After that, I will go into some technical details of SQL, and then I'll blend that into the best practices in Eon Mode. And finally, we'll go through some tips and tricks. So let's get started with the use cases. A very basic use case that users will encounter, when they start Eon Mode the first time, is they will have two subclusters. The first subcluster will be the primary subcluster, used for ETL, like Shirang mentioned. And this subcluster will be mostly on, or always on. And there will be another subcluster used purely for queries. This subcluster is the secondary subcluster, and it will be on sometimes, depending on the use case.
Maybe from nine to five, or Monday to Friday, depending on what application is running on it, or what users are doing on it. So this is the most basic use case, something users get started with to get their feet wet. Now as the use of the deployment of Eon Mode with subcluster increases, the users will graduate into the second use case. And this is the next level of deployment. In this situation, they still have the primary subcluster which is used for ETL, typically a larger subcluster where there is more heavier ETL running, pretty much non-stop. Then they have the usual query subcluster which will use for queries, but they may add another one, another secondary subcluster for ad-hoc workloads. The motivation for this subcluster is to isolate the unpredictable workload from the predictable workload, so as not to impact certain isolates. So you may have ad-hoc queries, or users that are running larger queries or bad workloads that occur once in a while, from running on a secondary subcluster, on a different secondary subcluster, so as to not impact the more predictable workload running on the first subcluster. Now there is no reason why these two subclusters need to have the same instances, they can have different number of nodes, different instance types, different depot configurations. And everything can be different. Another benefit is, they can be metered differently, they can be costed differently, so that the appropriate user or tenant can be billed the cost of compute. Now as the use increases even further, this is what we see as the final state of a very advanced Eon Mode deployment here. As you see, there is the primary subcluster of course, used for ETL, very heavy ETL, and that's always on. There are numerous secondary subclusters, some for predictable applications that have a very fine-tuned workload that needs a definite performance. There are other subclusters that have different usages, some for ad-hoc queries, others for demanding tenants, there could be still more subclusters for different departments, like Finance, that need it maybe at the end of the quarter. So very, very different applications, and this is the full and final promise of Eon, where there is workload isolation, there is different metering, and each app runs in its own compute space. Okay, so let's talk about a very interesting feature in Eon Mode, which we call Hibernate and Revive. So what is Hibernate? Hibernating a Vertica database is the act of dissociating all the computers on the database, and shutting it down. At this point, you shut down all compute. You still pay for storage, because your data is in the S3 bucket, but all the compute has been shut down, and you do not pay for compute anymore. If you have reserved instances, or any other instances you can use them for different applications, and your Vertica database is shut down. So this is very similar to stop database, in Eon Mode, you're stopping all compute. The benefit of course being that you pay nothing anymore for compute. So what is Revive, then? The Revive is the opposite of Hibernate, where you now associate compute with your S3 bucket or your storage, and start up the database. There is one limitation here that you should be aware of, is that the size of the database that you have during Hibernate, you must revive it the same size. So if you have a 12-node primary subcluster when hibernating, you need to provision 12 nodes in order to revive. 
So one best practice comes down to this: you must shrink your database to the smallest size possible before you hibernate, so that you can revive it at the same size, and you don't have to spin up a ton of compute in order to revive. So basically, what this means is, when you have decided to hibernate, we ask you to remove all your secondary subclusters and shrink your primary subcluster down to the bare minimum before you hibernate it. And the benefit is, when you do revive, you will be able to do so with the minimum number of nodes. And of course, before you hibernate, you must cleanly shut down the database, so that all the data can be synced to S3. Finally, let's talk about backups and replication. Backups and replication are still supported in Eon Mode. We sometimes get the question, "We're in S3, and S3 has nine nines of reliability, do we need a backup?" Yes, we highly recommend backups. You can back up by using the VBR script, you can back up your database to another bucket, and you can also copy the bucket and revive a different instance of your database. This is very useful because many times people want staging or development databases, and they need some of the data from production, and this is a nice way to get that. And it also makes sure that if you accidentally delete something you will be able to get back your data. Okay, so let's go into best practices now. To start, let's talk about the depot first, which is the biggest performance enhancer that we see for queries. So, I want to state very clearly that reading from S3, or a remote object store like S3, is very slow, because data has to go over the network, and it's very expensive. You will pay an access cost. This is where S3 is not very cheap: every time you access the data, there is an API access cost levied. Now the depot is a performance-enhancing feature that will improve the performance of queries by keeping a local cache of the data that is most frequently used. It will also reduce the cost of accessing the data, because you no longer have to go to the remote object store to get the data, since it's available on a local and permanent volume. Hence depot shaping is a very important aspect of performance tuning in an Eon database. What we ask you to do is, if you are going to use a specific table or partition frequently, you can choose to pin it in the depot, so that if your depot is under pressure or is highly utilized, these objects that are most frequently used are kept in the depot. So therefore, depot shaping is the act of setting eviction policies, whereby you prevent the eviction of files that you believe you need to keep. So, for example, you may keep the most recent year's data, or the most recent partition, in the depot, and thereby all queries running on those partitions will be faster. At this time, we allow you to pin any table or partition in the depot, but it is not subcluster-based. Future versions of Vertica will allow you to fine-tune the depot for each subcluster. So, let's now go and understand a little bit of the internals of how a SQL query works in Eon Mode. And once I explain this, we will blend into best practices, and it will become much clearer why we recommend certain things. So, S3 is our layer of durability, where data is persistent in an Eon database. When you run an insert query, like INSERT INTO table VALUES (1), or something similar, data is synchronously written into S3.
So, before control returns back to the client, the copy of the data is first stored in the local depot, and then uploaded to S3. And only then do we hand the control back to the client. This ensures that if something bad were to happen, the data will be persistent. The second type of SQL transactions are what we call DDLs, which are catalog operations. So for example, you create a table, or you add a column. These operations are actually working with metadata. Now, as you may know, S3 does not offer mutable storage; the storage in S3 is immutable. You can never append to a file in S3. And the way transaction logs work is, they are append operations. So when you modify the metadata, you are actually appending to a transaction log. This poses an interesting challenge, which we resolve by appending to the transaction log locally in the catalog, and then there is a service that syncs the catalog to S3 every five minutes. So this poses an interesting challenge, right: if you were to destroy or delete an instance abruptly, you could lose the commits that happened in the last five minutes. And I'll speak to this more in the subsequent slides. Now, finally, let's look at drops or truncates in Eon. A drop or a truncate is really a combination of the first two things that we spoke about. When you drop a table, you are making a metadata change. You are telling Vertica that this table no longer exists, so we go into the transaction log and append a record that this table has been removed. This log, of course, will be synced every five minutes to S3, like we spoke about. There is also the secondary operation of deleting all the files that were associated with data in this table. Now these files are on S3. We could go about deleting them synchronously, but that would take a lot of time, and we do not want to hold up the client for this duration. So at this point, we do not synchronously delete the files; we put the files that need to be removed in a reaper queue, and return the control back to the client. And this has a performance benefit, in that the drops appear to occur really fast. This also has a cost benefit: batching deletes, in big batches, is more performant and less costly. For example, on Amazon, you could delete 1,000 files at a time in a single call. So if you batched your deletes, you could delete them very quickly. The disadvantage of this is that if you were to terminate a Vertica cluster abruptly, you could leak files in S3, because the reaper queue would not have had the chance to delete these files. Okay, so let's go into best practices after understanding some technical details. So, as I said, reading and writing to S3 is slow and costly. So, the first thing you can do is avoid as many round trips to S3 as possible. The bigger the batches of data you load, the better the performance you get per commit. The fact is, don't read and write from S3 if you can avoid it. A lot of our customers have intermediate data processing, where they temporarily transform the data before finally committing it. There is no reason to use regular tables for this kind of intermediate data. We recommend using local temporary tables, and local temporary tables have the benefit of not having to upload data to S3. Finally, there is another optimization you can make. Vertica has the concept of active partitions and inactive partitions.
Active partitions are the ones where you have recently loaded data, and Vertica is lazy about merging these partitions into a single ROS container. Inactive partitions are historical partitions; consider last year's data, or the year before that. Those partitions are aggressively merged into a single container. And how do we know how many partitions are active and inactive? Well, that's based on a configuration parameter. If you load into an inactive partition, Vertica is very aggressive about merging these containers, so we download the entire partition, merge the records that you loaded into it, and upload it back again. This creates a lot of network traffic, and as I said, accessing data from S3 is slow and costly. So we recommend you not load into inactive partitions. You should load into the most recent or active partitions, and if you happen to load into inactive partitions, set your active partition count correctly. Okay, let's talk about the reaper queue. Depending on the velocity of your ETL, you can pile up a lot of files that need to be deleted asynchronously. If you were to terminate a Vertica cluster without allowing enough time for these files to get deleted, you could leak files in S3. Now, of course, if you use local temporary tables this problem does not occur, because the files were never created in S3. But if you are using regular tables, you must allow Vertica enough time to delete these files, and you can change the interval at which we delete, and how much time we allow for deletes during shutdown, by editing some configuration parameters that I have mentioned here. Okay, so let's talk a little bit about the catalog at this point. The catalog is synced every five minutes onto S3 for persistence. And the catalog truncation version is the minimal viable version of the catalog to which we can revive. So, for instance, if somebody destroyed a Vertica cluster, the entire Vertica cluster, the catalog truncation version is the minimum viable version that you will be able to revive to. Now, in order to make sure that the catalog truncation version is up to date, you must always shut down your Vertica cluster cleanly. This allows the catalog to be synced to S3. Now, here are some SQL commands that you can use to see what the catalog truncation version is on S3. For the most part, you don't have to worry about this if you're shutting down cleanly; this is only in cases of disaster, or some event where all nodes were terminated without the user's permission. And finally, let's talk about backups. So, one more time, we highly recommend you take backups. You know, S3 is designed for 99.9% availability, so there could be an occasional down-time, and making sure you have backups will help you if you accidentally drop a table. S3 will not protect you against data that was deleted by accident, so having a backup helps you there. And why not back up, right? Storage is cheap. You can replicate the entire bucket and have that as a backup, or have a DR copy running in a different region, which also serves as a backup. So, we highly recommend that you make backups. So, with this I would like to end my presentation, and we're ready for any questions if you have them. Thank you very much. Thank you very much.
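To pull a few of the best practices from this session together in one place, here is a small sketch. The object names are made up, and the function, clause, and parameter names are assumptions drawn from Vertica's documentation rather than from the talk itself, so verify them against your release before relying on them.

```sql
-- 1. Depot shaping: pin a hot table (or one of its partitions) so it
--    resists eviction and queries keep hitting the local cache.
SELECT SET_DEPOT_PIN_POLICY_TABLE('public.fact_sales');

-- 2. Check whether recent statements were served from the depot or had to
--    reach out to communal storage (the DC_FILE_READS table from the talk).
SELECT * FROM dc_file_reads ORDER BY time DESC LIMIT 20;

-- 3. Intermediate/staging work: use a local temporary table so nothing is
--    uploaded to S3 and nothing lands in the reaper queue.
CREATE LOCAL TEMPORARY TABLE stage_clicks
    ON COMMIT PRESERVE ROWS
    AS SELECT * FROM raw_clicks WHERE event_date = CURRENT_DATE;

-- 4. Loads should target active partitions; if you routinely load a
--    trailing window, raise the table's active partition count.
ALTER TABLE fact_sales SET ACTIVEPARTITIONCOUNT 2;
```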
SUMMARY :
Also as reminder, that you can maximize your screen and get the improvements that you need to be successful. So, the first thing you can do is,
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jeff | PERSON | 0.99+ |
Sumeet | PERSON | 0.99+ |
Sumeet Keswani | PERSON | 0.99+ |
Shirang Kamat | PERSON | 0.99+ |
Jeff Healey | PERSON | 0.99+ |
6 nodes | QUANTITY | 0.99+ |
Vertica | ORGANIZATION | 0.99+ |
five minutes | QUANTITY | 0.99+ |
six years | QUANTITY | 0.99+ |
ten minutes | QUANTITY | 0.99+ |
12 nodes | QUANTITY | 0.99+ |
Shirang | PERSON | 0.99+ |
1,000 files | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
12 shards | QUANTITY | 0.99+ |
forum.vertica.com | OTHER | 0.99+ |
99.9% | QUANTITY | 0.99+ |
two modes | QUANTITY | 0.99+ |
S3 | TITLE | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
first subcluster | QUANTITY | 0.99+ |
first time | QUANTITY | 0.99+ |
two options | QUANTITY | 0.99+ |
first | QUANTITY | 0.99+ |
first option | QUANTITY | 0.99+ |
each | QUANTITY | 0.99+ |
two subclusters | QUANTITY | 0.99+ |
Each node | QUANTITY | 0.99+ |
hundreds of nodes | QUANTITY | 0.99+ |
Today | DATE | 0.99+ |
each app | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
last year | DATE | 0.99+ |
second | QUANTITY | 0.99+ |
One | QUANTITY | 0.98+ |
three nodes | QUANTITY | 0.98+ |
SQL | TITLE | 0.98+ |
Eon Mode | TITLE | 0.98+ |
single container | QUANTITY | 0.97+ |
this week | DATE | 0.97+ |
16 secondary shard subscription | QUANTITY | 0.97+ |
two types | QUANTITY | 0.97+ |
Sizing and Configuring Vertica in Eon Mode for Different Use Cases | TITLE | 0.97+ |
Vertica | TITLE | 0.97+ |
one limitation | QUANTITY | 0.97+ |
Christian Romming, Etleap | AWS re:Invent 2019
>>LA from Las Vegas. It's the cube covering AWS reinvent 2019, brought to you by Amazon web services and along with its ecosystem partners. >>Oh, welcome back. Inside the sands, we continue our coverage here. Live coverage on the cube of AWS. Reinvent 2019. We're in day three at has been wall to wall, a lot of fun here. Tuesday, Wednesday now Thursday. Dave Volante. I'm John Walls and we're joined by Christian Rahman who was the founder and CEO of for Christian. Good morning to you. Good morning. Thanks for having afternoon. If you're watching on the, uh, on the East coast right now. Um, let's talk about sleep a little bit. I know you're all about data, um, but let's go ahead and introduce the company to those at home who might not be familiar with what your, your poor focus was. The primary focus. Absolutely. So athlete is a managed ETL as a service company. ETL is extract, transform, and load basically about getting data from different data sources, like different applications and databases into a place where it can be analyzed. >>Typically a data warehouse or a data Lake. So let's talk about the big picture then. I mean, because this has been all about data, right? I mean, accessing data, coming from the edge, coming from multiple sources, IOT, all of this, right? You had this proliferation of data and applications that come with that. Um, what are you seeing that big picture wise in terms of what people are doing with their data, how they're trying to access their data, how to turn to drive more value from it and how you serve all those masters, if you will. So there are a few trends that we see these days. One is a, you know, an obvious one that data warehouses are moving to the cloud, right? So, you know, uh, companies used to have, uh, data warehouses on premises and now they're in the cloud. They're, uh, cheaper and um, um, and more scalable, right? With services like a Redshift and snowflake in particular on AWS. Um, and then, uh, another trend is that companies have a lot more applications than they used to. You know, in the, um, in the old days you would have maybe a few data ware, sorry, databases, uh, on premises that you would integrate into your data warehouses. Nowadays you have companies have hundreds or even thousands of applications, um, that effectively become data silos, right? Where, um, uh, analysts are seeing value in that data and they want to want to have access to it. >>So, I mean, ETL is obviously not going away. I mean, it's been here forever and it'll, it'll be here forever. The challenge with ETL has always been it's cumbersome and it's expensive. It's, and now we have this new cloud era. Um, how are you guys changing ETL? >>Yeah. ETL is something that everybody would like to see go away. Everybody would just like, not to do it, but I just want to get access to their data and it should be very unfortunate for you. Right. Well, so we started, uh, we started athlete because we saw that ETL is not going away. In fact, with all the, uh, all these applications and all these needs that analysts have, it's actually becoming a bigger problem than it used to be. Um, and so, uh, what we wanted to do is basically take, take some of that pain out, right? So that companies can get to analyzing their data faster and with less engineering effort. >>Yeah. I mean, you hear this, you know, the typical story is that data scientists spend 80% of their time wrangling data and it's, and it's true in any situation. So, um, are you trying to simplify, uh, or Cloudify ETL? 
And if so, how are you doing that? >>So with, uh, with the growth in the number of data analysts and the number of data analytics projects that companies wants to take on the, the traditional model of having a few engineers that know how to basically make the data available for analysts, that that model is essentially now broken. And so, uh, just like you want to democratize, uh, BI and democratize analytics, you essentially have to democratize ETL as well, right? Basically that process of making the data ready for analysis. And, uh, and that is really what we're doing at athlete. We're, we're opening up ETL to a much broader audience. >>So I'm interested in how I, so I'm in pain. It's expensive. It's time consuming. Help me Christian, how, how can you help me, sir? >>So, so first of all, we're, we're, um, uh, at least specifically we're a hundred percent AWS, so we're deeply focused on, uh, Redshift data warehouses and S3 and good data lakes. Uh, and you know, there's tremendous amount of innovation. Um, those two sort of sets of technologies now, um, Redshift made a bunch of very cool announcements era at AWS reinvent this year. Um, and so what we do is we take the, uh, the infrastructure piece out, you know, so you can deploy athlete as a hosted service, uh, where we manage all the infrastructure for you or you can deploy it within your VPC. Um, again, you know, in a much, much simplified way, uh, compared to a traditional ETL technologies. Um, and then, you know, beyond that taking, uh, building pipelines, you know, building data pipelines used to be something that would take engineers six months to 18 months, something like that. But, um, but now what we, what we see is companies using athlete, they're able to do it much faster often, um, often an hours or days. >>A couple of questions there. So it's exclusively red shift, is that right? Or other analytic databases and make is >>a hundred percent AWS we're deeply focused on, on integrating well with, with AWS technologies and services. So, um, so on the data warehousing side, we support Redshift and snowflake. >>Okay, great. So I was going to ask you if snowflake was part of that. So, well you saw red shift kind of, I sort of tongue in cheek joke. They took a page out of snowflake separating compute and storage that's going to make customers very happen so they get happy. So they can scale that independently. But there's a big trend going on. I wonder if you can address it in your, you were pointing out before that there's more data sources now because of the cloud. We were just having that conversation and you're seeing the data exchange, more data sources, things like Redshift and snowflake, uh, machine intelligence, other tools like Databricks coming in at the Sage maker, a Sage maker studios, making it simpler. So it's just going to keep going faster and faster and faster, which creates opportunities for you guys. So are you seeing that trend? It's almost like a new wave of compute and workload coming into the cloud? >>Yeah, it's, it's super interesting. Companies can now access, um, a lot more data, more varied data, bigger volumes of data that they could before and um, and they want faster access to it, both in terms of the time that it takes to, you know, to, to bite zero, right? Like the time, the time that it takes to get to the first, uh, first analysis. Um, and also, um, and also in terms of the, the, the data flow itself, right? 
They, they not want, um, up to the second or up to the millisecond, um, uh, essentially fresh data, uh, in their dashboards and for interactive analysis. And what about the analytics side of this then when we were talking about, you know, warehousing but, but also having access to it and doing something with it. Um, what's that evolution looking like now in this new world? So lots of, um, lots of new interesting technologies there to, um, um, you know, on the, on the BI side and, um, and our focus is on, on integrating really well with the warehouses and lakes so that those, those BI tools can plug in and, and, um, um, and, and, you know, um, get access to the data straight away. Okay. >>So architecturally, why are you, uh, how are you solving the problem? Why are you able to simplify? I'm presuming it's all built in the cloud. That's been, that's kind of an obvious one. Uh, but I wonder if you could talk about that a little bit because oftentimes when we talk to companies that have started born in the cloud, John furrier has been using this notion of, you know, cloud native. Well, the meme that we've started is you take out the T it cloud native and it's cloud naive. So you're cloud native. Now what happens oftentimes with cloud native guys is much simpler, faster, lower cost, agile, you know, cloud mentality. But maybe some, sometimes it's not as functional as a company that's been around for 40 years. So you have to build that up. What's the state of ETL, you know, in your situation. Can you maybe describe that a little bit? How is it that the architecture is different and how address functionality? >>Yeah, I mean, um, so a couple of things there. Uh, um, you, you mentioned Redshift earlier and how they now announce the separation of storage and compute. I think the same is true for e-tail, right? We can, we can build on, um, on these great services that AWS develops like S three and, and, uh, a database migration service and easy to, um, elastic MapReduce, right? We can, we can take advantage of all these, all these cloud primitives and um, um, and, and so the, the infrastructure becomes operationally, uh, easier that way. Um, and, and less expensive and all, all those good things. >>You know, I wonder, Christian, if I can ask you something, given you where you live in a complicated world, I mean, data's complicated and it's getting more complicated. We heard Andy Jassy on Tuesday really give a message to the, to the enterprise. It wasn't really so much about the startups as it previously been at, at AWS reinvent. I mean, certainly talking to developers, but he, he was messaging CEOs. He had two or three CEOs on stage. But what we're describing here with, with red shift, and I threw in Databricks age maker, uh, elastic MapReduce, uh, your tooling. Uh, we just had a company on that. Does governance and, and builders have to kind of cobble these things together? Do you see an opportunity to actually create solutions for the enterprise or is that antithetical to the AWS cloud model? What, what are your thoughts? >>Oh, absolutely know them. Um, uh, these cloud services are, are fantastic primitives, but um, but enterprises clearly have a lot of, and we, we're seeing a lot of that, right? We started out in venture Bactec and, and, and got, um, a lot of, a lot of venture backed tech companies up and running quickly. 
But now that we're sort of moving up market and, and uh, and into the enterprise, we're seeing that they have a requirements that go way beyond, uh, beyond what, what venture tech, uh, needs. Right. And in terms of security, governance, you know, in, in ETL specifically, right? That that manifests itself in terms of, uh, not allowing data to flow out of, of the, the company's virtual private cloud for example. That's something that's very important in enterprise, a much less important than in, uh, in, in venture-backed tech. Um, data lineage. Right? That's another one. Understanding how data, uh, makes it from, you know, all those sources into the warehouse. What happens along the way. Right. And, and regulated industries in particular, that's very important. >>Yeah. I mean, I, you know, AWS is mindset is we got engineers, we're going to throw engineers at the problem and solve it. Many enterprises look at it differently. We'll pay money to save time, you know, cause we don't have the time. We don't have the resource, I feel like I, I'd like to see sort of a increasing solutions focus. Maybe it's the big SIS that provide that. Now are you guys in the marketplace today? We are. Yup. That's awesome. So how's that? How's that going? >>Yeah. Um, you mean AWS market? Yes. Yes. Uh, yeah, it's, it's um, um, that's definitely one, one channel that, uh, where there's a lot of, a lot of promise I think both. Um, for, for for enterprise companies. Yeah. >>Cause I mean, you've got to work it obviously it doesn't, just the money just doesn't start rolling in you gotta you gotta market yourselves. >>But that's definitely simplifies that, um, that model. Right? So delivering, delivering solutions to the enterprise for sure. So what's down the road for you then, uh, from, from ETL leaps perspectives here or at leaps perspectives. Um, you've talked about the complexities and what's occurred and you're not going away. ETL is here to say problems are getting bigger. What do you see the next year, 12, 18, 24 months as far as where you want to focus on? What do you think your customers are going to need you to focus on? So the big challenge, right is that, um, um, bigger and bigger companies now are realizing that there is a ton of value in their data, in all these applications, right? But in order to, in order to get value out of it, um, you have to put, uh, engineering effort today into building and maintaining these data pipelines. >>And so, uh, so yeah, so our focus is on reducing that, reducing those engineering requirements. Um, right. So that both in terms of infrastructure, pipeline, operation, pipeline setup, uh, and, and those kinds of things. So where, uh, we believe that a lot of that that's traditionally been done with specialized engineering can be done with great software. So that's, that's what we're focused on building. I love the, you know, the company tagged the perfect data pipeline. I think of like the perfect summer, the guy catching a big wave out in Maui or someplace. Good luck on catching that perfect data pipeline you guys are doing. You're solving a real problem regulations. Yeah. Good to meet you. That cause more. We are alive at AWS reinvent 2019 and you are watching the cube.
SUMMARY :
AWS reinvent 2019, brought to you by Amazon web services Inside the sands, we continue our coverage here. Um, what are you seeing that big picture wise in terms of what people are doing how are you guys changing ETL? So that companies can get to analyzing their data faster and with less engineering effort. So, um, are you trying to simplify, And so, uh, just like you want to democratize, uh, Help me Christian, how, how can you help me, sir? Um, and then, you know, beyond that taking, So it's exclusively red shift, is that right? So, um, so on the data warehousing side, we support Redshift and snowflake. So are you seeing that trend? both in terms of the time that it takes to, you know, to, to bite zero, right? born in the cloud, John furrier has been using this notion of, you know, you mentioned Redshift earlier and how they now announce the separation of storage and compute. Do you see an opportunity to actually create Understanding how data, uh, makes it from, you know, all those sources into the warehouse. time, you know, cause we don't have the time. it's um, um, that's definitely one, one channel that, uh, where there's a lot of, So what's down the road for you then, uh, from, from ETL leaps perspectives I love the, you know, the company tagged the perfect data pipeline.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave Volante | PERSON | 0.99+ |
two | QUANTITY | 0.99+ |
Christian Rahman | PERSON | 0.99+ |
John Walls | PERSON | 0.99+ |
Christian Romming | PERSON | 0.99+ |
80% | QUANTITY | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Andy Jassy | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Las Vegas | LOCATION | 0.99+ |
Tuesday | DATE | 0.99+ |
six months | QUANTITY | 0.99+ |
hundreds | QUANTITY | 0.99+ |
Maui | LOCATION | 0.99+ |
Sage | ORGANIZATION | 0.99+ |
first | QUANTITY | 0.99+ |
LA | LOCATION | 0.99+ |
18 months | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
Databricks | ORGANIZATION | 0.99+ |
thousands | QUANTITY | 0.99+ |
next year | DATE | 0.98+ |
first analysis | QUANTITY | 0.98+ |
18 | QUANTITY | 0.98+ |
one | QUANTITY | 0.98+ |
today | DATE | 0.98+ |
Redshift | TITLE | 0.98+ |
S three | TITLE | 0.97+ |
Bactec | ORGANIZATION | 0.97+ |
John furrier | PERSON | 0.97+ |
zero | QUANTITY | 0.96+ |
Thursday | DATE | 0.95+ |
12 | QUANTITY | 0.95+ |
hundred percent | QUANTITY | 0.95+ |
this year | DATE | 0.95+ |
Wednesday | DATE | 0.94+ |
applications | QUANTITY | 0.94+ |
second | QUANTITY | 0.94+ |
one channel | QUANTITY | 0.93+ |
S3 | TITLE | 0.91+ |
ETL | TITLE | 0.9+ |
Sage maker | ORGANIZATION | 0.9+ |
24 months | QUANTITY | 0.9+ |
day three | QUANTITY | 0.9+ |
red shift | TITLE | 0.89+ |
two sort | QUANTITY | 0.88+ |
three CEOs | QUANTITY | 0.87+ |
an | QUANTITY | 0.86+ |
Etleap | PERSON | 0.83+ |
venture | ORGANIZATION | 0.83+ |
ETL | ORGANIZATION | 0.81+ |
40 years | QUANTITY | 0.81+ |
MapReduce | ORGANIZATION | 0.79+ |
2019 | DATE | 0.78+ |
couple | QUANTITY | 0.78+ |
Redshift | ORGANIZATION | 0.77+ |
snowflake | TITLE | 0.77+ |
One | QUANTITY | 0.73+ |
Cloudify | TITLE | 0.67+ |
Christian | ORGANIZATION | 0.67+ |
2019 | TITLE | 0.54+ |
Invent | EVENT | 0.47+ |
Tracey Newell, Informatica | Informatica World 2019
>> Live from Las Vegas, it's theCUBE. Covering Informatica World 2019. Brought to you by Informatica. >> Welcome back, everyone, to theCUBE's live coverage of Informatica World 2019. I'm your host Rebecca Knight, along with my co-host John Furrier. We are joined by Tracey Newell, she is the President Global Field Operations at Informatica. Thank you so much for coming on theCUBE, for coming back on theCUBE. >> Coming back on theCUBE, it's great to be here. >> So the last time you were on, you had just taken over as the president of Global Field Operations. Give our viewers a catch up on exactly what you've been doing over these past two years, and what the journey's been like. >> Yeah, no that's great, thanks so much. As a reminder the last time we were together, I had just joined the company. I was literally two weeks in, and yet I actually did join Informatica three years ago. So I joined on the board of directors, and I was on the board for two years, and the company was doing so extremely well that after a couple of years we all agreed that I would step off the board and join the management team. >> I got to get in on this! >> I know, exactly. I've got to get off the sidelines and get into the game. >> Both sides of the table, literally. >> Exactly. >> So that's really interesting that you were on the board watching this growth and seeing, obviously participating in it, too, as a board member, but then you said, "I want to be here, I want to be doing this." What was it about the opportunity that so excited you that you felt that way? >> Well, it's funny, because when I did join the management team I spent two months on a listening tour, and the first question from all the employees and our partners was, "Why'd you do that?" Usually it goes the other way around, you go from the management team to the board. And the answer was really simple in that my hypothesis in joining the board was that digital transformation is an enterprise board of director's decision, that governments and large organizations are trying to figure this out with the CEO, the board, the management team, because it's critical, and yet it's also really hard. It's complicated, the data is everywhere. And so when you have something that's important and really complicated, you need a thought leader. And so my belief was that Informatica should be that thought leader. And two years in we were doing so phenomenally well with the platform play that we had been driving from an R&D standpoint, it just seemed like such an amazing opportunity to literally get off the sidelines and get into the game. And it's just been fabulous. >> And you have experience, obviously, doing field organizations so you've been there, done that. Also you have some public sector experience, so also being on the board was a time when Informatica went private. And that was a good call because they don't have to deal with the shot clock of the public markets and doing all those mandatory filings, and a lot of energy, management energy goes into being public company. >> That's right. >> At the time where they could get the product development and reposition some of the assets, and the thing that was interesting with you guys, they had customers already. So they didn't have to go out and get new customers to test new theses. >> That's right. >> They had existing customers. >> Oh no, we serve the biggest companies and governments on the planet. Globally, a very large percentage of the global 2000, is kind of our sweet spot. 
And yet thousands and thousands of customers in the mid market. And so to your point, John, exactly we had built out this platform that included all things on-premise, we're almost synonymous, PowerCenter and ETL, that's kind of been our sweet spot. And MDM data quality, but adding in all of the focus on big data, all the area of IPAAS, all the work that everybody's doing with AWS, with Azure, with Salesforce.com, with Google Cloud, and suddenly we've got this platform play, backed by AI and machine learning, and it's a huge differentiator. >> So you've seen a lot of experience, again you worked in the industry for a long time, you know what the field playbook is, VCs say the enterprise playbook. It's changing, though, you're seeing some shifts and Bruce Chizen was talking to me yesterday about this, there's a shift back to technology advantage and openness. It used to be technology advantage, protect it, that's your competitive advantage, hold it, lock in, but it's changing from that to technology, but open. This is the new equation, what's your take on that? >> Our strategy's been really simple, that we want to be best of breed in everything that we do. And Gartner seems to agree with us. In all five categories we play in we are up and to the right. And yet we want you to get a benefit that if you do decide to buy one product, and then add a second, or a third, or a fourth family, you're going to get the benefit of all that being backed by a platform play, and by AI and machine learning. And so this concept of we'll work with everybody, a customer called us Switzerland of Data, and that's certainly true, we partner with everybody. Where you do see synergies to leverage your entire data platform, you're going to get a real advantage that no one else will have. >> You've got a lot of customers, this is a very intimate conference here at Informatica, this is our fourth year covering it, it's been great to watch the journey, but also the evolution and the tailwinds you guys have. What are some of the customer conversations you're having? You're in all the top meetings here, I know you guys are busy running around, I see you doing meetings and the whole team's here. What are some of the top-level priorities and challenges and opportunities that your customers have? >> We literally have thousands of people at the conference here as you know, and it's just been phenomenal. So I've been in back-to-back meetings, meeting with some of the largest companies in retail that are trying to figure out, "How do I serve my customer base online?" "And yet when they walk into one of my stores, "I want to know that. "My salesperson needs to know exactly what that person's "been shopping for, and looking on the Internet for, "if they're on my site, "or perhaps what they've been tweeting about." So they want to know everything about their customer that there is to know. The banks want to know who their high wealth clients are. And hey want to make sure that if they call in on a checking account and have a bad customer service experience, they want to know that. If it's a hospitality company, they want to understand what's going on every time you check into a hotel. If you looked for a quote and you don't actually follow through, they want to understand that. And so there's this theme of understanding everything that there is to know about a customer. And yet at the same time, a huge requirement for governance, in the California Privacy Act, the CCPA and GDPR are changing everything. 
I had a large bank once say, and this was years ago, "How can I forget you?" Which is what GDPR says I have the right, you have the right to be forgotten in Europe. How can I forget you if I don't know who you are? Again that's because data's everywhere, and again we're enabling that, so it's a pretty exciting time. It literally is about companies transforming themselves. >> I remember the industry when search engines came out, when the web came out, you had Google and those greenfield opportunities, they were excellent, you type in a keyword and you get results. When people tried to do enterprise search, it was like all these different databases, so you had constraints and you had legacy. Similar today, right? So how has that changed? What's different about it now? And again you had compliance and regulation coming over the top. How does an enterprise unlock those constraints? >> It's funny, you say unlock the power of data is one of our catchphrases. I'm meeting with CIOs around the planet who sound like they're CMOs, because they're using these phrases. They're saying things like, "I need to disrupt myself before someone disrupts me." Or there was one, it was a large oil and energy, it was a CIO at this massive company said, "Data's the new goldmine, and I need a shovel." So they're using these phrases, and to your point, how do you do that? Again, we do think it is about getting the right platform that plays both on-premise and ties in everything the customers are doing in cloud. So we see partnerships as being critical here. But at the same time, one of our fastest growing solutions has been our enterprise data catalog, which is operating at the metadata level. My peer in products Amit Walia likes to say, "How come you can ask the Internet anything at all?" You're so used to it, when your kids ask you a question, you just get online, I don't know, and get the answer. But you can't do that in your own enterprise. And suddenly, because of what we're doing at the metadata level working with all of the different companies around the globe through open APIs, you can now do that inside your enterprise, and that is really unlocking the capabilities for companies to run their businesses. >> You're giving us so much great insight into the kinds of conversations you're having about this deep desire to know the customer and understand his wants and needs at every moment. And yet the technology is so often the easy part, and the hard part of the implementation are the people and the processes. Can you talk a little bit about the stumbling blocks and the challenges that you're seeing with customers as they are embarking on their digital transformations? >> That's a great question. Because one of the things that I caution our clients about is companies get so focused on, I've got to pick the right technology. And we agree with that, again, that's why we focus so much, we've got to be best in breed in every decision. We're not going to lock you into something that doesn't make sense. And yet half of the battle, if you would, in these projects, it's not about the technology, it's a people/process issue. So think about to have a comprehensive view of your data, if you're a large CPG company or a large bank, you might have 10 CIOs, 50 CIOs. We have customers that have 10 ERP systems, we have folks that talk about 50 ERP systems. These are very cross functional, complex projects, and so our focus is on customer success and customer for life. 
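The enterprise data catalog point above, being able to ask your own enterprise a question the way you ask the Internet, comes down to searching metadata through an API. The sketch below is illustrative only: the endpoint path, parameters, and response fields are assumptions made for the example, not Informatica's documented interface.

```python
# Illustrative only: a metadata-level "search your enterprise" call against a
# data catalog's REST API. The endpoint path, query parameters, and response
# fields are assumptions for this sketch, not a documented product API.
import requests

def search_catalog(base_url: str, query: str, token: str, limit: int = 10):
    """Return catalog assets whose metadata matches a free-text query."""
    resp = requests.get(
        f"{base_url}/catalog/api/v1/search",   # hypothetical endpoint
        headers={"Authorization": f"Bearer {token}"},
        params={"q": query, "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("assets", [])       # assumed response shape

if __name__ == "__main__":
    for asset in search_catalog("https://catalog.example.com", "customer churn", "TOKEN"):
        # "name" and "owner" are assumed fields used only for illustration.
        print(asset.get("name"), "-", asset.get("owner"))
```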
I have more people in customer success than I do in sales by design. Literally thousands of people around the world, this is all that we do, that are focused on business outcomes. And so we really give an extra guarantee, if you would, to our customers to make sure they know that we're in this to make sure that they're successful, and when we start running into challenges, we're going to raise those high so that both organizations can make sure that we get to that promise that everybody is committed to. >> Talk about the ecosystem, because you continue to get success with the catalog, which is looking good. Great that, by the way, we covered that on theCUBE, I remember those conversations like it was yesterday. That really enables a lot, so you're seeing some buzz here around obviously the big clouds, the Google announcement, Amazon, and Microsoft are all here, on-premise, you've got that covered. But the ecosystem partners have a huge economic opportunity, because with the value proposition that you guys are putting forth that's rolling out with a huge customer base, the value-to-economic shift has changed, so that the economics are changing for the better for the customer and the value's increasing. That's kind of an Amazon-like effect if you think about that flywheel. That's attracting a lot of people in to your ecosystem because there's a money making opportunity. >> That's right. >> Talk about that dynamic. >> It's been humbling. I'm really pleased with Informatica World and how things are shaping up because we've had some amazing speakers here as you mentioned, from Amazon, Thomas Crane here from Google Cloud, AWS sending their CMO. It's just been a phenomenal event, yet if you go to the show for literally dozens and dozens and dozens of other providers that are critical to our customers that we want to partner with. When we say partner, we actually do deep R&D together so that there's a true value proposition where the customer gets more and a better-together solution when they choose Informatica and their critical partners. There's another category of partners that I think you're hinting at which is the large GSIs. >> The global system integrators, yeah. >> The global systems integrators. >> Accenture, Deloitte. >> Accenture, Deloitte, Cognizant have been phenomenal partners to us. And so again, when you talk about this being a board level discussion, which literally I've met with so many CIOs who say, "I just presented to my board last week, "let me tell you about this journey that we're on." Of course the large global system integrators are in the middle of that and we are very clear, we don't want to compete with those folks that are so good at both the vision and also really good in arms and legs and execution to help drive massive workflow change for our clients. So we work together brilliantly with those folks. >> And these are meaty projects, too, so it's not like they're used to, back in the old days when these projects were massive, rolling out these big ERP systems, the CRMs, back when people were instrumenting their operation of businesses. Similar now with data, these are massive, lucrative, profitable opportunities. >> These are really strategic for the client, the global system integrator, and for us for all of the same reasons. This drives massive change in a good way for our clients to keep ahead of whoever's nipping at their heels, but certainly it's a tremendous services opportunity for the large integrators, there's no question. >> Being humble. 
>> One of the things that's really coming through here is Informatica's commitment to solving the skills gap, especially with the Next 25 program, and this is something your company's being really thoughtful about. I'm interested from your perspective, particularly as somebody who's been in the technology industry and was on the board for a while, how do you see the skills gap and what the technology industry is doing as a whole to combat it? And then your advice from your vantage point in terms of what you think are the next things that kids should be studying in schools? >> This reminds me, and Furrier, you're talking about the old days, so I'm going to date myself, it reminds me a lot of when the Internet first started to occur. This is a very similar type change. People have been, companies have been trying to make these changes and they're starting to realize that it does start, they've got to have a good grasp of the data in order to run all of these strategic initiatives that they've got. And so it's tremendous opportunity, to your point, for young people. So how do we think about that? Certainly we do our fair share of hiring interns trying to get them early in life, when they're sophomores, juniors coming into senior year and then hiring those folks. So we see an opportunity for our own company to bring in those young people, if you would. And then the GSIs, the global systems integrators, we partner quite a bit with them, because we see them as massive scalers, they have-- >> How about people specialize in majors, any areas of interest that someone might want to specialize in to be a great contributor in the data world? Obviously stats and math are clear on machine learning and that side. But there's affects, there's societal, business outcome challenges that have not yet been figured out. What areas do you see that someone can go after, have a career around? >> So it literally is a business and a technical problem that we're solving, and so there's going to be career opportunities for everyone that's in school. Whether it be on the business side, whether it's business management, marketing, sales, because again think about when you talk about change of management, it is a CMO trying to rethink how do they reach their clients. It is a sales leader thinking, "How do I get better analytics as to what's working "and what's not working?" And then of course it crosses over into computer science and engineering, as well, where you're actually developing these products, and developing these AI applications that are just beginning to take off. But it's in the early days, so for young folks coming out of schools this is a tremendous opportunity. >> Well, next you'll have to find what's up with the field, and your customers, and then next year, next event. >> Yeah, I can't wait, it's great. I've really enjoyed spending time with you all, and we look forward to seeing you soon. >> Indeed, well thank you so much for coming on theCUBE, Tracey. >> Okay, thank you. >> Thank you. I'm Rebecca Knight, for John Furrier, you've been watching theCUBE's live coverage of Informatica World, stay tuned. (upbeat music)
SUMMARY :
Tracey Newell, President of Global Field Operations at Informatica, joins theCUBE at Informatica World 2019. She recounts moving from Informatica's board of directors onto the management team, and frames digital transformation as a board-level decision that needs a thought leader. She describes the company's best-of-breed platform strategy backed by AI and machine learning, customer conversations about knowing everything about the customer while meeting CCPA and GDPR obligations, the fast-growing enterprise data catalog, the people-and-process side of transformation projects, partnerships with the global system integrators, and efforts such as the Next 25 program to close the data skills gap.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Rebecca Knight | PERSON | 0.99+ |
Tracey Newell | PERSON | 0.99+ |
John | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
John Furrier | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Europe | LOCATION | 0.99+ |
Informatica | ORGANIZATION | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Deloitte | ORGANIZATION | 0.99+ |
Bruce Chizen | PERSON | 0.99+ |
California Privacy Act | TITLE | 0.99+ |
thousands | QUANTITY | 0.99+ |
two months | QUANTITY | 0.99+ |
Accenture | ORGANIZATION | 0.99+ |
Tracey | PERSON | 0.99+ |
two years | QUANTITY | 0.99+ |
dozens | QUANTITY | 0.99+ |
Gartner | ORGANIZATION | 0.99+ |
yesterday | DATE | 0.99+ |
two weeks | QUANTITY | 0.99+ |
last week | DATE | 0.99+ |
first question | QUANTITY | 0.99+ |
Thomas Crane | PERSON | 0.99+ |
three years ago | DATE | 0.99+ |
ORGANIZATION | 0.99+ | |
next year | DATE | 0.99+ |
Las Vegas | LOCATION | 0.99+ |
50 CIOs | QUANTITY | 0.99+ |
IPAAS | TITLE | 0.99+ |
one product | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
10 CIOs | QUANTITY | 0.99+ |
fourth year | QUANTITY | 0.99+ |
10 ERP systems | QUANTITY | 0.99+ |
five categories | QUANTITY | 0.98+ |
Both sides | QUANTITY | 0.98+ |
GDPR | TITLE | 0.97+ |
One | QUANTITY | 0.97+ |
today | DATE | 0.97+ |
Global Field Operations | ORGANIZATION | 0.97+ |
Amit Walia | PERSON | 0.97+ |
both | QUANTITY | 0.97+ |
theCUBE | ORGANIZATION | 0.97+ |
both organizations | QUANTITY | 0.96+ |
Cognizant | ORGANIZATION | 0.96+ |
fourth family | QUANTITY | 0.95+ |
first | QUANTITY | 0.94+ |
thousands of people | QUANTITY | 0.94+ |
CCPA | TITLE | 0.94+ |
Azure | TITLE | 0.93+ |
2019 | DATE | 0.91+ |
thousands of customers | QUANTITY | 0.89+ |
Informatica World | ORGANIZATION | 0.89+ |
World | TITLE | 0.89+ |
Switzerland | ORGANIZATION | 0.86+ |
about 50 ERP systems | QUANTITY | 0.84+ |
Informatica World 2019 | EVENT | 0.84+ |
years | DATE | 0.8+ |
playbook | TITLE | 0.8+ |
President | PERSON | 0.76+ |
president | PERSON | 0.75+ |
Furrier | ORGANIZATION | 0.75+ |
a second | QUANTITY | 0.71+ |
couple of years | QUANTITY | 0.7+ |
Randy Mickey, Informatica & Charles Emer, Honeywell | Informatica World 2019
>> Live from Las Vegas, it's theCUBE, covering Informatica World 2019. Brought to you by Informatica. >> Welcome back, everyone, to theCUBE's live coverage of Informatica World 2019. I'm your host, Rebecca Knight, along with my cohost, John Furrier. We have two guests for this segment. We have Charlie Emer. He is the senior director data management and governance strategy at Honeywell. Thanks for joining us. >> Thank you. >> And Randy Mickey, senior vice president global professional services at Informatica. Thanks for coming on theCUBE. >> Thank you. >> Charlie, I want to start with you. Honeywell is a household name, but tell us a little bit about the business now and about your role at Honeywell. >> Think about it this way. When I joined Honeywell, even before I knew Honeywell, all I thought was thermostats. That's what people would think about Honeywell. >> That's what I thought. >> But Honeywell's much bigger than that. Look, if you go back to the Industrial Revolution, back in, I think, '20s, we talked about new things. Honeywell was involved from the beginning making things. But we think this year and moving forward in this age, Honeywell is looking at it as the new Industrial Revolution. What is that? Because Honeywell makes things. We make aircraft engines, we make aircraft parts. We make everything, household goods, sensors, all types of sensors. We make things. So when we say the new Industrial Revolution is about the Internet of Things, who best to participate because we make those things. So what we are doing now is what we call IIOT, Industrial Internet of Things. Now, that is what Honeywell is about, and that's the direction we are heading, connecting those things that we make and making them more advancing, sort of making life easier for people, including people's quality of life by making those things that we make more usable for them and durable. >> Now, you're a broad platform customer of Informatica. I'd love to hear a little bit from both of you about the relationship and how it's evolved over the years. >> Look, we look at Informatica as supporting our fundamentals, our data fundamentals. For us to be successful in what we do, we need to have good quality data, well governed, well managed, and secure. Not only that, and also accessible. And we using Informatica almost end to end. We are using Informatica for our data movement ETL platform. We're using Informatica for our data quality. We're using Informatica for our master data management. And we have Informatica beginning now to explore and to use Informatica big data management capabilities. And more to that, we also utilize Informatica professional services to help us realize those values from the platforms that we are deploying. IIoT, Industrial IoT has really been a hot trend. Industrial implies factories building big things, planes, wind farms, we've heard that before. But what's interesting is these are pre-existing physical things, these plants and all this manufacturing. When you add digital connectivity to it and power, it's going to change what they were used to be doing to new things. So how do you see Industrial IoT changing or creating a builder culture of new things? Because this connect first, got to have power and connectivity. 5G's coming around, Wi-Fi 6 is around the corner. This is going to light up all these devices that might have had battery power or older databases. What's the modernization of these industrial environments going to look like in your view? 
First of all, let me give you an example of the value that is coming with this connectivity. Think of it, if you are an aircraft engineer. Back in the day, a plane landed in Las Vegas. You went and inspected it, physically, and checked in your manual when to replace a part. But now Honeywell is telling you, we're connecting directly to the mechanic who is going to inspect the plane, and there will be sort of in their palms they can see and say wait a minute. This part, one more flight and I should replace this part. Now, we are advising you now, doing some predictive analytics, and telling you when this part could even fail. We're telling you when to replace it. So we're saying okay, the plane is going to fly from here to California. Prepare the mechanics in California when it lands with the part so they can replace it. That's already safety 101. So guaranteeing safety, sort of improving the equity or the viability of the products that we produce. When we're moving away from continue to build things because people still need those things built, safety products, but we're just making them more. We've heard supply chain's a real low-hanging fruit on this, managing the efficiency so there's no waste. Having someone ready at the plane is efficient. That's kind of low-hanging fruit. Any ideas on some of the creativity of new applications that's going to come from the data? Because now you start getting historical data from the connections, that's where I think the thing can get interesting here. Maybe new jobs, new types of planes, new passenger types. >> We are not only using the data to improve on the products and help us improve customer needs, design new products, create new products, but we also monitorizing that data, allowing our partners to also get some insights from that data to develop their own products. So creating sort of an environment where there is a partnership between those who use our products. And guess what, most of the people who use our products, our products actually input into their products. So we are a lot more business-to-business company than a B2C. So I see a lot of value in us being able to share that intelligence, that insight, in our data at a level of scientific discovery for our partners. >> Randy, I want to bring you into the conversation a little bit here (laughs). >> Thanks. >> So you lead Informatica's professional services. I'm interested to hear your work with Honeywell, and then how it translates to the other companies that you engage with. Honeywell is such a unique company, 130 years of innovation, inventor of so many important things that we use in our everyday lives. That's not your average company, but talk a little bit about their journey and how it translates to other clients. >> Sure, well, you could tell, listening to Charlie, how strategic data is, as well as our relationship. And it's not just about evolution from their perspective, but also you mentioned the historicals and taking advantage of where you've been and where you need to go. So Charlie's made it very clear that we need to be more than just a partner with products. We need to be a partner with outcomes for their business. So hence, a professional services relationship with Honeywell and Charlie and the organization started off more straightforward. You mentioned ETL, and we started off 2000, I believe, so 19 years ago. So it's been a journey already, and a lot more to go. 
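As a purely illustrative aside, the part-replacement scenario described above ("one more flight and I should replace this part") boils down to comparing accumulated usage against a life limit and alerting the destination station in time to stage the spare. The sketch below is not Honeywell's system; the part fields, cycle limits, and threshold are invented for the example.

```python
# Minimal sketch of a usage-versus-life-limit check for a connected aircraft part.
# All data, field names, and limits are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PartStatus:
    part_id: str
    cycles_used: int          # flight cycles accumulated so far
    rated_cycles: int         # conservative life limit for the part
    cycles_per_flight: int    # expected cycles consumed by the next flight leg

def flights_remaining(part: PartStatus) -> int:
    """How many more flights the part can safely fly before replacement."""
    remaining_cycles = part.rated_cycles - part.cycles_used
    return max(remaining_cycles // part.cycles_per_flight, 0)

def staging_alert(part: PartStatus, destination: str) -> Optional[str]:
    """Tell the destination station to stage a spare once the margin is gone."""
    if flights_remaining(part) <= 1:
        return f"Stage spare for {part.part_id} at {destination} before next departure"
    return None

if __name__ == "__main__":
    pump = PartStatus(part_id="fuel-pump-17", cycles_used=4997,
                      rated_cycles=5000, cycles_per_flight=2)
    # Only about one flight of margin remains, so an alert is produced.
    print(staging_alert(pump, "LAX"))
```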
But over the years you can kind of tell, using data in different ways within the organization, delivering business outcomes has been at the forefront, and we're viewed strategically, not just with the products, but professional services as well, to make sure that we can continue to be there, both in an advisory capacity, but also in driving the right outcomes. And something that Charlie even said this morning was that we were kind of in the fabric. We have a couple of team members that are just like Honeywell team members. We're in the fabric of the organization. I think that's really critically important for us to really derive the outcomes that Charlie and the business need. >> And data is so critical to their business. You have to be, not only from professional services, but as a platform. Yes. This is kind of where the value comes from. Now, I can't help but just conjure up images of space because I watch my kids that watch, space is now hot. People love space. You see SpaceX landing their rocket boosters to the finest precision. You got Blue Origin out there with Amazon. And they are Honeywell sensors either. Honeywell's in every manned NASA mission. You have a renaissance of activity going on in a modern way. This is exciting, this is critical. Without data, you can't do it. >> Absolutely, I mean, also sometimes we take a break. I'm a fundamentalist. I tell everybody that excitement is great, but let's take a break. Let's make sure the fundamentals are in place. And we actually know what is it, what are those critical data that we need to be tracking and managing? Because you don't just have to manage a whole world of data. There's so much of it, and believe me, there's not all value in everything. You have to be critical about it and strategic about it. What are the critical data that we need to manage, govern, and actually, because it's expensive to manage the critical data. So we look at a value tree as well, and say, okay, if we, as Honeywell, want to be able to be also an efficient business enabler, we have to be efficient inside. So there's looking out, and there's also looking inside to make sure that we are in the right place, we are understanding our data, our people understand data. Talking about our relationship with IPS, Informatica Professional Services, one of the things that we're looking at is getting the right people, the engineers, the people to actually realize that okay, we have the platform, we've heard of Clare, We heard of all those stuff. But where are the people to actually go and do the real stuff, like actually programming, writing the code, connecting things and making it work? It's not easy because the technology's going faster than the capabilities in terms of people, skills. So the partnership we're building with Informatica professional services, and we're beginning to nurture, inside that, we want to be in a position were Honeywell doesn't have to worry so much about the churn in terms of getting people and retraining and retraining and retraining. We want to have a reliable partner who is also moving with the certain development and the progress around the products that we bought so we can have that success. So the partnership with IPS is for the-- >> The skill gaps we've been talking about, I know she's going to ask next, but I'll just jump in because I know there's two threads here. One is there's a new generation coming into the workforce, okay, and they're all data-full. They've been experiencing the digital lifestyle, the engineering programs. 
To data, it's all changing. What are some of the new expertise that really stand out when evaluating candidates, both from the Informatica side and also Honeywell? What's the ideal candidate look like, because there's no real four-year degree anymore? Well, Berkeley just had their first class of data analytics. That's new two-generation. But what are some of those skills? There's no degree out there. You can't really get a degree in data yet. >> Do you want to talk about that? >> Sure, I can just kick off with what we're looking at and how we're evolving. First of all, the new graduates are extremely innovative and exciting to bring on. We've been in business for 26 years, so we have a lot of folks that have done some great work. Our retention is through the roof, so it's fun to meld the folks that have been doing things for over 10, 15 years, to see what the folks have new ideas about how to leverage data. The thing I can underscore is it's business and technology, and I think the new grads get that really, really well in terms of data. To them, data's not something that's stored somewhere in the cloud or in a box. It's something that's practically applied for business outcomes, and I think they get that right out of school, and I think they're getting that message loud and clear. Lot of hybrid programs. We do hire direct from college, but we also hire experienced hires. And we look for people that have had degrees that are balanced. So the traditional just CS-only degrees, still very relevant, but we're seeing a lot of people do hybrids because they know they want to understand supply chain along with CS and data. And there are programs around just data, how organizations can really capitalize on that. >> And also we're hearing, too, that having domain expertise is actually just as important as having the coding skills because you got to know what an outcome looks like before you collect the data. You got to know what checkmate is if you're going to play chess. That's the old expression, right? >> I think people with the domain, both the hybrid experience or expertise, are more valuable to the company because maybe from the product perspective, from building products, you could be just a scientist, code the code. But when you come to Honeywell, for example, we want you to be able to understand, think about materials. Want you to be able to understand what are the products, what are the materials that we use. What are the inputs that we have to put into these products? Now a simple thing like a data scientist deciding what the right correct value of what an attribute should be, that's not something that because you know code you can determine. You have to understand the domain, the domain you're dealing with. You have to understand the context. So that comes, the question of context management, understanding the context and bringing it together. That is a big challenge, and I can tell you that's a big gap there. >> Big gap indeed, and understand the business and the data too. >> Yes. >> Charles, Randy, thank you both so much for coming on theCUBE. It's been a great conversation. >> Thank you. >> Thank you. >> I'm Rebecca Knight for John Furrier. You are watching theCUBE. (funky techno music)
SUMMARY :
Charles Emer, senior director of data management and governance strategy at Honeywell, and Randy Mickey, senior vice president of global professional services at Informatica, discuss Honeywell's move into the Industrial Internet of Things: connected products, predictive maintenance for aircraft parts, and sharing data insights with partners. They describe a professional services relationship that began around 2000, the focus on governing the critical data rather than all data, the joint effort to staff skilled engineers, and the mix of domain knowledge and technical skills they look for in new data talent.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Rebecca Knight | PERSON | 0.99+ |
Charlie Emer | PERSON | 0.99+ |
Honeywell | ORGANIZATION | 0.99+ |
California | LOCATION | 0.99+ |
Randy Mickey | PERSON | 0.99+ |
Randy | PERSON | 0.99+ |
John Furrier | PERSON | 0.99+ |
Charles | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Charlie | PERSON | 0.99+ |
Informatica | ORGANIZATION | 0.99+ |
26 years | QUANTITY | 0.99+ |
Las Vegas | LOCATION | 0.99+ |
two guests | QUANTITY | 0.99+ |
two threads | QUANTITY | 0.99+ |
130 years | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
four-year | QUANTITY | 0.99+ |
SpaceX | ORGANIZATION | 0.99+ |
2000 | DATE | 0.98+ |
One | QUANTITY | 0.98+ |
Charles Emer | PERSON | 0.98+ |
NASA | ORGANIZATION | 0.98+ |
IPS | ORGANIZATION | 0.98+ |
Clare | PERSON | 0.98+ |
IIOT | ORGANIZATION | 0.98+ |
one | QUANTITY | 0.98+ |
this year | DATE | 0.97+ |
First | QUANTITY | 0.97+ |
theCUBE | ORGANIZATION | 0.97+ |
Berkeley | ORGANIZATION | 0.97+ |
19 years ago | DATE | 0.96+ |
Charlie | ORGANIZATION | 0.96+ |
two-generation | QUANTITY | 0.95+ |
2019 | DATE | 0.94+ |
over 10, 15 years | QUANTITY | 0.94+ |
Industrial Revolution | EVENT | 0.94+ |
ETL | ORGANIZATION | 0.94+ |
one more flight | QUANTITY | 0.92+ |
first | QUANTITY | 0.88+ |
Informatica Professional Services | ORGANIZATION | 0.88+ |
Informatica World 2019 | EVENT | 0.87+ |
this morning | DATE | 0.85+ |
first class | QUANTITY | 0.84+ |
Blue Origin | ORGANIZATION | 0.83+ |
Informatica World | ORGANIZATION | 0.83+ |
Tendü Yogurtçu, Syncsort | DataWorks Summit 2018
>> Live from San Jose, in the heart of Silicon Valley, It's theCUBE, covering DataWorks Summit 2018. Brought to you by Hortonworks. >> Welcome back to theCUBE's live coverage of DataWorks here in San Jose, California, I'm your host, along with my cohost, James Kobielus. We're joined by Tendu Yogurtcu, she is the CTO of Syncsort. Thanks so much for coming on theCUBE, for returning to theCUBE I should say. >> Thank you Rebecca and James. It's always a pleasure to be here. >> So you've been on theCUBE before and the last time you were talking about Syncsort's growth. So can you give our viewers a company update? Where you are now? >> Absolutely, Syncsort has seen extraordinary growth within the last the last three year. We tripled our revenue, doubled our employees and expanded the product portfolio significantly. Because of this phenomenal growth that we have seen, we also embarked on a new initiative with refreshing our brand. We rebranded and this was necessitated by the fact that we have such a broad portfolio of products and we are actually showing our new brand here, articulating the value our products bring with optimizing existing infrastructure, assuring data security and availability and advancing the data by integrating into next generation analytics platforms. So it's very exciting times in terms of Syncsort's growth. >> So the last time you were on the show it was pre-GT prop PR but we were talking before the cameras were rolling and you were explaining the kinds of adoption you're seeing and what, in this new era, you're seeing from customers and hearing from customers. Can you tell our viewers a little bit about it? >> When we were discussing last time, I talked about four mega trends we are seeing and those mega trends were primarily driven by the advanced business and operation analytics. Data governance, cloud, streaming and data science, artificial intelligence. And we talked, we really made a lot of announcement and focus on the use cases around data governance. Primarily helping our customers for the GDPR Global Data Protection Regulation initiatives and how we can create that visibility in the enterprise through the data by security and lineage and delivering trust data sets. Now we are talking about cloud primarily and the keynotes, this event and our focus is around cloud, primarily driven by again the use cases, right? How the businesses are adopting to the new era. One of the challenges that we see with our enterprise customers, over 7000 customers by the way, is the ability to future-proof their applications. Because this is a very rapidly changing stack. We have seen the keynotes talking about the importance of how do you connect your existing infrastructure with the future modern, next generation platforms. How do you future-proof the platform, make a diagnostic about whether it's Amazon, Microsoft of Google Cloud. Whether it's on-premise in legacy platforms today that the data has to be available in the next generation platforms. So the challenge we are seeing is how do we keep the data fresh? How do we create that abstraction that applications are future-proofed? Because organizations, even financial services customers, banking, insurance, they now have at least one cluster running in the public cloud. And there's private implementations, hybrid becomes the new standard. 
So our focus and most recent announcements have been around really helping our customers with real-time, resilient change data capture, keeping the data fresh, feeding into the downstream applications with the streaming and messaging frameworks, for example Kafka, Amazon Kinesis, as well as keeping the persistent stores, the Data Lake on-premise and in the cloud, fresh. >> That puts you into great alignment with your partner Hortonworks so, Tendu, I wonder, since we are here at DataWorks, it's Hortonworks' show, if you can break out for our viewers, what is the nature, the levels of your relationship, your partnership with Hortonworks and how the Syncsort portfolio plays with HDP 3.0, with Hortonworks DataFlow and the DataPlane services at a high level. >> Absolutely, so we have been a longtime partner with Hortonworks and a couple of years back, we strengthened our partnership. Hortonworks is reselling Syncsort and we have actually a prescriptive solution for Hadoop and ETL onboarding in Hadoop jointly. And it's very complementary, our strategy is very complementary because what Hortonworks is trying to achieve is creating that abstraction and future-proofing and interaction consistency, as was referred to this morning, across the platform, whether it's on-premise or in the cloud or across multiple clouds. We are providing the data application layer consistency and future-proofing on top of the platform, leveraging the tools in the platform for orchestration, integrating with HDP, certifying with Ranger on HDP, all of the tools, DataFlow, and Atlas of course for lineage. >> The theme of this conference is ideas, insights and innovation and as a partner of Hortonworks, can you describe what it means for you to be at this conference? What kinds of community and deepening existing relationships, forming new ones. Can you talk about what happens here? >> This is one of the major events around data and it's DataWorks as opposed to being more specific to Hadoop itself, right? Because the stack is evolving and data challenges are evolving. For us, it means really the interactions with the customers, the organizations and the partners here. Because the dynamics of the use cases are also evolving. For example, Data Lake implementations started in the U.S., and we started to see more European organizations moving to streaming, data streaming applications, faster than the U.S. >> Why is that? >> Yeah. >> Why are Europeans moving faster to streaming than we are in North America? >> I think a couple of different things might contribute. Open source is really enabling organizations to move fast. When the Data Lake initiative started, we saw a little bit of a slow start in Europe but more experimentation with the open source stack. And by that the more transformative use cases started really evolving. Like how do I manage interactions of the users with the remote controls as they are watching live TV, those types of transformative use cases became important. And as we move to the transformative use cases, streaming is also very critical because lots of data is available and being able to keep the cloud data stores as well as on-premise data stores and downstream applications fed with fresh data becomes important. We in fact in early June announced that Syncsort is now a part of the Microsoft One Commercial Partner Program. With that, our Integrate solutions for data integration and data quality are Azure Gold certified and Azure ready.
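A minimal sketch of the change data capture pattern described above, publishing change events to Kafka so downstream applications and data lakes stay fresh, might look like the following. It assumes the kafka-python client, a local broker, and an invented topic and event shape; it is not Syncsort's product code.

```python
# Sketch: publish change data capture (CDC) events to Kafka so downstream
# stores stay fresh. Broker address, topic name, and event shape are assumed.
import json
from kafka import KafkaProducer   # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],               # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_change(table: str, op: str, key: dict, after: dict) -> None:
    """Send one CDC record (insert/update/delete) for a source table."""
    event = {"table": table, "op": op, "key": key, "after": after}
    producer.send("cdc.customer_changes", value=event)   # hypothetical topic name

if __name__ == "__main__":
    publish_change(
        table="CUSTOMER",
        op="UPDATE",
        key={"customer_id": 42},
        after={"customer_id": 42, "segment": "gold"},
    )
    producer.flush()   # make sure the event actually reaches the broker
```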
We are in a co-sell agreement and we are helping jointly a lot of customers, moving data and workloads to Azure and keeping those data stores and cloud platforms in sync. >> Right. >> So lots of exciting things, I mean there's a lot happening with the application space. There's also lots still happening connected to the governance cases that we have seen. Feeding security and IT operations data into, again, modern, next generation analytics platforms is key. Whether it's Splunk, whether it's Elastic, as part of the Hadoop stack. So we are still focused on governance as part of these multi-cloud and on-premise and cloud implementations as well. We in fact launched our Ironstream for IBM i product to help customers, not just making this data available from mainframes but also from IBM i into Splunk, Elastic and other security information and event management platforms. And today we announced workflow optimization across on-premise and multi-cloud and cloud platforms. So lots of focus across the optimize, assure and integrate portfolio of products helping customers with the business use cases. That's really our focus as we innovate organically and also acquire technologies and solutions. What are the problems we are solving and how we can help our customers with the business and operation analytics, targeting those mega trends around data governance, cloud, streaming and also data science. >> What is the biggest trend do you think that is sort of driving all of these changes? As you said, the data is evolving. The use cases are evolving. What is it that is keeping your customers up at night? >> Right now it's still governance, keeping them up at night, because this evolving architecture is also making governance more complex, right? If we are looking at financial services, banking, insurance, healthcare, there are lots of existing infrastructures, mission critical data stores on mainframe and IBM i, in addition to this gravity of data changing and lots of data with the online businesses generated in the cloud. So how to govern that, also while optimizing and making those data stores available for next generation analytics, makes the governance quite complex. So that really keeps them up at night, and creates a lot of opportunity for the community, right? All of us here to address those challenges. >> Because it sounds to me, I'm hearing Splunk, advanced machine data, I think of the internet of things and sensor grids. I'm hearing IBM mainframes, that's transactional data, that's your customer data and so forth. It seems like much of this data that you're describing that customers are trying to cleanse and consolidate and provide strict governance on, is absolutely essential for them to drive more artificial intelligence into end applications and mobile devices that are being used to drive the customer experience. Do you see more of your customers using your tools to massage the data sets, as it were, that data scientists then use to build and train their models for deployment into edge applications? Is that an emerging area where your customers are deploying Syncsort? >> Thank you for asking that question. >> It's a complex question. (laughing) But thanks for unpacking it...
>> Yeah. >> When we are using artificial intelligence and machine learning, the implications, the impact of bad data multiplies. Multiplies with the training of historical data. Multiplies with the insights that we are getting out of that. So data scientists today are still spending significant time on preparing the data for the iPipeline, and the data science pipeline, that's where we shine. Because our integrate portfolio accesses the data from all enterprise data stores and cleanses and matches and prepares that in a trusted manner for use for advanced analytics with machine learning, artificial intelligence. >> Yeah 'cause the magic of machine learning for predictive analytics is that you build a statistical model based on the most valid data set for the domain of interest. If the data is junk, then you're going to be building a junk model that will not be able to do its job. So, for want of a nail, the kingdom was lost. For want of a Syncsort, (laughing) Data cleansing and you know governance tool, the whole AI superstructure will fall down. >> Yes, yes absolutely. >> Yeah, good. >> Well thank you so much Tendu for coming on theCUBE and for giving us a lot of background and information. >> Thank you for having me, thank you. >> Good to have you. >> Always a pleasure. >> I'm Rebecca Knight for James Kobielus. We will have more from theCUBE's live coverage of DataWorks 2018 just after this. (upbeat music)
SUMMARY :
Tendü Yogurtçu, CTO of Syncsort, describes the company's rapid growth, rebranding, and expanded portfolio around optimizing existing infrastructure, assuring data security and availability, and integrating data into next generation analytics platforms. She covers the four mega trends driving customers (data governance, cloud, streaming, and data science), the need to future-proof applications across hybrid and multi-cloud environments, real-time change data capture feeding Kafka and Amazon Kinesis, the strengthened Hortonworks partnership, the Microsoft One Commercial Partner program and Azure certification, Ironstream for IBM i feeding Splunk and Elastic, and why high-quality, well-prepared data is the foundation for machine learning and AI.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Rebecca | PERSON | 0.99+ |
James Kobielus | PERSON | 0.99+ |
James | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Rebecca Knight | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Tendu Yogurtcu | PERSON | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
Europe | LOCATION | 0.99+ |
Rob Thomas | PERSON | 0.99+ |
San Jose | LOCATION | 0.99+ |
U.S. | LOCATION | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
Syncsort | ORGANIZATION | 0.99+ |
1950s | DATE | 0.99+ |
San Jose, California | LOCATION | 0.99+ |
Hortonworks' | ORGANIZATION | 0.99+ |
North America | LOCATION | 0.99+ |
early June | DATE | 0.99+ |
DataWorks | ORGANIZATION | 0.99+ |
over 7000 customers | QUANTITY | 0.99+ |
One | QUANTITY | 0.98+ |
theCUBE | ORGANIZATION | 0.98+ |
DataWorks Summit 2018 | EVENT | 0.97+ |
Elastic | TITLE | 0.97+ |
one | QUANTITY | 0.96+ |
today | DATE | 0.96+ |
IBMI | TITLE | 0.96+ |
four | QUANTITY | 0.95+ |
Splunk | TITLE | 0.95+ |
Tendü Yogurtçu | PERSON | 0.95+ |
Kafka | TITLE | 0.94+ |
this morning | DATE | 0.94+ |
Data Lake | ORGANIZATION | 0.93+ |
DataWorks | TITLE | 0.92+ |
iPipeline | COMMERCIAL_ITEM | 0.91+ |
DataWorks 2018 | EVENT | 0.91+ |
Splunk | PERSON | 0.9+ |
ETL | ORGANIZATION | 0.87+ |
Azure | TITLE | 0.85+ |
Google Cloud | ORGANIZATION | 0.83+ |
Hadoop | TITLE | 0.82+ |
last three year | DATE | 0.82+ |
couple of years back | DATE | 0.81+ |
Syncsort | PERSON | 0.8+ |
HTP | TITLE | 0.78+ |
European | OTHER | 0.77+ |
Tendu | PERSON | 0.74+ |
Europeans | PERSON | 0.72+ |
Data Protection Regulation | TITLE | 0.71+ |
Kinesis | TITLE | 0.7+ |
least one cluster | QUANTITY | 0.7+ |
Ironstream | COMMERCIAL_ITEM | 0.66+ |
Program | TITLE | 0.61+ |
Azure | ORGANIZATION | 0.54+ |
Commercial Partner | OTHER | 0.54+ |
DataFlow | TITLE | 0.54+ |
One | TITLE | 0.54+ |
CTO | PERSON | 0.53+ |
3.0 | TITLE | 0.53+ |
Trange | TITLE | 0.53+ |
Stack | TITLE | 0.51+ |
Stephano Celati, BNova | PentahoWorld 2017
>> Announcer: Live from Orlando, Florida. It's theCube covering PentahoWorld 2017, brought to you buy Hitachi Ventara. >> Welcome back to theCube's live coverage of PentahoWorld, brought to you of course by Hitachi Ventara. I'm your host, Rebecca Knight, along with my cohost James Kobielus. We are joined by Stephano Celati. He is a Pentaho Solutions consultant at BNova. Thanks so much for coming on theCube, Stephano. >> Thank you for having me. >> So I should say congratulations are in order because you are here to accept the Pentaho Excellence Award for the ROI category on behalf of LAZIOcrea. Tell us about the award. >> Yes, as I was saying, I'm really proud of this award because it is something that is related to public administration savings, which is a good thing, first of all for me as a citizen, let's say. This project is about healthcare spending. In Italy the National Healthcare Services allows the drugstore to sell medicines to total or partial reimbursement by NHS itself. And they also have the possibility to replace the medicine with a generic drug which normally costs less to the people and also to the health service itself. So a couple of years ago (speaks in foreign language) which is the political area to which Rome belongs just to explain, launched a new project to monitor, analyze and inspect the spending flow in drugs. So we partnered with LAZIOcrea to create a business analytics platform based on Pentaho obviously, and which collects all the data coming from the prescriptions and store it in an analytical database that is Vertica, and uses PDI/ETL tools to store this data. >> That's for Pentaho Data Integration. >> Yes, PDI is Pentaho Data Integration, good point. And after that we present the data in terms of reporting, analysis, dashboards, to all the people that are interested in this data. So we talk about regional managers, we talk about auditors, and also to local district users which are in charge of managing the expenditure for drugs. The outcome of this project was real impressive because we had an expenditure fell by 3.6%, which in a region where we have more than 200 million prescriptions every year means 34 million Euros in a years. >> Rebecca: Wow. >> So it was really huge result. We were very happy about that. And it was so simple because simply monitoring better the expenditure, monitoring how they deliver the drugs out, what kind of medicine they prescribe and targeting what pharmacies sell to the end user just gave these impressive results. And this year they are forecasting for 41 million Euros in savings more, so it's a huge result. It's something that is for us really a good result. >> So here in the U.S., I mean we have problems very similar to what you just described in Italy. And just putting the transparency around the data would be a huge revelation for the United States, too. How big a departure was it in Italy? >> Well, it was a really a big problem to start because they didn't have any system to collect all this data. So they had to set up everything from scratch, let's say, just by acquiring the paper where the physician writes the recipe, so it was not that easy to build it from scratch. But after that the region has had the opportunity to monitor this data and also to publish this data, which is something that in Italy is really relevant in this moment because we are talking about open government, we are talking about open data, and so again, the result was really impressive. 
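As an illustration of the kind of aggregate such a monitoring platform produces, the sketch below computes monthly reimbursed spend per district and the share covered by generics. The schema, column names, and figures are invented; the actual project loads prescription flows into Vertica with Pentaho Data Integration rather than working in memory with pandas.

```python
# Sketch: monthly reimbursed drug spend per district and generic share.
# Table layout, column names, and values are invented for illustration.
import pandas as pd

def expenditure_report(prescriptions: pd.DataFrame) -> pd.DataFrame:
    """Summarize reimbursed spend and generic share by district and month."""
    df = prescriptions.copy()
    df["month"] = df["dispensed_on"].dt.to_period("M")
    df["generic_eur"] = df["reimbursed_eur"].where(df["is_generic"], 0.0)
    report = (
        df.groupby(["district", "month"])
          .agg(total_spend=("reimbursed_eur", "sum"),
               generic_spend=("generic_eur", "sum"))
          .reset_index()
    )
    report["generic_share"] = report["generic_spend"] / report["total_spend"]
    return report

if __name__ == "__main__":
    data = pd.DataFrame({
        "district": ["Roma 1", "Roma 1", "Latina"],
        "dispensed_on": pd.to_datetime(["2017-03-02", "2017-03-15", "2017-03-20"]),
        "reimbursed_eur": [12.5, 4.0, 9.9],
        "is_generic": [False, True, False],
    })
    print(expenditure_report(data))
```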
Do you see any follow on opportunities to use this data for other purposes other than the initial application? >> Yes, we already experienced a different usage of this data because during the last major earthquake we had in 2016 in this area, those guys from LAZIOcrea were able to produce a list of most of the drugs in that area in just a couple of hours, just by using the ETL and setting up this list, which somehow helped the first aid units in giving the right assistance on time. And next steps will be about hyper prescriptions, because we want to monitor if there are any doctors that prescribe drugs that are not really necessary. And we also try to move our inspection to hospitals, because when you do a surgery, you get medicine, you get a lot of assistance in the hospital. So we want also to monitor that kind of aspect, which is again in charge of the health system. >> To make sure that the right medicines are being distributed to the right regions at the right time for the intent to likely-- >> Yes, this could also lead to something that is a correlation analysis, meaning what is your pain and what are you taking, so that they can have historical data they can use to prescribe better medicines. >> But the anecdote he was sharing about the earthquake too is really compelling too, if you think about a public health crisis and outbreak of some sort, to be able to get drugs quickly to those in need, it's really astonishing. >> Again, this morning we were talking about data lake. This is a sort of data lake. We found several ways to use that data, to fish them back from the data, let's say from the lake, and it's really impressive what you can do if you have the right information and you know how to use it. >> How do you see the market developing over the next year, next five years? >> Yes, the problem in Italy is that the market is not so responsive to innovation like others, let's say U.S. or U.K. and Europe. So for this reason my company BNova set up an annual event which is called Big Data Tech, and the purpose of this event is to spread knowledge about big data systems, products, architecture and so on, which helps companies in knowing better what they can do with these platforms. So in the next months we see a lot of opportunities. Generally speaking, in the data mining field, we start talking about predictive analysis, we start talking about smart cities and other stuff like that. So again, we will need maybe to enter in a new phase of, let's say (mumbling), because companies like BNova and others that operate in this field of business analytics need to bring to general knowledge what other innovative companies are doing. So in the next months we will for sure move to newer architectures, new technology, and we will have to support all the companies with this kind of stuff. >> In terms of the new technology you're moving to, is there a role for the internet of things, both in your plans and really in terms of the Italian market? What sort of potential applications are there for IoT, related perhaps to the use of it with health data going forward in Italy? >> Yes, also for healthcare, but in Italy the IoT theme is a parallel line that is growing thanks to a governmental initiative which is called Industry 4.0, which encourages the usage of interconnected machines, connected to the internet, so a classical approach of the IoT field. So with this new approach and the government's support we believe that IoT will have a big improvement in the next years.
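A hedged sketch of the hyper-prescription monitoring mentioned above: flag prescribers whose monthly volume sits far above their peers. The data, threshold, and field names are invented, and a real check would run against the regional prescription store rather than an in-memory dictionary.

```python
# Sketch: flag prescribers whose monthly prescription volume is an outlier.
# IDs, volumes, and the threshold are invented for illustration.
from statistics import mean, stdev

def flag_outlier_prescribers(monthly_counts: dict[str, int],
                             z_threshold: float = 3.0) -> list[str]:
    """Return prescriber IDs whose volume exceeds mean + z_threshold * stdev."""
    counts = list(monthly_counts.values())
    if len(counts) < 2:
        return []
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [doc for doc, n in monthly_counts.items() if (n - mu) / sigma > z_threshold]

if __name__ == "__main__":
    volumes = {"doc-001": 120, "doc-002": 135, "doc-003": 118,
               "doc-004": 640, "doc-005": 127}
    # With only five prescribers the sample deviation is large, so a looser
    # threshold is used here; it singles out doc-004.
    print(flag_outlier_prescribers(volumes, z_threshold=1.5))
```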
Again, we are talking about Italy, so we are not so fast in growing. But again, we are starting to talk about smart cities for energy saving, sustainable energy and other stuff in which the IOT plays a key role. So as far as our business is concerned, that is business analytics, so on top of that we see a lot of opportunities coming from predictive analysis, which means to prevent the maintenance of a machine, for example, or to use virtual reality to simulate a laboratory test and other stuff. So with these opportunities for sure the usage of data mining tools, such Wake Up when we're talking about Pentaho Solutions, could be a great advantage because you will apply the knowledge to your data. So you will not only analyze the data, but you will also extract some sort of knowledge from the data which can help companies. >> Of course, Italy is where the renaissance began, and it just sounds like you, I mean renaissance use of analytics to help the Italian people and the Italian economy to continue to grow and innovate. >> Stephano: Yes, yes. >> So I want to see not a data lake, a data colosseum, that should be on your to do list. >> I want a data gallery with lots of data masterpieces hanging on the walls all around Italy. >> Exactly. >> You'll be the new Leonardo and Michelangelo. >> Stefano , I love it. Well, thank you so much for coming on theCube. >> Thank you for having me. >> I am Rebecca Knight for Jim Kubielus. We will have more from PentahoWorld just after this.
SUMMARY :
Stephano Celati, a Pentaho solutions consultant at BNova, accepts the Pentaho Excellence Award in the ROI category on behalf of LAZIOcrea. He describes a healthcare-spending analytics platform for the Lazio region that collects prescription data with Pentaho Data Integration into a Vertica analytical database and serves reports, analysis, and dashboards to regional managers, auditors, and district users; monitoring the flow cut drug expenditure by 3.6 percent, about 34 million euros in a year, with 41 million euros more in savings forecast. He also covers reusing the data during the 2016 earthquake, plans to monitor over-prescription and hospital spending, BNova's Big Data Tech event, and the outlook for IoT, predictive analytics, and data mining in the Italian market.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
James Kobielus | PERSON | 0.99+ |
Rebecca Knight | PERSON | 0.99+ |
Rebecca | PERSON | 0.99+ |
Stephano Celati | PERSON | 0.99+ |
Stephano | PERSON | 0.99+ |
Italy | LOCATION | 0.99+ |
Stefano | PERSON | 0.99+ |
2016 | DATE | 0.99+ |
Jim Kubielus | PERSON | 0.99+ |
Michelangelo | PERSON | 0.99+ |
LAZIOcrea | ORGANIZATION | 0.99+ |
Europe | LOCATION | 0.99+ |
Orlando, Florida | LOCATION | 0.99+ |
Leonardo | PERSON | 0.99+ |
U.S. | LOCATION | 0.99+ |
National Healthcare Services | ORGANIZATION | 0.99+ |
3.6% | QUANTITY | 0.99+ |
U.K. | LOCATION | 0.99+ |
Pentaho Solutions | ORGANIZATION | 0.99+ |
BNova | ORGANIZATION | 0.99+ |
Bnova | ORGANIZATION | 0.99+ |
34 million Euros | QUANTITY | 0.99+ |
more than 200 million prescriptions | QUANTITY | 0.99+ |
next month | DATE | 0.99+ |
Hitachi Ventara | ORGANIZATION | 0.99+ |
PentahoWorld | ORGANIZATION | 0.98+ |
NHS | ORGANIZATION | 0.98+ |
this year | DATE | 0.98+ |
United States | LOCATION | 0.98+ |
both | QUANTITY | 0.98+ |
Pentaho Excellence Award | TITLE | 0.97+ |
41 million Euros | QUANTITY | 0.97+ |
next year | DATE | 0.97+ |
Pentaho | ORGANIZATION | 0.96+ |
outbreak | EVENT | 0.95+ |
first aid | QUANTITY | 0.92+ |
this morning | DATE | 0.91+ |
2017 | DATE | 0.89+ |
every year | QUANTITY | 0.87+ |
couple of years ago | DATE | 0.86+ |
Italian | LOCATION | 0.85+ |
Big Data Tech | EVENT | 0.84+ |
a years | QUANTITY | 0.81+ |
next five years | DATE | 0.8+ |
Rome | ORGANIZATION | 0.79+ |
PentahoWorld 2017 | EVENT | 0.78+ |
PentahoWorld | TITLE | 0.74+ |
Italian | OTHER | 0.73+ |
BNova | PERSON | 0.72+ |
theCube | ORGANIZATION | 0.68+ |
couple of hours | QUANTITY | 0.68+ |
Wake Up | TITLE | 0.67+ |
4.0 | OTHER | 0.64+ |
Vertica | ORGANIZATION | 0.62+ |
next years | DATE | 0.59+ |
first | QUANTITY | 0.58+ |
ETL | ORGANIZATION | 0.42+ |
ROI | TITLE | 0.36+ |
Claus Moldt, FICO | AWS Summit 2017
>> Announcer: Live from Manhattan, it's theCUBE! Covering AWS Summit, New York City, 2017. Brought to you by Amazon Web Services. >> And welcome back here on theCUBE, continuing our coverage of AWS Summit here, 2017. We're at the Javits Center. Hustlin', bustlin' midtown New York. A lot of things happening here in Manhattan, one of those things happening is Stu Miniman. Stu, you're always happening. >> Thank you John. >> Are you curious about your credit score, by the way? Do you have any inclination or any kind of curiosity about that? >> John, I'm happy with my credit score, I don't think I need any more credit, is the thing I think we're talking about. >> Well just in case, we have with us the CIO of FICO to join us, Claus Moldt, Claus, good to see ya. >> Thank you very much, good to be here. >> Yeah, we'll get to the credit scores later, cuz we do want to touch base on that. >> We do want to check up on that. >> Nice job on the keynote stage. >> Thank you. >> You talked about a lot of things, you had processing, planning, automation, managing, microservices, a lot of, for folks at home who weren't privy to the presentation, just kind of sum it up a little bit for me, if you would, the message you were trying to get across this morning. >> Very high level. We are a 61-year-old company, we built a ton of software which we primarily have delivered on-prem. And it was about four years ago, that's when we started to go to our private cloud and develop our solutions on the private cloud. But it was mostly done in a lift-and-shift fashion. We took the solutions, implemented in our data centers, optimized it a little bit so we could do the shared services for the cloud, et cetera. But as we saw our customers starting to go to the public cloud, a lot of financial institutions, now it's more secure to run and have your data in the public cloud, we have audit and compliance associated with the public cloud, so we obviously wanted to go to the public cloud so we could meet our customers there. So that was a very, very big message today. By going to the public cloud, obviously we reap the benefits of what the public cloud has to offer. We can lower our cost, we still had to rewrite a lot of our applications to take full advantage of the services that AWS could provide, and that means that our new applications are able to scale up and scale down; you also build the images that we deploy on AWS so we can deploy them at a much more rapid pace. So we can enable scale for our customers, setting the solutions up in days and not weeks or months, like we used to, so that's another huge benefit. And we talked about all the regions that AWS provides. 16 regions around the globe. We want to grow with our customer base, and we don't want to build data centers around the globe. There's absolutely no need to, no value added in doing so. So we go where AWS goes, and AWS keeps expanding their regions, and we can deploy our software, now, at a rapid pace, again, in the various regions. And then finally, what I said, which is very important, that's about security and compliance. Security aspects, we've gotten a significant amount of help, so we build our services in a very secure fashion, but a lot of the services that AWS now provides are already pre-audited, and hence compliance, such as PCI, et cetera, is inherited as part of these services.
So our solution, we use the extension of the services that AWS provides, and that, of course, enables us to be able to go through the audit process at a much more rapid pace. So as you can hear, a significant amount of benefits moving to the public cloud. >> Claus, 61-year-old company, you know, obviously lots of legacy, probably lots of applications, where are you with your application portfolio? How much do you still own on-prem versus public cloud, and how do you make those kinds of decision points? >> Yeah, we already had a pretty significant amount of our install base still on the private cloud, as well as on-prem, right, the majority is still on-prem. Having said that, more and more of our customers have asked, How can we be smarter, so we don't have to maintain all the upgrades, all the security, et cetera, how can you enable us to move faster? So what we are seeing is that our customers are asking us to move to the cloud. And the cloud for them probably doesn't always mean the public cloud or the private cloud, they just want somebody else to manage their infrastructure. Having said that, a lot of them, as I said, have started to experiment with the public cloud. And that means that they're learning more and more about AWS and how to operate there, and they're asking us to go there. So I would say, we're still early in our journey. I would say there's a high demand for us to deliver our services to the cloud. And delivering to the private cloud, we probably just can't accelerate and do it fast enough for the customers who want to migrate, hence the reason for why we're going to an already-API-enabled infrastructure, to deal with these services. >> Obviously, Amazon has a lot of data services, you need to worry about your governance and compliance, I got a note from the community, actually, wondering if you've had the chance to look at Amazon Glue? You know, things like ETL, how much of a burden is that for you, is that offering something that's compelling? How do you really look at that space? >> Yeah, so obviously, as part of our services, we use ETL service, we developed our own ETL service, and we do that specifically to our products. Having said that, we look at every single service that AWS brings to the table, and we looked at Glue. Glue may not deliver exactly what we need to do at this point in time, but we think that, as Glue evolves, we're likely going to use the services. We rather would not develop and maintain those services ourselves, no good reason. So if all the criteria for the next-generation services on AWS is met, and it's an easy shift, I mean, it's a no-brainer for us to use those features, right? >> But how long have you been the CIO of FICO? >> I joined about 18 months ago. I had my own company before, ran the infrastructure for Salesforce for around seven years, and then ran the infrastructure for eBay for around four years. So I grew up in the cloud world. (laughing) >> So it sets me up, you know, one of the questions we've all been looking at for the last decade, what does cloud mean for the role of the CIO? >> Yeah, well, "cloud" means a lot of things for me, right? It definitely means that I focus on evolving the business. Focus on the business value I can bring to the table. Not focus on, really, building infrastructure, which doesn't really add any value to our day-to-day. It did, once, one time, where a lot of the feature set or security aspects or deployment aspects was not where they needed to be in the cloud. 
The services now that AWS provides give us the ability to use those services and rewrite our stack, so I don't have to worry about our capex, et cetera, we shift it all to OPEX, and we scale it as we see fit with our customers. We've got faster deployments, faster ways for innovation, utilizing all the new services that are being deployed. And that, to me, is truly a business benefit. I don't want to run data services, it doesn't add a significant amount of value. >> I did an interview with FICO a couple of years ago where it was early in looking at container services. Bring us up to speed, where do containers fit into your environment, what do you look to Amazon for in that environment, are you playing with serverless at all yet? >> Yeah, obviously we experiment with all of the new functionality that's being brought to the table. We have done quite a bit on the Docker front, we are evaluating Kubernetes, we're excited that that is the direction that AWS is going, we would like to see some of the things move a little faster but that's always the case -- >> That was the news last week, you know, we're hoping to get Adrian on because we're supporting CNCF and Kubernetes how, when? You know! (laughs) >> Exactly, exactly. But of course we're experimenting, and the moment that it's there and available for us, I'm fairly certain that we'll head down that route, right, it's nice packaging, it's an easy deployment and easy update. You also talked about Serverless. Serverless is a big thing for us. We mostly use it for admin functions, to kick things off, to enable the auto-scaling, et cetera. We don't really run the critical transactions as a part of Serverless, as you can imagine, because we operate in largely regulated industries, so we have to have significant logging around what we do. But for a lot of the admin functions, we actually already use Serverless as part of our platform on AWS. >> How, you've been on the job since April last year, you walk in at this very transformative time, in FICO. And the role of the CIO in general was at a transformative time too, because you have a lot more options. So scale all that in terms of speed and how quickly you have to make decisions, how your role changes now, because of the capabilities that you have at your disposal and the options that you have to decide between. >> Yeah, you know the interesting part of this is, and we talked about it with some of the other speakers backstage, we've seen the move before, right? At Salesforce, we built our own infrastructure, et cetera, but we used AWS as well for a lot of our development. So it's not like it's so new anymore. AWS, even, is not a young company anymore, it actually is a proven company, you heard it, right, a million users, et cetera. So for us it's pretty easy to go with proven technology. To learn where others have been, right. We stand on the shoulders of giants. We have a lot of companies that already have done it, and moved to the cloud and run successfully in the cloud. I think, actually, the financial and insurance industry is some of the folks that are late to the game, if you ask me, because they have not been used to running in public clouds. So that mindset is something that you have to bring to the table, and we have to ensure that we educate all of our folks that have been used to on-prem, that have been used to operating in a certain world and still run COBOL systems on mainframes, on what it means to move to the cloud. And that's a big transformation, to change that mindset.
And operate, also, in an agile way. We have to change the way that we operate, the way that we plan, the way that we deliver, so that it's all coordinated across multiple product lines and planned out and delivered in an agile fashion so we can sync our product and have the products interact with each other. And that delivery cycle and the way that we package things has to be thought of very differently. So for that we actually created a series of micro-learning within the company, where we just recorded five minutes of our CEO, five minutes of some of the engineering, the packaging, so everybody could get up to speed at a very fast pace of what it really means. And then we do deep-dive curriculums that are specific to your role within the company. >> Claus, what's on your list of what you're looking for from Amazon and your other vendors out there, to make your life easier? It sounded like Glue sounded interesting but doesn't fit exactly the way you do it, Kubernetes, you're keeping an eye on, what else is out there? >> Yeah, obviously we want to see reference architectures. We want to see and learn from what people have done before. In some cases we will be first to market with certain things but we're looking to get that jumpstart in general, right. We want to leapfrog the way that we deliver our stuff, we want to make sure that we can do faster, bigger, better, smarter, and that means that we have to test and validate a ton of technologies out there. If there's somebody that's already gone down the routes that we really can have a pens-down kind of discussion with to understand what's actually going on, that helps us. That helps us understand what's been done, what the capabilities are, et cetera. And that we can utilize in how we deliver our service to our customers. So when you think about it, it's less than 12 months ago that we started the AWS journey. We now have both MIFI scores on AWS, we have marketing services, and we're just about to release a series of our solutions on AWS for the financial services, so it's fast, it's going fast. And we have to understand all the new technologies. >> I feel like I've got whiplash going on right now because, you've covered a lot of ground in a very short period of time! >> Yeah, well as I said, we're fortunate enough to have folks that understand clouds, and that helps, and this is really a big team effort, right? All the way from engineering, our CEO, our CTO, sales has to understand how to sell the platform, rather than sell solutions. So there's a lot of education that has to go on, and I think that's actually key, to ensure that you bring the whole company with you, you know. It doesn't matter if you have one or two or five people that can run really fast, because they'll turn around and find out that nobody's running with them. So we have to make sure that we bring the whole company with us as we implement these solutions. And explain what it is! What are the benefits, why, how does it work? And we put a significant amount of effort into that in our company. >> Well, can we work on this FICO credit thing off-camera? How about that, off-camera? >> Yeah, good luck with that! (all laughing) >> Claus, thanks for being with us, we appreciate your time and again, nicely done this morning on the keynote stage. Claus Moldt, CIO of FICO joining us here on the Cube, we continue our coverage from the AWS Summit from New York, right after this!
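As a rough illustration of the "Serverless for admin functions" pattern Claus describes, kicking off scaling actions with an audit trail around them, here is a minimal sketch. It is not FICO's implementation; the Auto Scaling group name, event fields, and capacity logic are illustrative assumptions, and the only AWS calls used are a standard Lambda-style handler and the EC2 Auto Scaling set_desired_capacity API.

```python
# Hypothetical sketch: a Lambda-style admin function that adjusts an Auto Scaling
# group and logs the action for audit. Group name and event fields are assumptions.
import json
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

autoscaling = boto3.client("autoscaling")


def handler(event, context):
    group = event.get("asg_name", "example-app-asg")    # hypothetical ASG name
    desired = int(event.get("desired_capacity", 2))     # hypothetical default

    # Regulated workloads need an audit trail, so log the intent before acting.
    logger.info("Scaling request: %s", json.dumps({"group": group, "desired": desired}))

    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group,
        DesiredCapacity=desired,
        HonorCooldown=True,
    )
    logger.info("Set desired capacity of %s to %d", group, desired)
    return {"group": group, "desired_capacity": desired}
```

Critical transaction paths would stay outside a function like this, as Claus notes; a handler of this kind only automates the operational housekeeping around them.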
SUMMARY :
Brought to you by Amazon Web Services. We're at the Javitz Center. is the thing I think we're talking about. Well just in case, we have with us the CIO of FICO cuz we do want to touch base on that. the message you were trying to get across this morning. By going to the public cloud, obviously we reap the benefits And delivering to the private cloud, we probably just can't brings to the table, and we looked at Glue. I had my own company before, ran the infrastructure and we scale it as we see fits with our customers. I did an interview with FICO a couple of years ago Yeah, obviously we experiment with all of the new But for a lot of the admin functions, we actually already and the options that you have to decide between. And that delivery cycle and the way that we package things And that we can utilize in how we deliver our service So we have to make sure that we bring the whole company we continue our coverage from the AWS Summit
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Amazon | ORGANIZATION | 0.99+ |
one | QUANTITY | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Amazon Web Services | ORGANIZATION | 0.99+ |
Claus Moldt | PERSON | 0.99+ |
John | PERSON | 0.99+ |
Stu Miniman | PERSON | 0.99+ |
Manhattan | LOCATION | 0.99+ |
two | QUANTITY | 0.99+ |
New York | LOCATION | 0.99+ |
five minutes | QUANTITY | 0.99+ |
eBay | ORGANIZATION | 0.99+ |
Claus | PERSON | 0.99+ |
April last year | DATE | 0.99+ |
last week | DATE | 0.99+ |
FICO | ORGANIZATION | 0.99+ |
AWS Summit | EVENT | 0.99+ |
Stu | PERSON | 0.99+ |
Javits Center | LOCATION | 0.99+ |
2017 | DATE | 0.99+ |
five people | QUANTITY | 0.99+ |
around four years | QUANTITY | 0.98+ |
61-year-old | QUANTITY | 0.98+ |
Adrian | PERSON | 0.98+ |
one time | QUANTITY | 0.98+ |
around seven years | QUANTITY | 0.98+ |
both | QUANTITY | 0.97+ |
16 regions | QUANTITY | 0.97+ |
first | QUANTITY | 0.97+ |
New York City | LOCATION | 0.96+ |
today | DATE | 0.96+ |
a million users | QUANTITY | 0.95+ |
AWS Summit 2017 | EVENT | 0.95+ |
Serverless | TITLE | 0.95+ |
less than 12 months ago | DATE | 0.93+ |
Glue | TITLE | 0.91+ |
18 months ago | DATE | 0.88+ |
Kubernetes | TITLE | 0.86+ |
this morning | DATE | 0.82+ |
capex | ORGANIZATION | 0.82+ |
COBOL | ORGANIZATION | 0.81+ |
about four years ago | DATE | 0.79+ |
a couple of years ago | DATE | 0.77+ |
OPEX | ORGANIZATION | 0.74+ |
single service | QUANTITY | 0.72+ |
Salesforce | TITLE | 0.71+ |
Kubernetes | PERSON | 0.71+ |
last decade | DATE | 0.69+ |
once | QUANTITY | 0.58+ |
ETL | ORGANIZATION | 0.56+ |
Glue | ORGANIZATION | 0.51+ |
Cube | COMMERCIAL_ITEM | 0.45+ |
Shafaq Abdullah, The Honest Company - #SparkSummit - #theCUBE
>> Announcer: Covering Spark Summit 2017, brought to you by Databricks. >> This is theCUBE, and we're having a great time at Spark Summit 2017. One of our last guests of the day is Shafaq Abdullah, who is the director of data infrastructure at the Honest Company. Shafaq, welcome to the show. >> Thank you. >> Now, I heard about The Honest Company because of the celebrity founder, right, Jessica Alba? >> Shafaq: That's correct. >> Okay, but how did you end up at the company, weren't you at a start-up before? >> That's exactly correct. So, basically, we did a start-up called InSnap before we actually got into Honest, and the way it happened is that, Insnap was more about instantaneous building personas and using machine learning and Big Data Stack, and Honest at that time was trying to find someone who could them with the data challenges. So, Insnap was the right piece in some of its technology and expertise in big data and machine learning, so we basically built a real-time, instantaneous personas to increase engagement and monetization. It was backed up by Big Data machine learning and Spark instead of our technology. So we used that to basically help Honest to really become data driven, to solve their next generation problem of making products which drive value out of data, and understand their customers better, operate better business, optimize business better. That is why they acquired us, and essentially, we deal with the technology in their stack, not only the technology, but also the culture, the business processes, and the teams which operate those. >> Okay, we're going to dive into some of the technical details about what you're developing with George in just a second, but I have to ask, the company culture is really important at The Honest Company, right? They're well known for being eco-friendly and socially responsible. What was it like moving from a start-up into that company environment, or was it just a natural? >> Basically, of course, Honest was a much bigger start-up for four or five years after it was initially created, so we at Insnap, very lean, agile and much more data driven. That was a bigger difference. So the way we solved it was, we actually, they actually allowed us to create our data organization called Data Signs, which was heading all the data initiatives. And then, we worked with other cross function teams, with finance, with accounting, with growth, with sales to basically help them understand what their needs are, and how to become really data driven by driving the value out of the data by using the state of the art technology. So it was a mix of team alignment and cultural change, focused on the business goal, and getting agrigate to gather around it to make the change. I really enjoyed that while we actually carried out this journey of Honest from being just descriptive, which is essentially just finding what has happened in the data, just generating reports for revenue. By becoming more predictive and prescriptive, which is more like advanced analytics and also advanced advisory role, which together plays in making decisions around features, around businesses and the operations. >> And George, you talked to a lot of customers today, and some of the same themes. Do you want to drill down some of the details of what they're doing. >> I'm curious about how you chose the first projects to get quick wins and to establish credibility? >> Yeah, that's actually a very good question. 
Basically, we were focused around the low hanging fruit in order to give us a jump-start and to build our reputation so that we could actually take on much more advanced technology projects. And in order to do that, what we did was, if you go to Honest.com, and you search in their search bar, their search was very flimsy, and it was not revealing good results. We had already built our engine, like a matching engine, so it was very easy to extend it into a full search engine. That was the first deliverable which we could deliver, and we delivered it in under a month and a half or two months, right when we came in. And it was like, hey, these guys just improved our search by 10x or 100x; we are getting much more hits, much more coverage of the search terms. And that set the tone. Then it was like we also wanted to, another piece which we wanted to tackle was, how do we improve Honest recommendations. That was another project. But before doing that, Honest did not even have a data warehouse, which it could call an advisor warehouse, so that you can get all the data in one place, like a data lake, because the data was siloed in organizations, and the analysts could not really get the data into one place and mix and match and analyze the data. So that was another big piece which we did, but we did it very early on. That was the second big deliverable, even before recommendation, the data warehouse. So basically, we plugged in Spark right in the middle, sucked up all the data from different places, shoved the data in, made this ETL engine, which basically extracted, transformed and loaded the data into the data warehouse. Now, this data warehouse basically broke away those silos and made them a cohesive data lake which could be used for driving value and understanding patterns, especially for machine learning, analysts and all the decision makers. >> Was it a data warehouse, or was it a data lake? The reason I ask for the distinction is, a data warehouse is usually extremely well curated for navigation and discoverability, whereas the data lake is, as some people say, just a little step up from a swamp. >> That's right, so basically, when I call it a data lake, I actually call it that because we have two data aggregation or data gathering infrastructures. One is backed by Spark and S3, which we call a data lake, where unstructured, structured data, there are all kinds of data there, mix and match, and it's not that easy sometimes, you need to do some transformation on top of the data which is sitting there in order to really get to the needle in the haystack. But the data warehouse is in Redshift, which basically gets the data from the data lake, or the Spark ETL engine, and then makes it more like a metric-driven report, so that it's easily discoverable and it is more like what the business requires right now. It's more like formal reports, and the dimensions and all those attributes are much more well thought out. Whereas the data lake is kind of like throwing it all in one piece so that at least we have the data in one place, and then we can analyze and process it. >> In putting all the data first in the data lake and then, essentially, refining it into the data warehouse, what did you use to keep track of the lineage and to make sure that you knew the truth, or truthfulness, behind all the data in the data warehouse once it got there? >> So basically, we built a data model on top of S3 and Spark.
We used that data model as a basis, as a source of truth to feed the reports, and that data model was consistent across wherever you find it. So we want to make sure that those attributes, those dimensions and anything related to that data model for the e-commerce as well as the offline portion is consistent. And so we use Spark, we use S3, essentially, to get that data model consistent, and also, we use a bunch of advanced monitoring stuff for that. When we are processing jobs, we want to make sure that we don't lose the data, and we remove the coupling between the systems by decoupling them, and essentially, in the next version, we made it event streams, event-based streams, so that was the general strategy which we adopted in order to make sure that we have consistency around the data lake and data warehouse. >> What would be the next step? So, now you've significantly enhanced business intelligence, and you have the richest repository behind that data warehouse. What would you do either with the data in the data warehouse or the data in the data lake repository? >> So we are constantly enriching our data lake because that needs to be updated all the time, but at the same time, we want to connect business with our metrics; they essentially derive from all of that data which is sitting in the data lake to help optimize a problem. For example, we are working on sales optimization. We are working on operations optimization, demand planning, supply planning, in addition to customer insights. We are also working on other strategic projects. For example, instead of just recommending or predicting LTV or churn, what we are doing is, we are trying to be more prescriptive in our analytics, in which it takes an advisory role and looks over all the marketing spend, not just predicting the high LTV customers, but actually allocating budget for different marketing spend across different channels for omni-channel. For example, TV display ads, you know, all of that, so that's also happening as we speak, as we enrich our data lake and essentially generate those reports. Now, then we also need to circle back with the business folks or decision makers in order to really convince them to use that. So that's why we created these cross-functional teams, aligned to a business goal, contextually aware teams, which know their roles and responsibilities, but at the same time, which can collaborate effectively and produce a result which drives the bottom line. >> What kind of customer insights were you looking for? Do they deliver family products, diapers to the home and that sort of thing? What sort of customer insights were you looking for and how is it working? >> Basically, Honest, in all our target customers, we need to better understand what their needs are. So customer insights, for example, the demographics of the customers. In addition, we also wanted to see what are the things, what are the patterns which are common in customers, so that we can recommend products which are being bought by one segment of customers versus another. Those common properties, it could be mothers who have recently had children, but who live in this neighborhood and have this kind of income level. So how do we ensure that we actually predict their demands before it actually happens.
So we need to understand their habits, we need to understand the context behind it, if we are making some search, how many pages they use for this kind of product or that kind of a product, and similarly other things which enhance the understanding of the customers, make them into different buckets of segments, and then use those segments to target, because we already have data about LTV and churn as predictive models revealing if a customer is going to churn for whatever reason, and we know by doing a similar campaign for other customers this has successfully given us more subscriptions or helped us to reduce churn, that is how we target them and optimize our campaigns or our promotions for that. >> David: Sure. >> We're also looking for the overall lifestyle of the people who are passionate about Honest brands or brands that exhibit similar values, for example, eco-friendly, safe, and trusted products. >> Right, so we have just a couple of minutes to go before we get to the break. This is great stuff and George, I'll come back to you for a final question in just a moment, but in 30 seconds or so, tell us why you selected Databricks. You probably looked at other options, right? >> Shafaq: Absolutely. >> Can you give us a quick, why you made the decision? >> Absolutely, when we came in at Honest, all they had was a bunch of MySQL developers, and very limited big data knowledge. So they needed a jump start in order to really get to that level in a very small amount of time. How is that even achievable? We didn't even have dedicated data-ops on our team. So basically, Databricks helped to bridge that gap by allowing us to get the infrastructure efficiency we needed by spinning up in a hassle-free manner. They also had this notebooks feature where we can scale the code and scale the team by actually reusing the boilerplate code, and similarly, different teams have different expertise. For example, data science teams like Python and data engineers like Scala. So now those Scala people write functions which can be called by teams in data science in the same notebook, essentially giving them the ability to collaborate effectively. And then we also needed some tool to give more traction and visualization for data scientists as well as data engineers. Databricks has a big visualization built in which helps to understand the causation and correlation, at least correlation, right off the bat, without even importing the data into R, or some other external tool, and making those charts. So there are a bunch of advantages there which we wanted. And then it has a platform API, like DBFS, like a distributed file system, it's similar to S3, which are cool APIs which again provided us the jump start which we needed, so in so little time, we actually made not only the data warehouse, but also the data-driven parts. >> It sounds like Databricks has delivered. >> Shafaq: Oh yeah. >> Awesome. All right, George, just enough time for one more question if you want to throw one in. >> This one is kind of technical, but not on the technology side so much as, how do you guys measure attribution between channels and omni-channel marketing? >> That's a very good question. We have this project called Marketing Attribution, and essentially, the scope of that project is, we want to give the right weights to the right clicks of the customer on the journey to subscription or conversion.
So, we have a model which basically uses a bunch of techniques, including weighted and linear regression, to basically come up with some kind of a weighted way of allowing those weights to be distributed among the different channels. And then we also, the first problem to solve is that we needed to instrument logging so that we get those clicks and searches, all of that, into our data lake. That was done beforehand, before starting the MTA project, because we have a bunch of touch points. A customer could be doing a search, he could be calling our sales rep, he could be tracking his order online, or he could be just leaving his cart in a state which is not fulfilled. And then, now we are trying to get it offline also, on top of that, and we are working on it so that we know what a customer is doing in store, and we are using this MTA as the next version of it to give them a seamless experience in a brick and mortar store or online. >> Great, that's great stuff, Shafaq. I wish we had more time to go. We'll talk to you more after we stop rolling. Thank you for being so honest, and we appreciate you being on the show. >> Thank you, I really appreciate it. >> Thank you so much. >> George: Shafaq, that was great. >> All right, to all of you, thank you so much. We're going to be back in a few moments with the daily wrap up. You don't want to miss that. Thank you for joining us on theCUBE for Spark Summit 2017.
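To make the weighted-regression attribution idea Shafaq describes a little more concrete, here is a hedged PySpark sketch. It is not The Honest Company's model; the S3 path, channel list, and column names are illustrative assumptions, and it simply fits a weighted linear regression over per-channel touch counts and then normalizes the coefficients into attribution shares.

```python
# A rough, hypothetical sketch of weighted-regression channel attribution.
# Table path, channel names, and columns are illustrative, not Honest's schema.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("channel-attribution-sketch").getOrCreate()

# One row per customer journey: touch counts per channel, an outcome flag,
# and a per-journey weight (for example, more recent journeys count for more).
journeys = spark.read.parquet("s3://example-bucket/journeys/")   # hypothetical path

channels = ["search", "display", "email", "sales_call"]          # illustrative channels
assembler = VectorAssembler(inputCols=channels, outputCol="features")

# weightCol lets higher-value or more recent journeys influence the fit more.
lr = LinearRegression(featuresCol="features", labelCol="converted", weightCol="weight")
model = lr.fit(assembler.transform(journeys))

# Normalize the non-negative coefficients into per-channel attribution shares.
coefs = [max(c, 0.0) for c in model.coefficients]
total = sum(coefs) or 1.0
for name, c in zip(channels, coefs):
    print(f"{name}: {c / total:.2%}")
```

In practice a model like this would sit downstream of the click and search logging Shafaq mentions, with one row per customer journey assembled from those touch points.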
SUMMARY :
brought to you by Databricks. One of our last guests of the day is Shafaq Abdullah, and essentially, we deal with the technology in their stack, some of the technical details about what you're developing So the way we solved it was, we actually, and some of the same themes. our reputation so that we can actually Was it a data warehouse, or was it a data lake? and then we can analyze and process it. in order to make sure that we have consistency or the data in the data lake repository? but at the same time, we want to connect so that we can recommend products We're also looking for the overall lifestyle of the people to go before we get to the break. in so less amount of time, we actually made those, for one more question if you want to throw on in. so that we know what a customer is doing in store and we appreciate you being on the show. All right, to all of you, thank you so much.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Shafaq | PERSON | 0.99+ |
George | PERSON | 0.99+ |
Jessica Alba | PERSON | 0.99+ |
David | PERSON | 0.99+ |
Shafaq Abdullah | PERSON | 0.99+ |
four | QUANTITY | 0.99+ |
five years | QUANTITY | 0.99+ |
30 seconds | QUANTITY | 0.99+ |
first | QUANTITY | 0.99+ |
Honest | ORGANIZATION | 0.99+ |
10x | QUANTITY | 0.99+ |
100x | QUANTITY | 0.99+ |
The Honest Company | ORGANIZATION | 0.99+ |
one piece | QUANTITY | 0.99+ |
Insnap | ORGANIZATION | 0.99+ |
One | QUANTITY | 0.98+ |
Databricks | ORGANIZATION | 0.98+ |
Honest.com | ORGANIZATION | 0.98+ |
Scala | ORGANIZATION | 0.98+ |
S3 | TITLE | 0.98+ |
under a month and a half | QUANTITY | 0.98+ |
one place | QUANTITY | 0.98+ |
one segment | QUANTITY | 0.98+ |
first projects | QUANTITY | 0.98+ |
Spark Summit 2017 | EVENT | 0.98+ |
Honest Company | ORGANIZATION | 0.98+ |
first problem | QUANTITY | 0.97+ |
Spark | TITLE | 0.97+ |
one more question | QUANTITY | 0.97+ |
Data Signs | ORGANIZATION | 0.97+ |
InSnap | ORGANIZATION | 0.97+ |
two data | QUANTITY | 0.97+ |
today | DATE | 0.94+ |
Python | ORGANIZATION | 0.94+ |
second big | QUANTITY | 0.86+ |
Spark | ORGANIZATION | 0.85+ |
two months | QUANTITY | 0.85+ |
third strums | QUANTITY | 0.78+ |
S3 | ORGANIZATION | 0.7+ |
a second | QUANTITY | 0.69+ |
MTA | TITLE | 0.67+ |
theCUBE | ORGANIZATION | 0.61+ |
Big | ORGANIZATION | 0.59+ |
couple of minutes | QUANTITY | 0.52+ |
last guests | QUANTITY | 0.5+ |
DBFS | TITLE | 0.49+ |
ETL | ORGANIZATION | 0.47+ |
#SparkSummit | EVENT | 0.39+ |
Next-Generation Analytics Social Influencer Roundtable - #BigDataNYC 2016 #theCUBE
>> Narrator: Live from New York, it's the Cube, covering big data New York City 2016. Brought to you by headline sponsors, CISCO, IBM, NVIDIA, and our ecosystem sponsors, now here's your host, Dave Vellante. >> Welcome back to New York City, everybody, this is the Cube, the worldwide leader in live tech coverage, and this is a Cube first, we've got a nine person, actually eight person panel of experts, data scientists, all alike. I'm here with my co-host, James Kobielus, who has helped organize this panel of experts. James, welcome. >> Thank you very much, Dave, it's great to be here, and we have some really excellent brain power up there, so I'm going to let them talk. >> Okay, well thank you again-- >> And I'll interject my thoughts now and then, but I want to hear them. >> Okay, great, we know you well, Jim, we know you'll do that, so thank you for that, and appreciate you organizing this. Okay, so what I'm going to do to our panelists is ask you to introduce yourself. I'll introduce you, but tell us a little bit about yourself, and talk a little bit about what data science means to you. A number of you started in the field a long time ago, perhaps data warehouse experts before the term data science was coined. Some of you started probably after Hal Varian said it was the sexiest job in the world. (laughs) So think about how data science has changed and/or what it means to you. We're going to start with Greg Piateski, who's from Boston. A Ph.D., KDnuggets, Greg, tell us about yourself and what data science means to you. >> Okay, well thank you Dave and thank you Jim for the invitation. Data science in a sense is the second oldest profession. I think people have this built-in need to find patterns, and whatever we find we want to organize the data, but we do it well on a small scale, and we don't do it well on a large scale, so really, data science takes our need and helps us organize what we find, the patterns that we find that are really valid and useful and not just random, I think this is a big challenge of data science. I actually started in this field before the term Data Science existed. I started as a researcher and organized the first few workshops on data mining and knowledge discovery, and the term data mining became less fashionable, became predictive analytics, now it's data science, and it will be something else in a few years. >> Okay, thank you, Eves Mulkearns, Eves, I of course know you from Twitter. A lot of people know you as well. Tell us about your experiences and what data science means to you. >> Well, data science to me is if you take the two words, the data and the science, the science holds a lot of expertise and skills there, it's statistics, it's mathematics, it's understanding the business and putting that together with the digitization of what we have. It's not only the structured data or the unstructured data that you store in the database and try to get out and try to understand what is in there, but even video that is coming in, and then trying to find, like Greg already said, the patterns in there and bringing value to the business, but looking from a technical perspective, and still linking that to the business insights, and you can do that on a technical level, but then you don't know yet what you need to find, or what you're looking for. >> Okay great, thank you. Craig Brown, Cube alum. How many people have been on the Cube actually before? >> I have. >> Okay, good. I always like to ask that question.
So Craig, tell us a little bit about your background and, you know, data science, how has it changed, what's it all mean to you? >> Sure, so I'm Craig Brown, I've been in IT for almost 28 years, and that was obviously before the term data science, but I've evolved from, I started out as a developer. And evolved through the data ranks, as I call it, working with data structures, working with data systems, data technologies, and now we're working with data pure and simple. Data science to me is an individual or team of individuals that dissect the data, understand the data, help folks look at the data differently than just the information that, you know, we usually use in reports, and get more insights on how to utilize it and better leverage it as an asset within an organization. >> Great, thank you Craig, okay, Jennifer Shin? Math is obviously part of being a data scientist. You're good at math I understand. Tell us about yourself. >> Yeah, so I'm a senior principal data scientist at the Nielsen Company. I'm also the founder of 8 Path Solutions, which is a data science, analytics, and technology company, and I'm also on the faculty in the Master of Information and Data Science program at UC Berkeley. So math is part of it, I teach statistics for data science actually this semester, and I think for me, I consider myself a scientist primarily, and data science is a nice day job to have, right? Something where there's industry need for people with my skill set in the sciences, and data gives us a great way of being able to communicate sort of what we know in science in a way that can be used out there in the real world. I think the best benefit for me is that now that I'm a data scientist, people know what my job is, whereas before, maybe five, ten years ago, no one understood what I did. Now, people don't necessarily understand what I do now, but at least they understand kind of what I do, so it's still an improvement. >> Excellent. Thank you Jennifer. Joe Caserta, you're somebody who started in the data warehouse business, and saw that snake swallow a basketball and grow into what we now know as big data, so tell us about yourself. >> So I've been doing data for 30 years now, and I wrote the Data Warehouse ETL Toolkit with Ralph Kimball, which is the best selling book in the industry on preparing data for analytics, and with the big paradigm shift that's happened, you know, for me the past seven years has been, instead of preparing data for people to analyze to make decisions, now we're preparing data for machines to make the decisions, and I think that's the big shift from data analysis to data analytics and data science. >> Great, thank you. Miriam, Miriam Fridell, welcome. >> Thank you. I'm Miriam Fridell, I work for Elder Research, we are a data science consultancy, and I came to data science sort of through a very circuitous route. I started off as a physicist, went to work as a consultant and software engineer, then became a research analyst, and finally came to data science. And I think one of the most interesting things to me about data science is that it's not simply about building an interesting model and doing some interesting mathematics, or maybe wrangling the data, all of which I love to do, but it's really the entire analytics lifecycle, and the value that you can actually extract from data at the end, and that's one of the things that I enjoy most, is seeing a client's eyes light up with a wow, I didn't really know we could look at data that way, that's really interesting.
I can actually do something with that, so I think that, to me, is one of the most interesting things about it. >> Great, thank you. Justin Sadeen, welcome. >> Absolutely, than you, thank you. So my name is Justin Sadeen, I work for Morph EDU, an artificial intelligence company in Atlanta, Georgia, and we develop learning platforms for non-profit and private educational institutions. So I'm a Marine Corp veteran turned data enthusiast, and so what I think about data science is the intersection of information, intelligence, and analysis, and I'm really excited about the transition from big data into smart data, and that's what I see data science as. >> Great, and last but not least, Dez Blanchfield, welcome mate. >> Good day. Yeah, I'm the one with the funny accent. So data science for me is probably the funniest job I've ever to describe to my mom. I've had quite a few different jobs, and she's never understood any of them, and this one she understands the least. I think a fun way to describe what we're trying to do in the world of data science and analytics now is it's the equivalent of high altitude mountain climbing. It's like the extreme sport version of the computer science world, because we have to be this magical unicorn of a human that can understand plain english problems from C-suite down and then translate it into code, either as soles or as teams of developers. And so there's this black art that we're expected to be able to transmogrify from something that we just in plain english say I would like to know X, and we have to go and figure it out, so there's this neat extreme sport view I have of rushing down the side of a mountain on a mountain bike and just dodging rocks and trees and things occasionally, because invariably, we do have things that go wrong, and they don't quite give us the answers we want. But I think we're at an interesting point in time now with the explosion in the types of technology that are at our fingertips, and the scale at which we can do things now, once upon a time we would sit at a terminal and write code and just look at data and watch it in columns, and then we ended up with spreadsheet technologies at our fingertips. Nowadays it's quite normal to instantiate a small high performance distributed cluster of computers, effectively a super computer in a public cloud, and throw some data at it and see what comes back. And we can do that on a credit card. So I think we're at a really interesting tipping point now where this coinage of data science needs to be slightly better defined, so that we can help organizations who have weird and strange questions that they want to ask, tell them solutions to those questions, and deliver on them in, I guess, a commodity deliverable. I want to know xyz and I want to know it in this time frame and I want to spend this much amount of money to do it, and I don't really care how you're going to do it. And there's so many tools we can choose from and there's so many platforms we can choose from, it's this little black art of computing, if you'd like, we're effectively making it up as we go in many ways, so I think it's one of the most exciting challenges that I've had, and I think I'm pretty sure I speak for most of us in that we're lucky that we get paid to do this amazing job. That we get make up on a daily basis in some cases. >> Excellent, well okay. So we'll just get right into it. I'm going to go off script-- >> Do they have unicorns down under? I think they have some strange species right? 
>> Well we put the pointy bit on the back. You guys have in on the front. >> So I was at an IBM event on Friday. It was a chief data officer summit, and I attended what was called the Data Divas' breakfast. It was a women in tech thing, and one of the CDOs, she said that 25% of chief data officers are women, which is much higher than you would normally see in the profile of IT. We happen to have 25% of our panelists are women. Is that common? Miriam and Jennifer, is that common for the data science field? Or is this a higher percentage than you would normally see-- >> James: Or a lower percentage? >> I think certainly for us, we have hired a number of additional women in the last year, and they are phenomenal data scientists. I don't know that I would say, I mean I think it's certainly typical that this is still a male-dominated field, but I think like many male-dominated fields, physics, mathematics, computer science, I think that that is slowly changing and evolving, and I think certainly, that's something that we've noticed in our firm over the years at our consultancy, as we're hiring new people. So I don't know if I would say 25% is the right number, but hopefully we can get it closer to 50. Jennifer, I don't know if you have... >> Yeah, so I know at Nielsen we have actually more than 25% of our team is women, at least the team I work with, so there seems to be a lot of women who are going into the field. Which isn't too surprising, because with a lot of the issues that come up in STEM, one of the reasons why a lot of women drop out is because they want real world jobs and they feel like they want to be in the workforce, and so I think this is a great opportunity with data science being so popular for these women to actually have a job where they can still maintain that engineering and science view background that they learned in school. >> Great, well Hillary Mason, I think, was the first data scientist that I ever interviewed, and I asked her what are the sort of skills required and the first question that we wanted to ask, I just threw other women in tech in there, 'cause we love women in tech, is about this notion of the unicorn data scientist, right? It's been put forth that there's the skill sets required to be a date scientist are so numerous that it's virtually impossible to have a data scientist with all those skills. >> And I love Dez's extreme sports analogy, because that plays into the whole notion of data science, we like to talk about the theme now of data science as a team sport. Must it be an extreme sport is what I'm wondering, you know. The unicorns of the world seem to be... Is that realistic now in this new era? >> I mean when automobiles first came out, they were concerned that there wouldn't be enough chauffeurs to drive all the people around. Is there an analogy with data, to be a data-driven company. Do I need a data scientist, and does that data scientist, you know, need to have these unbelievable mixture of skills? Or are we doomed to always have a skill shortage? Open it up. >> I'd like to have a crack at that, so it's interesting, when automobiles were a thing, when they first bought cars out, and before they, sort of, were modernized by the likes of Ford's Model T, when we got away from the horse and carriage, they actually had human beings walking down the street with a flag warning the public that the horseless carriage was coming, and I think data scientists are very much like that. 
That we're kind of expected to go ahead of the organization and try and take the challenges we're faced with today and see what's going to come around the corner. And so we're like the little flag-bearers, if you'd like, in many ways of this is where we're at today, tell me where I'm going to be tomorrow, and try and predict the day after as well. It is very much becoming a team sport though. But I think the concept of data science being a unicorn has come about because the coinage hasn't been very well defined, you know, if you were to ask 10 people what a data scientist were, you'd get 11 answers, and I think this is a really challenging issue for hiring managers and C-suites when the generants say I was data science, I want big data, I want an analyst. They don't actually really know what they're asking for. Generally, if you ask for a database administrator, it's a well-described job spec, and you can just advertise it and some 20 people will turn up and you interview to decide whether you like the look and feel and smell of 'em. When you ask for a data scientist, there's 20 different definitions of what that one data science role could be. So we don't initially know what the job is, we don't know what the deliverable is, and we're still trying to figure that out, so yeah. >> Craig what about you? >> So from my experience, when we talk about data science, we're really talking about a collection of experiences with multiple people I've yet to find, at least from my experience, a data science effort with a lone wolf. So you're talking about a combination of skills, and so you don't have, no one individual needs to have all that makes a data scientist a data scientist, but you definitely have to have the right combination of skills amongst a team in order to accomplish the goals of data science team. So from my experiences and from the clients that I've worked with, we refer to the data science effort as a data science team. And I believe that's very appropriate to the team sport analogy. >> For us, we look at a data scientist as a full stack web developer, a jack of all trades, I mean they need to have a multitude of background coming from a programmer from an analyst. You can't find one subject matter expert, it's very difficult. And if you're able to find a subject matter expert, you know, through the lifecycle of product development, you're going to require that individual to interact with a number of other members from your team who are analysts and then you just end up well training this person to be, again, a jack of all trades, so it comes full circle. >> I own a business that does nothing but data solutions, and we've been in business 15 years, and it's been, the transition over time has been going from being a conventional wisdom run company with a bunch of experts at the top to becoming more of a data-driven company using data warehousing and BI, but now the trend is absolutely analytics driven. So if you're not becoming an analytics-driven company, you are going to be behind the curve very very soon, and it's interesting that IBM is now coining the phrase of a cognitive business. I think that is absolutely the future. If you're not a cognitive business from a technology perspective, and an analytics-driven perspective, you're going to be left behind, that's for sure. 
So in order to stay competitive, you know, you need to really think about data science think about how you're using your data, and I also see that what's considered the data expert has evolved over time too where it used to be just someone really good at writing SQL, or someone really good at writing queries in any language, but now it's becoming more of a interdisciplinary action where you need soft skills and you also need the hard skills, and that's why I think there's more females in the industry now than ever. Because you really need to have a really broad width of experiences that really wasn't required in the past. >> Greg Piateski, you have a comment? >> So there are not too many unicorns in nature or as data scientists, so I think organizations that want to hire data scientists have to look for teams, and there are a few unicorns like Hillary Mason or maybe Osama Faiat, but they generally tend to start companies and very hard to retain them as data scientists. What I see is in other evolution, automation, and you know, steps like IBM, Watson, the first platform is eventually a great advance for data scientists in the short term, but probably what's likely to happen in the longer term kind of more and more of those skills becoming subsumed by machine unique layer within the software. How long will it take, I don't know, but I have a feeling that the paradise for data scientists may not be very long lived. >> Greg, I have a follow up question to what I just heard you say. When a data scientist, let's say a unicorn data scientist starts a company, as you've phrased it, and the company's product is built on data science, do they give up becoming a data scientist in the process? It would seem that they become a data scientist of a higher order if they've built a product based on that knowledge. What is your thoughts on that? >> Well, I know a few people like that, so I think maybe they remain data scientists at heart, but they don't really have the time to do the analysis and they really have to focus more on strategic things. For example, today actually is the birthday of Google, 18 years ago, so Larry Page and Sergey Brin wrote a very influential paper back in the '90s About page rank. Have they remained data scientist, perhaps a very very small part, but that's not really what they do, so I think those unicorn data scientists could quickly evolve to have to look for really teams to capture those skills. >> Clearly they come to a point in their career where they build a company based on teams of data scientists and data engineers and so forth, which relates to the topic of team data science. What is the right division of roles and responsibilities for team data science? >> Before we go, Jennifer, did you have a comment on that? >> Yeah, so I guess I would say for me, when data science came out and there was, you know, the Venn Diagram that came out about all the skills you were supposed to have? I took a very different approach than all of the people who I knew who were going into data science. Most people started interviewing immediately, they were like this is great, I'm going to get a job. I went and learned how to develop applications, and learned computer science, 'cause I had never taken a computer science course in college, and made sure I trued up that one part where I didn't know these things or had the skills from school, so I went headfirst and just learned it, and then now I have actually a lot of technology patents as a result of that. 
So to answer Jim's question, actually. I started my company about five years ago. And originally started out as a consulting firm slash data science company, then it evolved, and one of the reasons I went back in the industry and now I'm at Nielsen is because you really can't do the same sort of data science work when you're actually doing product development. It's a very very different sort of world. You know, when you're developing a product you're developing a core feature or functionality that you're going to offer clients and customers, so I think definitely you really don't get to have that wide range of sort of looking at 8 million models and testing things out. That flexibility really isn't there as your product starts getting developed. >> Before we go into the team sport, the hard skills that you have, are you all good at math? Are you all computer science types? How about math? Are you all math? >> What were your GPAs? (laughs) >> David: Anybody not math oriented? Anybody not love math? You don't love math? >> I love math, I think it's required. >> David: So math yes, check. >> You dream in equations, right? You dream. >> Computer science? Do I have to have computer science skills? At least the basic knowledge? >> I don't know that you need to have formal classes in any of these things, but I think certainly as Jennifer was saying, if you have no skills in programming whatsoever and you have no interest in learning how to write SQL queries or RR Python, you're probably going to struggle a little bit. >> James: It would be a challenge. >> So I think yes, I have a Ph.D. in physics, I did a lot of math, it's my love language, but I think you don't necessarily need to have formal training in all of these things, but I think you need to have a curiosity and a love of learning, and so if you don't have that, you still want to learn and however you gain that knowledge I think, but yeah, if you have no technical interests whatsoever, and don't want to write a line of code, maybe data science is not the field for you. Even if you don't do it everyday. >> And statistics as well? You would put that in that same general category? How about data hacking? You got to love data hacking, is that fair? Eaves, you have a comment? >> Yeah, I think so, while we've been discussing that for me, the most important part is that you have a logical mind and you have the capability to absorb new things and the curiosity you need to dive into that. While I don't have an education in IT or whatever, I have a background in chemistry and those things that I learned there, I apply to information technology as well, and from a part that you say, okay, I'm a tech-savvy guy, I'm interested in the tech part of it, you need to speak that business language and if you can do that crossover and understand what other skill sets or parts of the roles are telling you I think the communication in that aspect is very important. >> I'd like throw just something really quickly, and I think there's an interesting thing that happens in IT, particularly around technology. We tend to forget that we've actually solved a lot of these problems in the past. If we look in history, if we look around the second World War, and Bletchley Park in the UK, where you had a very similar experience as humans that we're having currently around the whole issue of data science, so there was an interesting challenge with the enigma in the shark code, right? 
And there was a bunch of men put in a room and told, you're mathematicians and you come from universities, and you can crack codes, but they couldn't. And so what they ended up doing was running these ads, and putting challenges, they actually put, I think it was crossword puzzles in the newspaper, and this deluge of women came out of all kinds of different roles without math degrees, without science degrees, but could solve problems, and they were thrown at the challenge of cracking codes, and invariably, they did the heavy lifting. On a daily basis for converting messages from one format to another, so that this very small team at the end could actually get in play with the sexy piece of it. And I think we're going through a similar shift now with what we're refer to as data science in the technology and business world. Where the people who are doing the heavy lifting aren't necessarily what we'd think of as the traditional data scientists, and so, there have been some unicorns and we've championed them, and they're great. But I think the shift's going to be to accountants, actuaries, and statisticians who understand the business, and come from an MBA star background that can learn the relevant pieces of math and models that we need to to apply to get the data science outcome. I think we've already been here, we've solved this problem, we've just got to learn not to try and reinvent the wheel, 'cause the media hypes this whole thing of data science is exciting and new, but we've been here a couple times before, and there's a lot to be learned from that, my view. >> I think we had Joe next. >> Yeah, so I was going to say that, data science is a funny thing. To use the word science is kind of a misnomer, because there is definitely a level of art to it, and I like to use the analogy, when Michelangelo would look at a block of marble, everyone else looked at the block of marble to see a block of marble. He looks at a block of marble and he sees a finished sculpture, and then he figures out what tools do I need to actually make my vision? And I think data science is a lot like that. We hear a problem, we see the solution, and then we just need the right tools to do it, and I think part of consulting and data science in particular. It's not so much what we know out of the gate, but it's how quickly we learn. And I think everyone here, what makes them brilliant, is how quickly they could learn any tool that they need to see their vision get accomplished. >> David: Justin? >> Yeah, I think you make a really great point, for me, I'm a Marine Corp veteran, and the reason I mentioned that is 'cause I work with two veterans who are problem solvers. And I think that's what data scientists really are, in the long run are problem solvers, and you mentioned a great point that, yeah, I think just problem solving is the key. You don't have to be a subject matter expert, just be able to take the tools and intelligently use them. >> Now when you look at the whole notion of team data science, what is the right mix of roles, like role definitions within a high-quality or a high-preforming data science teams now IBM, with, of course, our announcement of project, data works and so forth. We're splitting the role division, in terms of data scientist versus data engineers versus application developer versus business analyst, is that the right breakdown of roles? 
Or what would the panelists recommend in terms of understanding what kind of roles make sense within, like I said, a high-performing team that's looking to develop applications that depend on data, machine learning, and so forth? Anybody want to? >> I'll tackle that. So the teams that I have created over the years, the data science teams that I brought into customer sites, have a combination of developer capabilities, and some of them are IT developers, but some of them were developers of things other than applications. They designed buildings, they did other things with their technical expertise besides building technology. The other piece besides the developer is the analytics, and analytics can be taught as long as they understand how algorithms work and the code behind the analytics, in other words, how are we analyzing things, and from a data science perspective, we are leveraging technology to do the analyzing through the tool sets, so ultimately as long as they understand how tool sets work, then we can train them on the tools. Having that analytic background is an important piece. >> Craig, is it easier to, I'll go to you in a moment Joe, is it easier to cross-train a data scientist to be an app developer than to cross-train an app developer to be a data scientist, or does it not matter? >> Yes. (laughs) And not the other way around. It depends on the-- >> It's easier to cross-train a data scientist to be an app developer than-- >> Yes. >> The other way around. Why is that? >> Developing code can be as difficult as the tool set one uses to develop code. Today's tool sets are very user friendly, whereas with developing code, it's very difficult to teach a person to think along the lines of developing code when they don't have any idea of the aspects of code, of building something. >> I think it was Joe, or you next, or Jennifer, who was it? >> I would say that one of the reasons for that is data scientists will probably know if the answer's right after they process the data, whereas a data engineer might be able to manipulate the data but may not know if the answer's correct. So I think that is one of the reasons why having a data scientist learn the application development skills might be an easier time than the other way around. >> I think Miriam had a comment? Sorry. >> I think that what we're advising our clients to do is to not think of it that way. Before data science and before analytics became so required by companies to stay competitive, it was more of a waterfall: you have a data engineer build a solution, you know, then you throw it over the fence and the business analyst would have at it, where now, it must be agile, and you must have a scrum team where you have the data scientist and the data engineer and the project manager and the product owner and someone from the chief data office all at the table at the same time and all accomplishing the same goal. Because all of these skills are required, collectively, in order to solve this problem, and it can't be done daisy-chained anymore, it has to be a collaboration. And that's why I think Spark is so awesome, because you know, Spark is a single interface that a data engineer can use, a data analyst can use, and a data scientist can use. And now with what we've learned today, having a data catalog on top so that the chief data office can actually manage it, I think is really going to take Spark to the next level. >> James: Miriam?
>> I wanted to comment on your question to Craig about whether it's harder to teach a data scientist to build an application or vice versa, and one of the things that we have worked on a lot in our data science team is incorporating a lot of best practices from software development, agile, scrum, that sort of thing, and I think particularly with a focus on deploying models that we don't just want to build an interesting data science model, we want to deploy it, and get some value. You need to really incorporate these processes from someone who might know how to build applications, and that, I think, for some data scientists can be a challenge, because one of the fun things about data science is you get to get into the data, and you get your hands dirty, and you build a model, and you get to try all these cool things, but then when the time comes for you to actually deploy something, you need deployment-grade code in order to make sure it can go into production at your client's site and be useful, for instance, so I think that there's an interesting challenge on both ends, but one of the things I've definitely noticed with some of our data scientists is it's very hard to get them to think in that mindset, which is why you have a team of people, because everyone has different skills and you can mitigate that. >> Dev-ops for data science? >> Yeah, exactly. We call it insight ops, but yeah, I hear what you're saying. Data science is becoming increasingly an operational function as opposed to strictly exploratory or developmental. Did someone else have a, Dez? >> One of the things I was going to mention, one of the things I like to do when someone gives me a new problem is take all the laptops and phones away. And we just end up in a room with a whiteboard. And developers find that challenging sometimes, so I had this one line where I said to them, don't write the first line of code until you actually understand the problem you're trying to solve, right? And I think where the data science focus has changed the game for organizations who are trying to get some systematic repeatable process that they can throw data at and just keep getting answers and things, no matter what the industry might be, is that developers will come with a particular mindset on how they're going to codify something without necessarily getting the full spectrum and understanding the problem in the first place. What I'm finding is the people that come at data science tend to have more of a hacker ethic. They want to hack the problem, they want to understand the challenge, and they want to be able to get it down to plain English, simple phrases, and then apply some algorithms and then build models, and then codify it, and so most of the time we sit in a room with whiteboard markers just trying to build a model in a graphical sense and make sure it's going to work and that it's going to flow, and once we can do that, we can codify it. I think when you come at it from the other angle, from the developer ethic, and you're like, I'm just going to codify this from day one, I'm going to write code. I'm going to hack this thing out and it's just going to run and compile. Often, you don't truly understand what you're trying to get to at the end point, and you can just spend days writing code, and I think someone made the comment that sometimes you don't actually know whether the output is actually accurate in the first place. So I think there's a lot of value being provided from the data science practice
of understanding the problem in plain English at a team level: so what am I trying to do from the business consulting point of view? What are the requirements? How do I build this model? How do I test the model? How do I run a sample set through it? Train the thing and then make sure what I'm going to codify actually makes sense in the first place, because otherwise, what are you trying to solve in the first place? >> Wasn't that Einstein who said if I had an hour to solve a problem, I'd spend 55 minutes understanding the problem and five minutes on the solution, right? It's exactly what you're talking about. >> Well I think, I will say, getting back to the question, the thing with building these teams that I think a lot of times people don't talk about is that engineers are actually very, very important for data science projects and data science problems. For instance, if you were just trying to prototype something or just come up with a model, then data science teams are great, however, if you need to actually put that into production, that code that the data scientist has written may not be optimal, so as we scale out, it may be actually very inefficient. At that point, you kind of want an engineer to step in and actually optimize that code, so I think it depends on what you're building and that kind of dictates what kind of division you want among your teammates, but I do think that a lot of times, the engineering component is really undervalued out there. >> Jennifer, it seems that the data engineering function, data discovery and preparation and so forth, is becoming automated to a greater degree, but if I'm listening to you, I don't hear that data engineering as a discipline is becoming extinct in terms of a role that people can be hired into. You're saying that there's a strong ongoing need for data engineers to optimize the entire pipeline to deliver the fruits of data science in production applications, is that correct? So they play that very much operational role as the backbone for... >> So I think a lot of times businesses will go to data scientists to build a better model, to build a predictive model, but that model may not be something that you really want to implement out there when there's like a million users coming to your website, 'cause it may not be efficient, it may take a very long time, so I think in that sense, it is important to have good engineers, and your whole product may fail, you may build the best model, it may have the best output, but if you can't actually implement it, then really what good is it? >> What about calibrating these models? How do you go about doing that and sort of testing that in the real world? Has that changed over time? Or is it... >> So one of the things that I think can happen, and we found with one of our clients, is when you build a model, you do it with the data that you have, and you try to use a very robust cross-validation process to make sure that it's robust and it's sturdy, but one thing that can sometimes happen is after you put your model into production, there can be external factors, societal or whatever, things that have nothing to do with the data that you have or the quality of the data or the quality of the model, which can actually erode the model's performance over time. So as an example, we think about cell phone contracts, right?
Those have changed a lot over the years, so maybe five years ago, the type of data plan you had might not be the same as it is today, because a totally different type of plan is offered, so if you're building a model on that to, say, predict who's going to leave and go to a different cell phone carrier, the validity of your model over time is going to completely degrade based on nothing that you have, that you put into the model, or the data that was available, so I think you need to have this sort of model management and monitoring process to take these factors into account and then know when it's time to do a refresh. >> Cross-validation, even at one point in time, for example, there was an article in the New York Times recently where they gave the same data set to five different data scientists, this is survey data for the presidential election that's upcoming, and five different data scientists came to five different predictions. They were all high-quality data scientists, the cross-validation showed a wide variation about who was on top, whether it was Hillary or whether it was Trump, so that shows you that even at any point in time, cross-validation is essential to understand how robust the predictions might be. Does somebody else have a comment? Joe? >> I just want to say that this even drives home the importance of having the scrum team for each project and having the engineer and the data scientist, data engineer and data scientist, working side by side, because it is important that whatever we're building, we assume will eventually go into production, and in the data warehousing world we used to have it that you'd get the data out of the systems, out of your applications, you do analysis on your data, and the nirvana was maybe that data would go back to the system, but typically it didn't. Nowadays, the applications are dependent on the insight coming from the data science team. The behavior of the application and the personalization and individual experience for a customer are highly dependent on it, so it has to be, you said is data science part of the dev-ops team, absolutely now, it has to be. >> Whose job is it to figure out the way in which the data is presented to the business? Where's the sort of presentation, the visualization plan, is that the data scientist's role? Does that depend on whether or not you have that gene? Do you need a UI person on your team? Where does that fit? >> Wow, good question. >> Well usually that's the output, I mean, once you get to the point where you're visualizing the data, you've created an algorithm or some sort of code that produces that to be visualized, so at the end of the day the customers can see what all the fuss is about from a data science perspective. But it's usually post the data science component. >> So do you run into situations where you can see it and it's blatantly obvious, but it doesn't necessarily translate to the business? >> Well there's an interesting challenge with data, and we throw the word data around a lot, and I've got this fun line I like throwing out there. If you torture data long enough, it will talk. So the challenge then is to figure out when to stop torturing it, right? And it's the same with models, and so I think in many other parts of organizations, we'll take something, if someone's doing a financial report on performance of the organization and they're doing it in a spreadsheet, they'll get two or three peers to review it, and validate that they've come up with a working model and the answer actually makes sense.
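To make the cross-validation and model-monitoring points above concrete, here is a minimal Python sketch of the kind of process described. It is not anything the panelists presented: the churn-style data is synthetic, and the tolerance threshold is an arbitrary assumption for illustration only.

```python
# A rough sketch of the cross-validation and drift-monitoring ideas above.
# The data is synthetic and the tolerance value is an arbitrary assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Cross-validation gives a spread of scores, not a single number; the spread
# is what tells you how robust the estimate is at this one point in time.
scores = cross_val_score(model, X, y, cv=5)
baseline = scores.mean()
print(f"cross-validated accuracy: {baseline:.3f} +/- {scores.std():.3f}")

model.fit(X, y)

def check_for_refresh(model, X_window, y_window, baseline, tolerance=0.05):
    """Compare accuracy on a recent window of labelled production data
    with the cross-validated baseline and flag when it has eroded."""
    current = model.score(X_window, y_window)
    if current < baseline - tolerance:
        return f"refresh needed: {current:.3f} vs baseline {baseline:.3f}"
    return f"model still healthy: {current:.3f}"

# Simulate a later production window whose inputs have drifted
# (think new kinds of phone plans) while the target stays the same.
rng = np.random.default_rng(1)
X_window = X[:1000] + rng.normal(loc=0.8, scale=0.5, size=(1000, 20))
y_window = y[:1000]
print(check_for_refresh(model, X_window, y_window, baseline))
```

In practice the window would be whatever labelled outcomes arrive after deployment, and a flagged erosion would trigger retraining rather than a print statement.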
And I think we're rushing so quickly at doing analysis on data that comes to us in various formats and high velocity that I think it's very important for us to actually stop and do peer reviews of the models and the data and the output as well, because otherwise we start making decisions very quickly about things that may or may not be true. It's very easy to get the data to paint any picture you want, and you gave the example of the five different attempts at that thing, and I had this shoot-out thing as well where I'll take in a team, I'll get two different people to do exactly the same thing in completely different rooms, and come back and challenge each other, and it's quite amazing to see the looks on their faces when they're like, oh, I didn't see that, and then go back and do it again, and just keep iterating until we get to the point where they both get the same outcome, in fact there's a really interesting anecdote about when the UNIX operating system was being written, and a couple of the authors went away and wrote the same program without realizing the other was doing it, and when they came back, they actually had, line for line, the same piece of C code, 'cause they'd actually gotten to a truth. A perfect version of that program, and I think we need to often look at, when we're building models and playing with data, if we can't come at it from different angles, and get the same answer, then maybe the answer isn't quite true yet, so there's a lot of risk in that. And it's the same with presentation, you know, you can paint any picture you want with the dashboard, but who's actually validating whether the dashboard's painting the correct picture? >> James: Go ahead, please. >> There is a science, actually, behind data visualization, you know, if you're doing trending, it's a line graph, if you're doing comparative analysis, it's a bar graph, if you're doing percentages, it's a pie chart, like there is a certain science to it, it's not as much of a mystery as the novice thinks, but what makes it challenging is that you also, just like any presentation, you have to consider your audience. And your audience, whenever we're delivering a solution, either insight, or just data in a grid, we really have to consider who is the consumer of this data, and actually cater the visual to that person or to that particular audience. And that is part of the art, and that is what makes a great data scientist. >> The consumer may in fact be the source of the data itself, like in a mobile app, so you're tuning their visualization and then their behavior is changing as a result, and then the data on their changed behavior comes back, so it can be a circular process. >> So Jim, at a recent conference, you were tweeting about the citizen data scientist, and you got emasculated by-- >> I spoke there too. >> Okay. >> TWI on that same topic, I got-- >> Kirk Borne I hear came after you. >> Kirk meant-- >> Called foul, flag on the play. >> Kirk meant well. I love Claudia Emahoff too, but yeah, it's a controversial topic. >> So I wonder what our panel thinks of that notion, citizen data scientist. >> Can I respond about citizen data scientists? >> David: Yeah, please. >> I think this term was introduced by a Gartner analyst in 2015, and I think it's a very dangerous and misleading term.
I think definitely we want to democratize the data and have access to more people, not just data scientists, but managers, BI analysts, but there is already a term for such people, we can call them business analysts, because it implies some training, some understanding of the data. If you use the term citizen data scientist, it implies that without any training you take some data and then you find something there, and I think, as Dez mentioned, we've seen many examples, it's very easy to find completely spurious random correlations in data. So we don't want citizen dentists to treat our teeth or citizen pilots to fly planes, and if data's important, having citizen data scientists is equally dangerous, so I'm hoping that, I think actually Gartner did not use the term citizen data scientist in their 2016 Hype Cycle, so hopefully they will put this term to rest. >> So Gregory, you apparently are defining citizen to mean incompetent as opposed to simply self-starting. >> Well self-starting is very different, but that's not what I think the intention was. I think what we see in terms of data democratization, there is a big trend toward automation. There are many tools, for example there are many companies like Data Robot, and probably IBM, that have interesting machine learning capabilities towards automation, so I think I recently started a page on KDnuggets for automated data science solutions, and there are already 20 different firms that provide different levels of automation. So one can deliver maybe some expertise in full automation, but it's very dangerous to have part of an automated tool and at some point then ask citizen data scientists to try to take the wheel. >> I want to chime in on that. >> David: Yeah, pile on. >> I totally agree with all of that. I think the comment I just want to quickly put out there is that the space we're in is a very young and rapidly changing world, and so what we haven't had yet is this time to stop and take a deep breath and actually define ourselves, so if you look at computer science in general, a lot of the traditional roles have sort of had 10 or 20 years of history, and so through the hiring process, and the development of those spaces, we've actually had time to breathe and define what those jobs are, so we know what a systems programmer is, and we know what a database administrator is, but we haven't yet had a chance as a community to stop and breathe and say, well what do we think these roles are, and so to fill that void, the media creates coinages, and I think this is the risk we've got now, that the concept of a data scientist was just a term that was coined to fill a void, because no one quite knew what to call somebody who didn't come from a data science background if they were tinkering around data science, and I think that's something that we need to sort of sit up and pay attention to, because if we don't own that and drive it ourselves, then somebody else is going to fill the void and they'll create these very frustrating concepts like data scientist, which drives us all crazy. >> James: Miriam's next. >> So I wanted to comment, I agree with both of the previous comments, but in terms of a citizen data scientist, and I think whether or not you're a citizen data scientist or an actual data scientist, whatever that means, I think one of the most important things you can have is a sense of skepticism, right? Because you can get spurious correlations and it's like wow, my predictive model is so excellent, you know?
And being aware of things like leaks from the future, right? This actually isn't predictive at all, it's a result of the thing I'm trying to predict, and so I think one thing I know we try and do is if something really looks too good, we need to go back in and make sure, did we not look at the data correctly? Is something missing? Did we have a problem with the ETL? And so I think that a healthy sense of skepticism is important to make sure that you're not taking a spurious correlation and trying to derive some significant meaning from it. >> I think there's a Dilbert cartoon that I saw that described that very well. Joe, did you have a comment? >> I think that in order for citizen data scientists to really exist, I think we do need to have more maturity in the tools that they would use. My vision is that the BI tools of today are all going to be replaced with natural language processing and searching, you know, just be able to open up a search bar and say, give me sales by region, and to take that one step further into the future, you should actually be able to ask, what are my sales going to be next year? And it should trigger a simple linear regression, or be able to say which features of the televisions are actually affecting sales and do a clustering algorithm, you know, I think hopefully that will be the future, but I don't see anything of that today, and I think in order to have a true citizen data scientist, you would need to have that, and that is pretty sophisticated stuff. >> I think for me, the idea of a citizen data scientist, I can relate to that, for instance, when I was in graduate school, I started doing some research on FDA data. It was an open-source data set of about 4.2 million data points. Technically when I graduated, the paper was still not published, and so in some sense, you could think of me as a citizen data scientist, right? I wasn't getting funding, I wasn't doing it for school, but I was still continuing my research, so I'd like to hope that with all the new data sources out there, there might be scientists or people who are maybe kept out of a field, people who wanted to be in STEM and for whatever life circumstance couldn't be in it, that they might be encouraged to actually go and look into the data and maybe build better models or validate information that's out there. >> So Justin, I'm sorry, you had one comment? >> It seems data science was termed before academia adopted formalized training for data science. But yeah, you can make, like Dez said, you can make data work for whatever problem you're trying to solve, whatever answer you see, you want data to work around it, you can make it happen. And I kind of consider that like in project management, like data creep, so you're so hyper-focused on a solution, you're trying so hard to find the answer, that you create an answer that works for that solution, but it may not be the correct answer, and I think the crossover discussion works well for that case. >> So but the term comes up 'cause there's a frustration I guess, right? That data science skills are not plentiful, and it's potentially a bottleneck in an organization. Supposedly 80% of your time is spent on cleaning data, is that right? Is that fair? So there's a problem. How much of that can be automated and when? >> I'll have a shot at that.
So I think there's a shift that's going to come about where we're going to move from centralized data sets to data at the edge of the network, and this is something that's happening very quickly now where we can't just haul everything back to a central spot when the internet of things actually wakes up. Things like the Boeing 787 Dreamliner, that thing's got 6,000 sensors in it and produces half a terabyte of data per flight. There are 87,400 flights per day in domestic airspace in the U.S. That's 43.5 petabytes of raw data, now that's about three years' worth of disk manufacturing in total, right? We're never going to copy that across to one place, we can't process it, so I think the challenge we've got ahead of us is looking at how we're going to move the intelligence and the analytics to the edge of the network and pre-cook the data in different tiers, so have a look at the raw material we get, and boil it down to a slightly smaller data set, bring a metadata version of that back, and eventually get to the point where we've only got the very minimum data set and data points we need to make key decisions. Without that, we're already at the point where we have too much data, and we can't munch it fast enough, and we can't spin up enough tin even if we switch the cloud on, and that's just this never-ending deluge of noise, right? And you've got that signal versus noise problem, so then we're now seeing a shift where people are looking at how do we move the intelligence back to the edge of the network, which we actually solved some time ago in the security space. You know, spam filtering, if an email hits Google on the west coast of the U.S. and they create a checksum for that spam email, it immediately goes into a database, and nothing gets through on the opposite coast, because they already know it's spam. They recognize that email coming in, that's evil, stop it. So we've already fixed it in security with intrusion detection, we've fixed it in spam, so we now need to take that learning, and bring it into business analytics, if you like, and see where we're finding patterns and behavior, and push that out to the edge of the network, so if I'm seeing a demand over here for tickets on a new sale of a show, I need to be able to see where else I'm going to see that demand and start responding to that before the demand comes about. I think that's a shift that we're going to see quickly, because we'll never keep up with the data munching challenge and the volume's just going to explode. >> David: We just have a couple minutes. >> That does sound like a great topic for a future Cube panel, which is data science on the edge of the fog. >> I got a hundred questions around that. So we're wrapping up here. Just got a couple minutes. Final thoughts on this conversation or any other pieces that you want to punctuate? >> I think one thing that's been really interesting for me being on this panel is hearing all of my co-panelists talking about common themes and things that we are also experiencing, which isn't a surprise, but it's interesting to hear about how ubiquitous some of the challenges are, and also at the announcement earlier today, some of the things that they're talking about and thinking about, we're also talking about and thinking about. So I think it's great to hear we're all in different countries and different places, but we're experiencing a lot of the same challenges, and I think that's been really interesting for me to hear about. >> David: Great, anybody else, final thoughts?
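For readers who want to sanity-check the edge-data figures quoted above, a quick back-of-the-envelope calculation follows. The per-flight volume and the daily flight count are taken from the speaker's remarks as stated, not independently verified.

```python
# Back-of-the-envelope check of the edge-data figures quoted above.
flights_per_day = 87_400   # domestic U.S. flights per day, as quoted
tb_per_flight = 0.5        # half a terabyte per 787 flight, as quoted

petabytes_per_day = flights_per_day * tb_per_flight / 1_000
print(f"{petabytes_per_day:.1f} PB of raw data per day")
# Prints 43.7 PB, in the same ballpark as the 43.5 petabytes mentioned.
```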
>> To echo Dez's thoughts, we're never going to catch up with the amount of data that's produced, so it's about transforming big data into smart data. >> I could just say that with the shift from normal data, small data, to big data, the answer is automate, automate, automate, and we've been talking about advanced algorithms and machine learning for the science, for changing the business, but there also needs to be machine learning and advanced algorithms for the backroom, where we're actually getting smarter about how we ingest and how we fix data as it comes in. Because we can actually train the machines to understand data anomalies and what we want to do with them over time. And I think the further upstream we get with data correction, the less work there will be downstream. And I also think that the concept of being able to fix data at the source is gone, that's behind us. Right now the data that we're using to analyze and change the business, typically we have no control over. Like Dez said, it's coming from sensors and machines and the internet of things, and if it's wrong, it's always going to be wrong, so we have to figure out how to do that in our laboratory. >> Eaves, final thoughts? >> I think it's a mind shift being a data scientist. If you look back at the time, why did you start developing or writing code? Because you like to code, whatever, just for the sake of building a nice algorithm or a piece of software, or whatever, and now I think with the spirit of a data scientist, you're looking at a problem and saying, this is where I want to go, so you have more of a top-down approach than a bottom-up approach. And you have the big picture, and that is what you really need as a data scientist: just look across technologies, look across departments, look across everything, and then on top of that, try to apply as many skills as you have available, and that's the kind of unicorn that they're trying to look for, because it's pretty hard to find people with that wide vision on everything that is happening within the company, so you need to be aware of technology, you need to be aware of how a business is run, and how it fits within a cultural environment, you have to work with people, and all those things together, to my belief, make it very difficult to find those good data scientists. >> Jim? Your final thought? >> My final thought is this is an awesome panel, and I'm so glad that you've come to New York, and I'm hoping that you all stay, of course, for the IBM Data First launch event that will take place this evening about a block over at Hudson Mercantile, so that's pretty much it. Thank you, I really learned a lot. >> I want to second Jim's thanks, really, great panel. Awesome expertise, really appreciate you taking the time, and thanks to the folks at IBM for putting this together. >> And I'm a big fan of most of you, all of you, on this session here, so it's great just to meet you in person, thank you. >> Okay, and I want to thank Jeff Frick for being a human curtain there with the sun setting here in New York City. Well thanks very much for watching, we are going to be across the street at the IBM announcement, we're going to be on the ground. We open up again tomorrow at 9:30 at Big Data NYC, Big Data Week, Strata plus Hadoop World, thanks for watching everybody, that's a wrap from here. This is the Cube, we're out. (techno music)