Daniel Heacock, Etix & Adam Haines, Federated Sample - AWS Re:Invent 2013 - #awsreinvent #theCUBE
>> Jeff: Hi everybody, we are live at AWS re:Invent in Las Vegas. I'm Jeff Kelly with Wikibon.org, and you're watching theCUBE, SiliconANGLE's premier live broadcast. We go out to the technology events and, as John Furrier likes to say, extract the signal from the noise. Being here at the AWS show, we're going to talk to a lot of AWS customers about what they're doing, in this case around analytics, data warehousing, and data integration. For this segment I'm joined by two customers: Daniel Heacock, senior business systems analyst with Etix, and Adam Haines, who's a data architect with Federated Sample. Welcome, guys, thanks for joining us on theCUBE.
>> Daniel: Thanks.
>> Jeff: Your first time, so we promise we'll make this as painless as possible. You guys have a couple of things in common - we were talking beforehand, and some of the workflows are similar. You're using Amazon Web Services' Redshift platform for data warehousing, you're using Attunity for some of the data integration to bring that in from your operational, transactional databases, and you're using a BI tool on top to tease out some of the insights from that data. But why don't we get started. Daniel, we'll start with you. Tell us a little bit about Etix, what you guys do, and then we'll get into the use cases and talk about how you use AWS, Attunity, and some of the other technologies.
>> Daniel: Sure, yeah. The company I work for is Etix. We are a primary-market ticketing company in the entertainment industry. We provide box office solutions to venues and venue owners for all types of events - casinos, fairs, festivals, pretty much you name it, we sell tickets in that industry. We provide a software solution that enables those venue owners to engage their customers and sell tickets.
>> Jeff: So kind of a competitor to something like Ticketmaster, the behemoth in the industry?
>> Daniel: Definitely. Ticketmaster would be the behemoth in the industry, and we consider ourselves a smaller, sexier version that's more friendly to the customer.
>> Jeff: Customer friendly, more agile.
>> Daniel: Absolutely.
>> Jeff: So Adam, tell us a little bit about Federated Sample.
>> Adam: Sure. Federated Sample is a technology company in the market research industry, and what we aim to do is add an exchange layer between buyers and sellers. We facilitate the transaction: when a buyer, a company like Coke, says, "Hey, we need to do a survey," we will negotiate pricing and route our respondents to their surveys, and try to make that a more seamless process so they don't have to go out and find survey respondents.
>> Jeff: Right, everything online.
>> Adam: Right, absolutely.
>> Jeff: Got it. So let's start with AWS. Obviously we're here at re:Invent, a big show, 9,000 people here. You guys talk about agile, talk about cloud enabling innovation. Adam, I'm going to start with you. What brought you to AWS? Are you using Redshift? I think you mentioned you're all in the cloud. Give us your impressions of the show, and of AWS, and what that's meant for your business.
>> Adam: Right, the show's been great so far. As to AWS, we were originally on-premise entirely, at a data center out in California, and it just didn't meet our rapid growth. We're a smaller company, a startup, so we couldn't handle the growth there. We needed something more elastic, more agile, so we ended up moving our entire infrastructure into Amazon Web Services. Then we found that we had a need to actually perform analytics on that data, and that's when we started the transition to Redshift.
>> Jeff: So the idea is you're moving data from your transactional system, which is also on AWS, into Redshift, using Attunity for that - their CloudBeam solution. Talk a little bit about that, and how that differs from some of the other integration methods you could have chosen.
>> Adam: Right. We started with a more conventional integration method, a homegrown solution, to move our data from our production SQL Server into Redshift, and it worked, but it was not optimal. It didn't have all the bells and whistles, and it was prone to bad management - not many people could configure it or knew how to use it. Then we saw CloudBeam from Attunity, and they offered a native solution using SQL Server replication that could tie into our native SQL Server and push that data directly into CloudBeam at a very fast rate.
>> Jeff: So moving that data from SQL Server, it's essentially real-time replication?
>> Adam: Yes.
>> Jeff: So it's moving that data into Redshift, so that when your analysts are doing their reporting, or doing some real ad hoc queries, they can be confident they've got the most up-to-date data from your SQL Server transactional system.
>> Adam: Right, yeah, nearly real time. And just to put it in perspective, the reports that we were running on our other system were taking 10 to 15 minutes to run. In Redshift we're running those same reports in one to two minutes.
>> Jeff: Right, and if you're running those reports that quickly - people sometimes forget, when you're talking about real-time or interactive queries and reporting, that it's only as good as the timeliness of the data you've got in that database. If you're trying to make real-time decisions and you've got a lag of, depending on the workload and your use case, even fifteen minutes to an hour, that might really impact your ability to make those decisions. So Daniel, talk a little bit about your use case. Is it a similar cloud-to-cloud architecture, or are you, unlike Adam, moving from on-premise?
>> Daniel: So we're actually working with an on-premise data center - it's an Oracle database - and we basically ran into two limitations: one regarding our current reporting infrastructure, and two, our business intelligence capabilities. As an analyst, I've been tasked with creating internal feedback loops within our organization, delivering certain types of KPIs and metrics to inform our different teams - our operations teams, our marketing teams. That's been one of the BI aims we've been able to achieve because of the replication and Redshift. The other is making our reporting more comprehensive. Now that we're using Redshift, we're able to run reports that we were previously not able to run on our on-premise transactional database. So really we're just embracing the power of Redshift, and it's enabling us in a lot of different ways.
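Neither guest shares an actual schema, but a hedged sketch of the pattern both describe - a replicated transactional table landing in Amazon Redshift with distribution and sort keys chosen for reporting - might look something like this (all table and column names are invented for illustration):

```sql
-- Hypothetical reporting table in Redshift, populated by replication from the
-- source system. DISTKEY spreads rows across nodes on a common join column;
-- SORTKEY lets range filters on sale_date skip disk blocks.
CREATE TABLE ticket_sales (
    sale_id      BIGINT,
    venue_id     INTEGER,
    customer_id  INTEGER,
    sale_date    DATE,
    amount       DECIMAL(10,2)
)
DISTKEY (venue_id)
SORTKEY (sale_date);

-- The kind of ad hoc reporting query described above, which runs in a minute
-- or two against a table laid out this way.
SELECT venue_id,
       DATE_TRUNC('month', sale_date) AS sale_month,
       COUNT(*)                       AS tickets_sold,
       SUM(amount)                    AS revenue
FROM ticket_sales
WHERE sale_date >= '2013-01-01'
GROUP BY venue_id, DATE_TRUNC('month', sale_date)
ORDER BY revenue DESC;
```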
>> Jeff: Yeah, we're hearing a lot about Redshift at the show. Amazon says it's the fastest-growing service AWS has had, from a revenue perspective, in its six- or seven-year history, so clearly there's a lot of power in that platform. It removes a lot of the concerns around having to manage that infrastructure, obviously, but the performance - when people have their own data centers and their own databases, tuning those for the type of performance you're looking for can be a challenge. Is that one of the drivers of your move to Redshift?
>> Daniel: Oh, for sure, the performance. I'm trying to think of a good example of a metric to compare, but it's basically enabled us to develop products that would not have been possible otherwise. The ability to crunch data, like you said, in a specific time frame is very important for reporting purposes, and if you're not able to meet a certain time frame, then a certain type of report is just not going to be useful. So it's opening the door for new types of products within our organization.
>> Jeff: Well, let's dig into that a little bit - the different data types we're talking about. At Etix, are you talking about customer transactions, profiles of different types of customers? Tell us about some of the data sources you're moving from your transactional system, which I think is an Oracle database, into Redshift, and then what are some of those analytic workloads? What kind of insights are you looking for?
>> Daniel: Sure. We're in the business of selling tickets - or I should say we're in the business of helping our customers sell tickets - so we're always trying to figure out ways to improve their marketing efforts, and marketing segmentation is one of the huge ones. Appending data from large data services in order to get customer demographic information is something that's easy to do in Redshift, so we're able to use that information - transaction information, customer information - to better engage our fans.
>> Jeff: And likewise, Adam, could you walk us through a use case - the types of data you're moving into Redshift with Attunity, and then what kind of analytics are you doing on top of that? What kind of insights are you gathering?
>> Adam: Right. Our data is a little bit different than ticketing, but what we ultimately capture is a respondent's answers to questions. We try to find the value in a particular set of answers so we can determine the quality of the supply that's sent from suppliers. If they say that a person meets a certain demographic, we can actually verify that that person meets that demographic, and then we can help them improve the supply that they push down to that respondent, and everybody makes more money because the completion rates go up. So overall it's business analysis on that type of information, so that we can help our customers and help ourselves.
>> Jeff: I wonder if we could talk a little bit about the BI layer on top as well. I think you're both using Jaspersoft, but beyond that, one of the topics we've been covering on theCUBE and on Wikibon is this whole analytics-for-all movement. We've been hearing about self-service business intelligence for twenty-plus years from some of the more incumbent vendors - Business Objects and Cognos and others - but if you look at a typical enterprise, business intelligence usage or adoption kind of stalls out at eighteen, twenty percent. Talk about how you've seen this industry evolve - maybe talk about Jaspersoft specifically - but what are some of the things that have to happen, or the types of tools that are needed, to really make business intelligence more consumable for analysts and business users, people who are not necessarily trained in statistics and aren't data scientists? Adam, we'll start with you.
>> Adam: Yes. One of the things we're doing with Jaspersoft is trying to figure out which of our offerings our customers want to use the most - we have APIs and we have traditional client-server applications - because we're trying to push everybody toward an API orientation. So we're putting that data into Redshift with Jaspersoft and flipping it around to look at it year to date, or over a period of time, to see where all of our money is coming from and where it's being driven from. And our business users are now empowered with Jaspersoft to do that themselves. They don't rely on us to pull data for them - they can tie right into Jaspersoft, grab the data they need for whatever period of time they want, and look at it in a nice, pretty chart.
>> Jeff: Is that a similar experience you're having at Etix?
>> Daniel: Definitely, and one of the things I should emphasize about our use of Jaspersoft - and really any BI tool you choose to use on the Amazon platform - is the ability to launch it almost immediately and be able to play with data within five to ten minutes of trying to launch it. It's pretty amazing how quickly things can go from just a thought into action.
>> Jeff: That's a good point, because you think about not just business intelligence but the whole data warehousing world. The traditional method is that the business user or a business unit goes to IT and says, here are the requirements, here are the metrics we want on these reports. IT then goes away and builds it, comes back six months later, twelve months later - here you go, here's the report - and the next thing you know, the business doesn't remember what they asked for, or it isn't necessarily going to serve their needs anymore. It's not a particularly useful model, and it sounds like Amazon really helps you shorten that time frame significantly, between what you can do with Redshift and some of their other database products and whatever BI tool you choose to use. Is that how you see this evolving?
>> Daniel: Oh, definitely, and the options - the kind of plug-and-play workflow - are pretty amazing. It's given our organization the flexibility to say, well, we can use this tool for now, and there's a chance we may decide there's something different in the future that we want to plug in in its place. We're confident that that product will be there whenever the need is there.
>> Jeff: Right, and that's the other thing - you can start to use a tool, and if it doesn't meet your need, you can stop using it and move to another tool. I think that puts vendors like Jaspersoft and others on their toes: they've got to continually innovate and make their product useful, otherwise they know that AWS customers can simply press a button, stop using it, press another button, and start using another tool. So I think it's good in that sense. But when you talk about cloud, and especially around data, you get questions about privacy and about data ownership. Who owns the data if it's in Amazon's cloud? It's your data, but it's in their data centers. How do you feel about that, Adam? Are there any concerns around either privacy or data ownership when it comes to using the cloud? You guys are all in, in the cloud.
>> Adam: Right, yeah. We've isolated a lot of our data into virtual private clouds, so with that segment of the network we feel much more comfortable putting our data in a public space, because we do feel like it's secure enough for our type of data. That was one of the major concerns up front, but after talking with Amazon and going through the whole process of migrating, we feel way more comfortable with it.
>> Jeff: Can you expand on that a little? So you've got a private instance, essentially, in Amazon's...
>> Adam: Right, we have a private subnet - a segmented piece of their network that's just for us. You can't access it publicly, only within our VPN client or within our infrastructure itself. So we're segmented away from everybody else.
>> Jeff: Interesting. So they offer that kind of service when there's more of a privacy concern or a security concern?
>> Adam: Definitely.
>> Jeff: And of course a lot depends on the type of data - how sensitive it is. Personally identifiable data is obviously going to be more sensitive than general market data that anyone could potentially access. Daniel, will you talk about your concerns around that? Did you have concerns? It's maybe more of a governance, people-and-process question than a technology question, I think.
>> Daniel: Well, it's definitely a technology question too, to a certain extent. As a transaction-based business, we are obviously very concerned with security, and our CTO is very adamant about that, so that was one of the first issues we addressed when we decided to go this route. Obviously AWS has taken all the precautions. We have a very similar setup to what Adam is describing as far as our security, and we are very confident that it is a very robust solution.
>> Jeff: So looking forward, how do you see your use of both the cloud and analytics evolving? One of the things we've been covering a lot is that as use cases get more complex, you've got to orchestrate more data flows and move data from more places. You mentioned you're using Attunity to do some of that replication from your transactional database into Redshift. What are some of the other data integration challenges you see yourselves facing as you get more complex deployments, more data, maybe more services on Amazon? How do you look to tackle some of those data integration challenges? Daniel, maybe you start.
>> Daniel: That's a good question. One of the things we're trying to do inside our organization is bring together data from all the different sources that we have. We use Salesforce for our sales team, we collect information from MailChimp and from our digital marketing agency, and we'd like to tie all that information together. That's something we're working on, and Attunity has been a great help there. Their product development, as far as their capabilities for bringing in information from other sources, is growing, so we're confident that the demand is there and that the product will develop as we move forward.
>> Jeff: Well, it's interesting that we've got you two gentlemen up here, one with an on-premise-to-cloud deployment and one all in the cloud, so clearly Attunity can bridge both of those - the on-premise-to-cloud role, but also working entirely within the cloud environment. Adam, talk a little bit about how you see this evolving as you get more complex. Are you looking to bring in more data sources, maybe even third-party or outside data sources? How do you look at this evolving?
>> Adam: Right. Presently we do have a MongoDB database, so we have other sources that we're working with now. There's talk of even trying to put that data in DynamoDB, which is an Amazon offering, and that ties directly into Redshift, so we could load that data directly into it, using those key-value pairs or however we want to use that type of data, as a data mart. But one of the things we're trying to work out right now is just distribution and, you know, being agile, elasticity - working through those issues with our growing database. Our database grows rather large each month, so working on scalability is our primary focus. For other data sources, we look into other database technologies that we can leverage in addition to SQL Server to help distribute that load.
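For reference, the direct DynamoDB-to-Redshift load Adam alludes to is typically done with Redshift's COPY command. A hedged sketch (table names and credentials are placeholders):

```sql
-- Load a DynamoDB table straight into a Redshift data mart table.
-- READRATIO caps how much of the DynamoDB table's provisioned read
-- throughput the COPY is allowed to consume.
COPY respondent_events
FROM 'dynamodb://respondent_events'
CREDENTIALS 'aws_access_key_id=<your-key>;aws_secret_access_key=<your-secret>'
READRATIO 50;
```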
>> Jeff: So we've got time for just one more question. I always like to ask, when we get customers and users on, if you can give some advice to other practitioners who are watching. If you could give one piece of advice to somebody who might be in your position - maybe they've got an on-premise data warehouse, or maybe they're just trying to figure out a way to make better use of their data - what would that one thing be? It could be a technology piece of advice - maybe look at something like Redshift or solutions like Attunity - or maybe it would be more of a cultural question around the use of data and making data-driven decisions. With that one piece of advice, Daniel, I'll put you on the spot.
>> Daniel: Okay. I would say don't try to do it yourself when the experts have done it for you. I couldn't put it any more simply than that.
>> Jeff: Very succinct, but very powerful. Adam?
>> Adam: For me, my biggest takeaway would be just Redshift. I was apprehensive to use it at first - I was so used to other technologies - but we can do so much with Redshift now at, you know, half the cost.
>> Jeff: That's pretty compelling. All right, fantastic. Well, Adam Haines, Daniel Heacock, thank you so much for joining us on theCUBE, appreciate it. We'll be right back with our next guests. We're live here at AWS re:Invent in Las Vegas. You're watching theCUBE.
**Summary and Sentiment Analysis are not shown because of an improper transcript**
ENTITIES
Entity | Category | Confidence |
---|---|---|
Daniel heacock | PERSON | 0.99+ |
Jeff Kelly | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Daniel Heacock | PERSON | 0.99+ |
Adam Haines | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
California | LOCATION | 0.99+ |
amazon | ORGANIZATION | 0.99+ |
Daniel heacock | PERSON | 0.99+ |
Daniel | PERSON | 0.99+ |
20-plus years | QUANTITY | 0.99+ |
Adam | PERSON | 0.99+ |
jaspersoft | ORGANIZATION | 0.99+ |
Adam Haines | PERSON | 0.99+ |
two customers | QUANTITY | 0.99+ |
15 minutes | QUANTITY | 0.99+ |
eighteen percent | QUANTITY | 0.99+ |
etix | TITLE | 0.99+ |
9,000 people | QUANTITY | 0.99+ |
Las Vegas | LOCATION | 0.99+ |
twenty percent | QUANTITY | 0.99+ |
Amazon Web Services | ORGANIZATION | 0.98+ |
first time | QUANTITY | 0.98+ |
Oracle | ORGANIZATION | 0.98+ |
10 15 minutes | QUANTITY | 0.98+ |
each month | QUANTITY | 0.97+ |
John Furrier | PERSON | 0.97+ |
one | QUANTITY | 0.97+ |
one more question | QUANTITY | 0.96+ |
5-10 minutes | QUANTITY | 0.96+ |
coke | ORGANIZATION | 0.96+ |
two gentlemen | QUANTITY | 0.96+ |
Ticketmaster | ORGANIZATION | 0.95+ |
President | PERSON | 0.95+ |
both | QUANTITY | 0.95+ |
six months later | DATE | 0.94+ |
12 months later | DATE | 0.93+ |
six seven year | QUANTITY | 0.92+ |
eight integration | QUANTITY | 0.91+ |
12 minutes | QUANTITY | 0.89+ |
ticketmaster | ORGANIZATION | 0.88+ |
lot of our data | QUANTITY | 0.88+ |
agile | TITLE | 0.87+ |
first | QUANTITY | 0.85+ |
one piece | QUANTITY | 0.85+ |
red shift | TITLE | 0.85+ |
two limitations | QUANTITY | 0.85+ |
one piece of advice | QUANTITY | 0.82+ |
Jasper | TITLE | 0.8+ |
Wikibon | ORGANIZATION | 0.78+ |
minutes 1 | QUANTITY | 0.77+ |
lot | QUANTITY | 0.75+ |
first issues | QUANTITY | 0.74+ |
2021 AWSSQ2 054 AWS Mike Tarselli and Michelle Bradbury
>> Announcer: From theCUBE studios in Palo Alto and Boston, connecting with thought leaders all around the world. This is a CUBE Conversation. >> Hello. Welcome to today's session of the AWS Startup Showcase, The Next Big Thing in AI, Security & Life Sciences. Today featuring TetraScience for the life sciences track. I'm your host Natalie Erlich, and now we are joined by our special guests, Michelle Bradbury, VP of Product at TetraScience, as well as Mike Tarselli, the Chief Scientific Officer at TetraScience. We're going to talk about the R&D Data Cloud movement in life sciences, unlocking experimental data to accelerate discovery. Thank you both very much for joining us today. >> Thank you for having us. >> Yeah, thank you. Great to be here. >> Well, while traditionally slower to adopt cloud technology in R&D, global pharmas are now launching digital lab initiatives to improve time to market for therapeutics. Now, can you discuss some of the key challenges still facing big pharma in terms of digital transformation? >> Sure. I guess I'll start in. The big pharma sort of organization that we have today happens to work very well in its particular way, i.e., they have some architecture they've installed, usually on-premises. They are sort of tentatively sticking their foot into the cloud. They're learning how to move forward into that, and in order to process and automate their data streams. However, we would argue they haven't done enough fast enough and that they need to get there faster in order to deliver patient value and efficiencies to their businesses. >> Well, how specifically, now for Michelle, can R&D Data Cloud help big pharma in this digital transformation? >> So the big thing that large pharmas face is a couple different things. So the ecosystem within large pharma is a lot of diverse data types, a lot of diverse file types. So that's one thing that the data cloud handles very well to be able to parse through, harmonize, and bring together your data so that it can be leveraged for things like AI and machine learning at large-scale, which is sort of the other part where I think one of the large sort of challenges that pharma faces is sort of a proliferation of data. And what cloud offers, specifically, is a better way to store, more scalable storage, better ability to even tier your storage while still making it searchable, maintainable, and offer a lot of flexibility to the actual pharma companies. >> And what about security and compliance, or even governance? What are those implications? >> Sure. I'll jump into that one. So security and compliance, every large pharma is a regulated industry. Everyone watching this probably is aware of that. And so we therefore have to abide by the same tenets that they would. So 21 CFR Part 11 compliance, getting ready for GXP ready systems, And in fact, doing extra certifications around a SOC 2 Type 2, ISO 9001, really every single regulation that would allow our cloud solution to be quality, ready, inspectable, and really performant for what needs to be done for an eventual FDA submission. >> And can you also speak about some of the advances that we're seeing in machine learning and artificial intelligence, and how that will impact pharma, and what your role is in that at TetraScience? >> Sure. I'll pass this one to Michelle first. >> I was going to say I can take that one. 
So one of the things that we're seeing in terms of where AI and ML will go with large pharma is their ability to not only search and build models against the data that they have access to right now, which is very limited in the way they search, but the ability to go through the historical amount of data, the ability to leverage mass parallel compute on top of these giant data clusters, and what that means in terms of not only faster time to market for drugs, but also, I think, more accurate and precise testing coming in the future. So I think there's so much opportunity for this really data-rich vertical and industry to leverage in a lot of the modern tooling that it hasn't been able to leverage so far. >> And Mike, what would you say are the benefits that a fully automated lab could bring with increased fairness and data liquidity? >> Yeah, sure. Let's go five years into the future. I am a bench chemist, and I'm trying to get some results in, and it's amazing because I can look up everything the rest of my colleagues have ever done on this particular project with a single click of a button in a simple term set in natural language. I can then find and retrieve those results, easily visualize them in our platform or in any other platform I choose to use. And then I can inspect those, interrogate those, and say, "Actually, I'm going to be able to set up this automation cascade." I'll probably have it ready by the afternoon. All the data that's returned to me through this is going to be easily integratable, harmonized, and you're going to be able to find it, obviously. You're going to interoperate it with any system, so if I suddenly decide that I need to send a report over to another division in their preferred vis tool or data system of choice, great! I click three buttons, configure it. Boom. There goes that report to them. This should be a simple vision to achieve even faster than five years. And that data liquidity that enables you to sort of pass results around outside of your division, and outside of even your sort of company or division, to other who are able to see it should be fairly easy to achieve if all that data is ingested the right way. >> Well, I'd love to ask this next question to both of you. What is your defining contribution to the future of cloud scale? >> Mike, you want to go first? >> (chuckles) I would love to. So right now the pharmaceutical and life sciences companies, they aren't seeing data increase linearly. They're seeing it increase exponentially, right? We are living in the exabyte era, and really have on the internet since about 2016. It's only going to get bigger, and it's going to get bigger in a power law, right? So you're going to see, as sequencing comes on, as larger form microscopy comes on, and as more and more companies are taking on more and more data about each individual sample, retaining that data for longer, doing more analytics of that data, and also doing personalized medicine, right, more data about a specific patient, or animal, or cell line. You're just going to see this absolute data explosion. And because of that, the only thing you can really do to keep up with that is be in the cloud. On-prem, you will be buying disk drives and out of physical materials before you're going to outstrip the data. Michelle? >> Yeah. And, I think, to go along with not just the data storage scale, I think the compute scale. Mike is absolutely right. We're seeing personalized drugs. 
We're seeing customers that want to, within a matter of three, four hours, get to a personalized drug for patients. And that kind of scale on a compute basis not just requires a ton of data, but requires mass compute ability to be able to get it right, right? And so it really becomes this marriage of getting a huge amount of data, and getting the mass compute to be able to really leverage that per patient. And then the one thing that... Sort of enabling that ecosystem to come centrally together across such a diverse dataset is sort of that driving force. If you can get the data together but you can't compute it, if you can compute it but you can't get it together, it all needs to come together. Otherwise it just doesn't work. >> Yeah. Well, on your website you have all these great case studies, and I'd love it if you could outline some of your success stories for us, some specific, concrete examples. >> Sure. I'll take one first, and then they'll pass to Michelle. One really great concrete example is we were able to take data format processing for a biotech that had basically previously had instruments sitting off in a corner that they could not connect, were integratable for a high throughput screening cascade. We were able to bring them online. We were able to get the datasets interpretable, and get literally their processing time for these screens from the order of weeks to the order of minutes. So they could basically be doing probably a couple hundred more screens per year than they could have otherwise. Michelle? >> We have one customer that is in the process of automating their entire lab, even using robotics arms. So it's a huge mix of being able to ingest IoT data, send experiment data to them, understand sampling, getting the results back, and really automating that whole process, which when they even walked me through it, I was like, "Wow," and I'm like, "so cool." (chuckles) And there's a lot of... I think a lot of pharma companies want, and life science companies, want to move forward in innovation and do really creative and cool things for patients. But at the end of it, you sort of have to also realize it's like their core competency is focusing on drugs, and getting that to market, and making patients better. And we're just one part of that, really helping to enable that process and that ecosystem come to life, so it's really cool to watch. >> Right, right. And I mean, in this last year we've seen how critical the healthcare sector is to people all over the world. Now, looking forward, what do you anticipate some of the big innovations in the sector will be in the next five years, and where do you see TetraScience's role in that? >> So I think some of the larger innovations are... Mike mentioned one of them already. It's going to be sort of the personalized drugs the personalized health care. I think it is absolutely going to go to full lab automation to some degree, because who knows when the next pandemic will hit, right? And we're all going to have to go home, right? I think the days of trying to move around data manually and trying to work through that is just... If we don't plan for that to be a thing of the past, I think we're all going to do ourselves a disservice. So I think you'll see more automation. I think you'll see more personalization, and you'll see more things that leverage larger amounts of data. I think where we hope to sit is really at the ecosystem enablement part of that. We want to remain open. That's one of the cornerstones. 
We're not a single partner platform. We're not tied to any vendors. We really want to become that central aid and the ecosystem enabler for the labs. >> Yeah, to that point- >> And I'd also love to get your insight. >> Oh! Sorry. (chuckles) Thank you. To that point, we're really trying to unlock discovery, right? Many other horizontal cloud players will do something like you can upload files, or you can do some massive compute, but they won't have the vertical expertise that we do, right? They won't have the actual deep life sciences dedication. We have several PhDs, postdocs, et cetera, on staff who have done this for a living and can do this going forward. So you're going to see the realization of something that was really exciting in sort of 2005, 2006, that is fully automated experimentation. So get a robot to about an experiment, design it, have a human operator assist with putting together all the automation, and then run that over and over again cyclically until you get the result you want. I don't think that the compute was ready for that at the time. I don't think that the resources were up to snuff, but now you can do it, and you can do it with any tool, instrument, technique you want, because to Michelle's point, we're a vendor-agnostic partner networked platform. So you can actually assemble this learning automation cascade and have it run in the background while you go home and sleep. >> Yeah, and we often hear about automation, but tell us a little bit more specifically what is the harmonizing effect of TetraScience? I mean, that's not something that we usually hear, so what's unique about that? >> You want to take that, or you want me to go? >> You go, please. (chuckles) >> All right. So, really, it's about... It's about normalizing and harmonizing the data. And what does that... What that means is that whether you're a chromatography machine from, let's say Waters, or another vendor, ideally you'd like to be able to leverage all of your chromatography data and do research across all of it. Most of our customers have machinery that is of same sort from different customers, or sorry, from different vendors. And so it's really the ability to bring that data together, and sometimes it's even diverse instrumentation. So if I track a molecule, or a project, or a sample through one piece, one set of instrumentation, and I want to see how it got impacted in another set of instrumentation, or what the results were, I'm able to quickly and easily be able to sort of leverage that harmonized data and come to those results quickly. Mike, I'm sure you have a- >> May I offer a metaphor from something outside of science? Hopefully that's not off par for this, but let's say you had a parking lot, right, filled with different kinds of cars. And let's say you said at the beginning of that parking lot, "No, I'm sorry. We only have space right here for a Ford Fusion 2019 black with leather interior and this kind of tires." That would be crazy. You would never put that kind of limitation on who could park in a parking lot. So why do specific proprietary data systems put that kind of limitation on how data can be processed? We want to make it so that any car, any kind of data, can be processed and considered together in that same parking lot. >> Fascinating. Well, thank you both so much for your insights. Really appreciate it. 
Wonderful to hear about R&D Data Cloud movement in big pharma, and that of course is Michelle Bradbury, VP of Product at TetraScience, as well as Mike Tarselli, the Chief Scientific Officer at TetraScience. Thanks again very much for your insights. I'm your host for theCUBE, Natalie Erlich. Catch us again for the next session of the AWS Startup Session. Thank you. (smooth music)
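A purely hypothetical sketch of the harmonization Michelle and Mike describe above - two vendors' chromatography exports, landed in their own staging tables, mapped onto one common schema so they can be queried together (every table, column, and unit here is invented for illustration):

```sql
-- Vendor-specific staging tables, as ingested from each instrument's export.
CREATE TABLE stage_vendor_a (sample_no VARCHAR, rt_min FLOAT, peak_area FLOAT);
CREATE TABLE stage_vendor_b (specimen_id VARCHAR, retention_sec FLOAT, area_counts FLOAT);

-- One harmonized table with agreed column names and units.
CREATE TABLE chromatography_harmonized (
    sample_id      VARCHAR,
    retention_min  FLOAT,
    peak_area      FLOAT,
    source_vendor  VARCHAR
);

-- Map each vendor's fields (and units) onto the common schema so downstream
-- analytics can treat all chromatography data the same way.
INSERT INTO chromatography_harmonized
SELECT sample_no, rt_min, peak_area, 'vendor_a' FROM stage_vendor_a
UNION ALL
SELECT specimen_id, retention_sec / 60.0, area_counts, 'vendor_b' FROM stage_vendor_b;
```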
SUMMARY :
Michelle Bradbury and Mike Tarselli of TetraScience discuss how an R&D Data Cloud helps pharma harmonize diverse instrument data, meet compliance requirements such as 21 CFR Part 11 and SOC 2, scale storage and compute for AI and machine learning, and move toward fully automated, vendor-agnostic labs.
ENTITIES
Entity | Category | Confidence |
---|---|---|
Natalie Erlich | PERSON | 0.99+ |
Michelle Bradbury | PERSON | 0.99+ |
Mike Tarselli | PERSON | 0.99+ |
Mike | PERSON | 0.99+ |
Michelle | PERSON | 0.99+ |
three | QUANTITY | 0.99+ |
TetraScience | ORGANIZATION | 0.99+ |
Palo Alto | LOCATION | 0.99+ |
2005 | DATE | 0.99+ |
Boston | LOCATION | 0.99+ |
2006 | DATE | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Today | DATE | 0.99+ |
both | QUANTITY | 0.99+ |
one piece | QUANTITY | 0.99+ |
one set | QUANTITY | 0.99+ |
pandemic | EVENT | 0.98+ |
One | QUANTITY | 0.98+ |
last year | DATE | 0.98+ |
one customer | QUANTITY | 0.98+ |
one | QUANTITY | 0.98+ |
five years | QUANTITY | 0.97+ |
one part | QUANTITY | 0.97+ |
four hours | QUANTITY | 0.97+ |
today | DATE | 0.97+ |
first | QUANTITY | 0.96+ |
Ford | ORGANIZATION | 0.96+ |
one thing | QUANTITY | 0.96+ |
Fusion 2019 | COMMERCIAL_ITEM | 0.96+ |
single | QUANTITY | 0.93+ |
SOC 2 Type | TITLE | 0.9+ |
ISO 9001 | TITLE | 0.9+ |
next five years | DATE | 0.89+ |
single partner platform | QUANTITY | 0.87+ |
21 CFR Part 11 | OTHER | 0.84+ |
single click | QUANTITY | 0.84+ |
GXP | ORGANIZATION | 0.83+ |
three buttons | QUANTITY | 0.8+ |
each individual sample | QUANTITY | 0.79+ |
theCUBE | ORGANIZATION | 0.78+ |
Startup Showcase | EVENT | 0.76+ |
a ton of data | QUANTITY | 0.76+ |
FDA | ORGANIZATION | 0.74+ |
couple hundred more screens | QUANTITY | 0.73+ |
2016 | DATE | 0.71+ |
Session | EVENT | 0.65+ |
Waters | ORGANIZATION | 0.62+ |
2021 | OTHER | 0.6+ |
instrumentation | QUANTITY | 0.56+ |
2 | OTHER | 0.49+ |
AWSSQ2 054 | OTHER | 0.4+ |
TetraScience | TITLE | 0.4+ |
UNLIST TILL 4/2 - Vertica Database Designer - Today and Tomorrow
>> Jeff: Hello everybody and thank you for joining us today for the Virtual VERTICA BDC 2020. Today's breakout session has been titled, "VERTICA Database Designer Today and Tomorrow." I'm Jeff Healey, Product VERTICA Marketing, I'll be your host for this breakout session. Joining me today is Yuanzhe Bei, Senior Technical Manager from VERTICA Engineering. But before we begin, (clearing throat) I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation. We'll answer as many questions, as we're able to during that time, any questions we don't address, we'll do our best to answer them offline. Alternatively, visit VERTICA forums at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forums, to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button at the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We will send you a notification as soon as it's ready. Now let's get started. Over to you Yuanzhe. >> Yuanzhe: Thanks Jeff. Hi everyone, my name is Yuanzhe Bei, I'm a Senior Technical Manager at VERTICA Server RND Group. I run the query optimizer, catalog and the disaggregated engine team. Very glad to be here today, to talk about, the "VERTICA Database Designer Today and Tomorrow". This presentation will be organized as the following; I will first refresh some knowledge about, VERTICA fundamentals such as Tables and Projections, which will bring to the question, "What is Database Designer?" and "Why we need this tool?". Then I will take you through a deep dive, into a Database Designer or we call DBD, and see how DBD's internals works, after that I'll show you some exciting DBD improvements, we have planned for 10.0 release and lastly, I will share with you, some DBD future roadmap we planned next. As most of you should already know, VERTICA is built on a columnar architecture. That means, data is stored column wise. Here we can see a very simple example, of table with four columns, and the many of you may also know, table in VERTICA is a virtual concept. It's just a logical representation of data, which means user can write SQL query, to reference the table names and column, just like other relational database management system, but the actual physical storage of data, is called Projection. A Projection can reference a subset, or all of the columns all to its anchor table, and must be sorted by at least one column. Each table need at least one C for projection which reference all the columns to the table. If you load data to a table with no projection, and automated, auto production will be created, which will be arbitrarily assorted by, the first couple of columns in the table. As you can imagine, even though such other production, can be used to answer any query, the performance is not optimized in most cases. A common practice in VERTICA, is to create multiple projections, contain difference step of column, and sorted in different ways on the same table. When query is sent to the server, the optimizer will pick the projection, that can answer the query in the most efficient way. 
For example, let's say you have a query that selects columns b, d, and c, sorted by b and d. The third projection will be ideal, because the data is already sorted, so you can save the sorting cost while executing the query. Basically, when you choose the design of a projection, you need to consider four things. First and foremost, of course, is the sort order. Data already sorted in the right way can benefit quite a lot of query operations, like Order By, Group By, analytics, merge join, predicates, and so on. The selected column group is also important, because the projection must contain all the columns referenced by your workload query; if even one column is missing from the projection, that projection cannot be used for that particular query. In addition, Vertica is a distributed database and allows a projection to be segmented based on the hash of a set of columns, which is beneficial if the segmentation matches the join keys or group keys. And finally, the encoding of each column is also part of the design, because data sorted in a different way may have a completely different optimal encoding for each column. This example only shows the benefit of the first two, but you can imagine the rest are also important. Even so, it doesn't sound that hard, right? Well, I hope you change your mind when you see this - at least I do. These machine-generated queries really beat me. It would probably take an experienced DBA hours to figure out which projections could benefit these queries, not to mention that there could be hundreds of such queries in regular workloads in the real world. So what can we do? That's why we need DBD. DBD is a tool integrated into the Vertica server that can help the DBA analyze their workload queries, table schemas, and data, and then automatically figure out the most optimized projection design for their workload. In addition, DBD is a sophisticated tool that can be customized by the user through a lot of parameters, objectives, and so on. And lastly, DBD has access to the optimizer, so DBD knows what kind of attributes a projection needs to have in order for the optimizer to benefit from it. DBD has been there for years, and I'm sure there are plenty of materials available online to show you how DBD can be used in different scenarios: whether to achieve query optimization or load optimization, whether it's a comprehensive design or an incremental design, whether you dump the deployment script and deploy manually later, or let DBD do the deployment for you, and many other options. I'm not planning to talk about that today. Instead, I will take the opportunity to open this black box called DBD and show you what exactly hides inside. DBD is a complex tool, and I have tried my best to summarize the DBD design process into seven steps: Extract, Permute, Prune, Build, Score, Identify, and Encode. What do they mean? Don't worry, I will show you step by step. The first step is Extract - extract interesting columns. In this step, DBD parses the design queries, figures out the operations that could be benefited by a potential projection design, and extracts the corresponding columns as interesting columns. Predicates, Group By, Order By, join conditions, and analytics are all interesting columns to the DBD. As you can see with these three simple sample queries, DBD extracts the interesting column sets on the right. Some of these column sets are unordered.
For example, for the green one, Group By a1 and b1, DBD extracts the interesting column set and puts it among the unordered sets, because data sorted either by a1 first or by b1 first can benefit this Group By operation. Some of the other sets are ordered, and the best example here is the Order By clause on a2 and b2 - obviously you cannot sort by b2 and then a2. These interesting column sets will be used as seeds to extend into actual projection sort order candidates. The next step is Permute. Once DBD has extracted all the seeds, it will enumerate sort orders using them. How does DBD do that? Let me start with a very simple example. Here you can see DBD can enumerate two sort orders by extending d1 with the unordered set {a1, b1}, deriving two sort order candidates: (d1, a1, b1) and (d1, b1, a1). These sort orders can benefit queries with a predicate on d1, and also benefit queries that Group By a1, b1 when d1 is constant. With the same idea, DBD will try to extend the other seeds with each other and populate more sort order permutations. You can imagine there could be many of these candidates, depending on how many queries you have in the design - there can be hundreds of sort order candidates. That brings us to the third step, which is Prune. This step limits the candidate sort orders so that the design won't run forever. DBD uses a very simple capping mechanism: all the candidates are ranked by length, and only a certain number of the sort orders with the longest length are moved forward to the next step. Now we have all the sort order candidates that we want to try, but to know whether a sort order candidate will actually be beneficial, DBD needs to ask the optimizer. Before that can happen, this step has to build those projection candidates in the catalog. So this step generates the projection DDLs around the sort orders and creates these projections in the catalog. These projections won't be loaded with real data, because that takes a lot of time; instead, DBD copies over the statistics from existing projections to these projection candidates so that the optimizer can use them. The next step is Score - scoring with the optimizer. Now that the projection candidates are built in the catalog, DBD can send the workload queries to the optimizer to generate query plans. When the optimizer returns a query plan, DBD goes through the plan and investigates whether certain benefits are being achieved. The benefits list has been growing over time as the optimizer adds more optimizations. Let's say in this case, because the projection candidate is sorted by b1 and a1, it is eligible for the Group By Pipelined benefit. Each benefit has a preset score. The overall benefit score across all design queries is aggregated and then recorded for each projection candidate. We are almost there. Now that we have the total benefit scores for the projection candidates derived from the workload queries, the job is easy: you just pick the sort order with the highest score as the winner. Here we have the winner (d1, b1, a1). Sometimes you need to find more winners, because the chosen winner may only benefit a subset of the workload queries you provided to the DBD. So in order for the rest of the queries to also benefit, you need more projections.
So in this case, DBD will go to the next iteration and, let's say, find another winner, (d1, c1), to benefit the workload queries that cannot be benefited by (d1, b1, a1). The number of iterations, and thus the winner outcome, really depends on the design objective the user sets. It can be load-optimized, which means only one superprojection winner will be selected; query-optimized, where DBD tries to create as many projections as needed to cover most of the workload queries; or a balanced objective somewhere in the middle. The last step is to decide the encoding for each projection column of the projection winners. Because the data is sorted differently, the encoding benefits can be very different from the existing projection, so choosing the right projection encoding design can reduce the disk footprint by a significant factor. It's worth the effort to find the best encoding. DBD picks the encoding by actually sampling the data and measuring the storage footprint. For example, in this case the projection winner has three columns, and say each column has a few encoding options. DBD writes the sample data in the way this projection is sorted, and then you can see that with different encodings, the disk footprint is different. DBD then compares the disk footprints of the different options for each column and picks the best encoding option - the one with the smallest storage footprint. Nothing magical here, but it just works pretty well. And that's basically how DBD works internally. Of course, I have simplified it quite a lot; for example, I didn't mention how DBD handles segmentation, but the idea is similar to how it analyzes the sort order. I hope this section gave you some basic idea about DBD as it is today. So now let's talk about tomorrow, and here comes the exciting part. In version 10.0 we significantly improved the DBD in many ways. In this talk I will highlight four issues in the old DBD and describe how the new DBD in 10.0 addresses them. The first issue is that the DBD API is too complex. In most situations, what the user really wants is very simple: "My queries were slow yesterday - would a new or different projection help speed them up?" However, to answer a simple question like this using DBD, the user will very likely have the documentation open on the side, because they have to go through the whole complex flow: creating a design, running the design, getting the outputs, and then deploying the design at the end. And for each step there are several functions the user needs to call in order. Adding these up, the user needs to write quite a long script with dozens of function calls. It's just too complicated, and most of you may find it annoying - users either manually tune the projections themselves, or simply live with the performance and come back when it gets really slow again, and of course in most situations they never come back to use the DBD. In 10.0, Vertica supports a new simplified API to run DBD easily. There is just one function, designer_single_run, and one argument: the interval over which you think your queries were slow. In this case the user complained about it yesterday, so all the user needs to do is specify one day as the argument and run it. The user doesn't need to provide anything else, because DBD will look up the query history within that time window and automatically populate the design, run the design, export the projection design, and clean up - no user intervention needed.
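As a rough sketch of the contrast being described - the exact function signatures vary by Vertica version and are abbreviated here, so treat this as illustrative rather than a copy-paste recipe - the old flow strings together several designer functions, while the 10.0 flow is a single call:

```sql
-- Pre-10.0 style: a multi-step Database Designer script (arguments trimmed).
SELECT DESIGNER_CREATE_DESIGN('my_design');
SELECT DESIGNER_ADD_DESIGN_TABLES('my_design', 'public.sales');
SELECT DESIGNER_ADD_DESIGN_QUERIES('my_design', '/home/dbadmin/queries.sql', TRUE);
SELECT DESIGNER_SET_DESIGN_TYPE('my_design', 'comprehensive');
SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('my_design',
       '/home/dbadmin/design.sql', '/home/dbadmin/deploy.sql');
SELECT DESIGNER_DROP_DESIGN('my_design');

-- 10.0 simplified API: one call, one argument -- the window in which
-- the slow queries ran (here, the last day).
SELECT DESIGNER_SINGLE_RUN('1 day');
```

Either way, the output is a set of CREATE PROJECTION statements along the lines of the ones sketched earlier.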
No need to have the documentation open on the side and carefully write and debug a script - just one function call. That's it. Very simple. That must be pretty impressive, right? Now here comes another issue. To fully utilize this single-run function, users are encouraged to run DBD on the production cluster. However, Vertica used to not recommend running a design on a production cluster. One of the reasons is that DBD takes massive locks - both table locks and catalog locks - which badly interfere with the running workload on a production cluster. As of 10.0, we eliminated all the table and catalog locks from DBD. Yes, we eliminated 100% of them - a simple improvement, a clear win. The third issue, which users may not be aware of, is that DBD writes intermediate results into real Vertica tables. The reason DBD has to do that is that DBD runs as a background task, so the intermediate results let users monitor the progress of the DBD from a concurrent session. For a complex design, the intermediate results can be quite massive, and as a result many ROS files are created and written to disk, which stresses both the catalog and the disk and can slow down the design. For Eon Mode it's even worse, because the tables are shared on communal storage, so writing to a regular table means the data has to be uploaded to communal storage, which is even more expensive and disruptive. In 10.0 we significantly restructured the intermediate results buffer and made it a shared in-memory data structure. Monitoring queries look up the in-memory data structure directly, through a system table, and return the results. No intermediate result files are written anymore. Another expensive use of local disk by DBD is the encoding design. As I mentioned earlier in the deep dive, to determine which encoding works best for the new projection design there's no magic way: DBD needs to actually write the sample data to disk using the different encoding options and find out which one has the smallest footprint, picking it as the best choice. This written sample data is useless afterwards and is wiped out right away, and you can imagine this is a huge waste of system resources. In 10.0 we improved this process: instead of writing the differently encoded data to disk and then reading the file sizes, DBD aggregates the data block sizes on the fly. The data blocks are not written to disk, so the overall encoding design is more efficient and non-disruptive. Of course, this is just the start. The reason we put a significant amount of resources into improving DBD in 10.0 is that the Vertica DBD is an essential component of the out-of-the-box performance design campaign. To simply illustrate the timeline: we are now on the second step, where we significantly reduced the running overhead of the DBD, so that users will no longer fear running DBD on their production cluster. Please note that as of 10.0 we haven't really started changing how the DBD design algorithm works, so what we discussed in the deep dive today still holds. For the next phase of DBD, we will make the design process smarter. This will include a better enumeration mechanism, so that the pruning is more intelligent rather than brute force, which will result in better design quality and also faster design. The longer-term goal is for DBD to achieve automation.
What does automation entail? What I really mean is that instead of having the user decide when to use DBD - waiting until their queries are slow - Vertica should detect this event itself, have DBD run automatically for users, and suggest a better projection design if the existing projections are not good enough. Of course, there is a lot of work that needs to be done before we can fully achieve that automation, but we are working on it. At the end of the day, what the user really wants is a fast database, right? Thank you for listening to my presentation; I hope you found it useful. Now let's get ready for the Q&A.
SUMMARY :
Yuanzhe Bei of Vertica Engineering reviews tables and projections, walks through the seven internal steps of the Vertica Database Designer (Extract, Permute, Prune, Build, Score, Identify, Encode), and previews the 10.0 improvements: a single-call designer_single_run API, removal of table and catalog locks, in-memory intermediate results, and on-the-fly encoding evaluation.
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jeff | PERSON | 0.99+ |
Yuanzhe Bei | PERSON | 0.99+ |
Jeff Healey | PERSON | 0.99+ |
100% | QUANTITY | 0.99+ |
forum.vertica.com | OTHER | 0.99+ |
one day | QUANTITY | 0.99+ |
second step | QUANTITY | 0.99+ |
third step | QUANTITY | 0.99+ |
tomorrow | DATE | 0.99+ |
third issue | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
First | QUANTITY | 0.99+ |
yesterday | DATE | 0.99+ |
Each benefit | QUANTITY | 0.99+ |
Today | DATE | 0.99+ |
third projection | QUANTITY | 0.99+ |
One | QUANTITY | 0.99+ |
b2 | OTHER | 0.99+ |
each column | QUANTITY | 0.99+ |
first issue | QUANTITY | 0.99+ |
one column | QUANTITY | 0.99+ |
three columns | QUANTITY | 0.99+ |
VERTICA Engineering | ORGANIZATION | 0.99+ |
Yuanzhe | PERSON | 0.99+ |
each step | QUANTITY | 0.98+ |
Each table | QUANTITY | 0.98+ |
first step | QUANTITY | 0.98+ |
DBD | TITLE | 0.98+ |
DBD | ORGANIZATION | 0.98+ |
seven steps | QUANTITY | 0.98+ |
DBL | ORGANIZATION | 0.98+ |
each | QUANTITY | 0.98+ |
one argument | QUANTITY | 0.98+ |
VERTICA | TITLE | 0.98+ |
each projection | QUANTITY | 0.97+ |
first two | QUANTITY | 0.97+ |
first | QUANTITY | 0.97+ |
this week | DATE | 0.97+ |
hundreds | QUANTITY | 0.97+ |
one function | QUANTITY | 0.97+ |
clause a2 | OTHER | 0.97+ |
one | QUANTITY | 0.97+ |
each per columns | QUANTITY | 0.96+ |
Tomorrow | DATE | 0.96+ |
both | QUANTITY | 0.96+ |
four issues | QUANTITY | 0.95+ |
VERTICA | ORGANIZATION | 0.95+ |
b1 | OTHER | 0.95+ |
single round | QUANTITY | 0.94+ |
4/2 | DATE | 0.94+ |
first couple of columns | QUANTITY | 0.92+ |
VERTICA Database Designer Today and Tomorrow | TITLE | 0.91+ |
Vertica | ORGANIZATION | 0.91+ |
10.0 | QUANTITY | 0.89+ |
one function call | QUANTITY | 0.89+ |
a1 | OTHER | 0.89+ |
four things | QUANTITY | 0.88+ |
c1 | OTHER | 0.87+ |
two sort order | QUANTITY | 0.85+ |
Model Management and Data Preparation - Vertica BDC 2020
>> Sue: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled Machine Learning with Vertica: Data Preparation and Model Management. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Waqas Dhillon. He's part of the Vertica Product Management Team. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait; just type your question or comment in the question box below the slides and click submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternately, you can visit the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides, and yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Waqas, over to you. >> Waqas: Thank you, Sue. Hi, everyone. My name is Waqas Dhillon and I'm a Product Manager here at Vertica. Today we're going to go through data preparation and model management in Vertica. The session starts with some introduction and then goes through some of the considerations when you're doing machine learning at scale. After that, we have two main sections. The first one is on data preparation: what data preparation is, what the Vertica functions for data exploration and data preparation are, and then an example. Similarly, in the second part of this talk we'll go through exporting and importing models using PMML and how that works with Vertica, and we'll share examples there as well. So yeah, let's dive right in. Vertica is essentially an open architecture with a rich ecosystem. You have a lot of options for data transformation and for ingesting data from different tools, and you also have options for connecting through ODBC, JDBC, and other connectors to BI and visualization tools; there are a lot of them that Vertica connects to. In the middle sits Vertica, which you can use with external tables or for in-place analytics, on cloud or on premises, so that choice is yours. Essentially it offers you a lot of options for performing your data analytics at scale, and within that, machine learning is a core component with a lot of options and functions. Now, machine learning in Vertica is built on top of the distributed architecture that Vertica's data analytics offers, so it inherits a lot of those capabilities and builds on top of them: you eliminate data transfer overhead when you're working with Vertica machine learning, you keep your data secure, and storing and managing the models is really easy and much more efficient.
You can serve a lot of concurrent users at the same time, it's really scalable, and it avoids the maintenance cost of a separate system, so there are a lot of benefits here. One important thing to mention is that all the algorithms you see, whether they're analytics functions, advanced analytics functions, or machine learning functions, are distributed not just across the cluster on different nodes; each node gets a share of the workload, and on each node there might be multiple threads and processes running each of these functions. So it's a highly distributed solution, and one of a kind in this space. When we talk about Vertica machine learning, it covers the whole machine learning process: we see it as starting with data ingestion, then data analysis and understanding, going through the steps of data preparation, modeling, evaluation, and finally deployment. When you're using Vertica for machine learning, it takes care of all these steps, and you can do all of them inside the Vertica database. Looking at the three main pillars that Vertica machine learning aims to build on, the first is to have Vertica as a platform for high-performance machine learning. We have a lot of functions for data exploration and preparation, and we'll go through some of them here; we have distributed in-database algorithms for model training and prediction; we have scalable functions for model evaluation; and finally we have distributed scoring functions as well. Doing all of this in the database is a really good thing, but we don't want it isolated in this space. We understand that a lot of our customers and users like to work with other tools alongside Vertica. They might use Vertica for data prep and another tool for model training, or use Vertica for model training and take those models out to other tools and do prediction there. So integration is a really important part of our overall offering, and it's a pretty flexible system. We have been offering UDx in four languages, which a lot of people have found useful over the past few years, but the new capability of importing PMML models for in-database scoring and exporting Vertica native models for external scoring is something we have recently added, and another talk will go through the TensorFlow integration, a really exciting and important milestone where you can bring TensorFlow models into Vertica for in-database scoring. For this talk, we'll focus on data exploration and preparation, importing and exporting PMML models, and finally, since Vertica is not just a query engine but also a data store, the really good capabilities it has for model storage and management as well. So, yeah. Let's dive into the first part, machine learning at scale. When we say machine learning at scale, there are a few really important considerations, and they have their own implications. The first is that we want speed, but we also want it to come at a reasonable cost, so it's really important to pick the right scaling architecture. Secondly, it's not easy to move big data around.
It might be easy to do that with a smaller data set, an Excel sheet, or something of the like, but once you're talking about big data and analytics at really big scale, it's not easy to move that data around from one tool to another, so what you want to do is bring the models to the data instead of moving the data to the tools. The third thing is that sub-sampling can actually compromise your accuracy. A lot of tools out there still force you to take smaller samples of your data because they can only handle so much, but that can impact your accuracy, and the need here is to be able to work with all of your data. We'll go through each of these really quickly. The first factor is scalability. If you want to scale your architecture, you have two main options. The first is vertical scaling: you have a machine, a server essentially, and you can keep adding resources like RAM and CPU and keep increasing the performance as well as the capacity of that system, but there's a limit to what you can do here, and you can hit that limit in terms of cost as well as in terms of technology; beyond a certain point you will not be able to scale any further. So the right approach is actually horizontal scaling, in which you keep adding more instances to get more computing power and more capacity. What you get with this architecture is essentially a supercomputer that stitches together several nodes, with the workload distributed across each of those nodes for massively parallel processing and really fast speeds as well. The second aspect, the difficulty of moving big data around, can be clarified with this example. What usually happens, and this is a simplified version, is that you have a lot of applications and tools from which you might be collecting data, and this data goes into an analytics database. That database in turn might be connected to some BI tools, dashboards, and applications, with some ad-hoc queries being run on the database. Then you want to do machine learning in this architecture. What usually happens is that the data coming into the analytics database is exported out to the machine learning tools. You train your models there, and afterwards, when you have new incoming data, that data again goes out to the machine learning tools for prediction. The results you get from those tools usually end up back in the distributed database, because you want to put them on a dashboard or power some applications with them. So there's a lot of data-movement overhead involved here, and there are downsides, including data governance, data movement, and other complications you need to resolve. One possible solution to overcome that difficulty is to have machine learning as part of the distributed analytical database as well, so you get the benefit of applying it to all of the data inside the database without having to worry about all of that data movement. If there are use cases where it still makes sense to train the models outside, you can do your data preparation in the database, take the prepared data out, build your model, and then bring the model back to the analytics database. In this case, we'll talk about Vertica.
So, the model would be hosted by Vertica, and you can keep applying predictions to the new data coming into the database. The third consideration for machine learning at scale is sampling versus the full data set. As I mentioned, a lot of tools cannot handle big data and you're forced to sub-sample, but as you can see in the leftmost figure, figure A, if you have a single data point, essentially any model can explain it. If you have more data points, as in figure B, a smaller number of models can explain them, and in figure C, with even more data points, fewer models still; but fewer also means these models would probably be more accurate. The objective of building machine learning models is mostly to have prediction and generalization capability on unseen data, so a model that's accurate on one data point may not generalize well at all. The conventional wisdom in machine learning is that the more data points you have for learning, the better and more accurate the models you'll get. So you need to pick a tool that can handle all of your data and doesn't force you to sub-sample it, and with all the data, even a simpler model might do much better than a more complex one. So, yeah. Let's go to the data exploration and data preparation part. Vertica is a really powerful tool that offers a lot of scalability in this space and, as I mentioned, supports the whole process: you can define the problem, gather your data and construct your data set inside Vertica, and carry it through preparation, training, modeling, deployment, and managing the model. Data preparation is a really critical step in the overall machine learning process; some estimate it takes between 60 and 80% of the overall effort. So there are a lot of functions here. You can use Vertica for data exploration, de-duplication, outlier detection, balancing, normalization, and potentially a lot more; you can go to the Vertica documentation and find them there. Within data prep we divide them into two parts: exploration functions and transformation functions. Within exploration, you have a rich set of functions you can use in-database, and if you want to build your own you can use the UDx framework to do that. Similarly, for transformation there are a lot of functions around time series, pattern matching, and outlier detection that you can use to transform the data; this is just a snapshot of some of the functions available in Vertica right now. And again, the good thing about these functions is not just their presence in the database; it's their ability to scale to really, really large data sets and compute those results in an acceptable amount of time, which makes them really valuable to your machine learning process. So, let's go to an example and see how we can use some of these functions. As I mentioned, there are a whole lot of them and we won't be able to go through all of them, but just for our understanding we can go through some and see how they work. We have here a sample data set of network flows. It simulates attacks from some source nodes, and there are some victim nodes on which these attacks are happening.
So yeah, let's just look at the data here real quick. We'll load the data, browse it, compute some statistics, ask some questions, make plots, and then clean the data. The objective here is not to make a prediction, per se, which is what we mostly do with machine learning algorithms, but to go through the data prep process and see how easy it is to do that with Vertica and what kind of options are there to help you through it. The first step is loading the data. Since in this case we know the structure of the data, we create a table with the column names and data types, but say you have a data set for which you do not already know the structure; there's a really cool feature in Vertica called flex tables, and you can use that to initially import the data into the database and then go through all of the variables and assign them types. You can also use that if your data is dynamic and changing: load the data first and then create these definitions. Once we've done that, we load the data into the database. It's one week of data out of the whole data set for now. Once it's loaded, we'd like to look at the flows just to see how the data looks, and when we do a select star from flows with a limit, we see that there's already some duplication, and by duplication I mean rows that have exactly the same data in each of the columns. So, as part of the cleaning process, the first thing we want to do is remove that duplication. We create a table with distinct flows, and you can see here we have about a million unique flows. Moving on. The next step: this is essentially time-stamped data, and the times span days of the week, so we want to look at the trends in this data, the network traffic, the flows. Based on the hours of the day, how does the traffic move, and how does it differ from one day to another? It's part of the exploration process; there might be a lot of further exploration you want to do, but we can start with this and see how it goes. You can see in the graph that we have seven days of data, and the weekend traffic, which is in pink and purple here, seems a little different from the rest of the days. Pretty close to each other, but definitely something we can look into and see whether there's a real difference we want to explore further. The thing is, this is just data for one week, as I mentioned. What if we load data for 70 days? You'd have a longer graph, probably, but a lot of lines, and you would not really be able to make sense of that data; it would be a really crowded plot. So we have to come up with a better way to explore it, and we'll come back to that in a little bit. What are some other things we can do? We can get some statistics; we can take one sample flow and look at some of the values. We see that the forward column and the ToS column have zero values, and when we explore further we see there are a lot of records for which these columns are essentially zero, so they're probably not really helpful for our use case.
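A rough sketch of what the loading and de-duplication steps just described might look like in SQL follows; the table names, file path, and parser are illustrative assumptions, not the exact ones used in the demo:

    -- Land data of unknown or changing structure in a flex table first.
    CREATE FLEX TABLE flows_raw();
    COPY flows_raw FROM '/data/netflows_week1.csv' PARSER fcsvparser();

    -- With the structure known, keep a regular table and drop exact-duplicate rows.
    CREATE TABLE flows_unique AS
        SELECT DISTINCT * FROM flows;

    SELECT COUNT(*) FROM flows_unique;   -- roughly a million unique flows in the demo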
Then, we can look at the flow end. Flow end is the time when the last packet in a flow was sent, and you can select the min and max flow end to see when the data started and ended; it's about one week of data, from the first to the eighth. Now, we also want to look at whether the data is balanced or not, because balanced data is really important for a lot of the classification use cases we might want to try with this. Looking at source address, destination address, source port, and destination port, you can see the data is highly imbalanced, the source versus destination address space in particular, so that's probably something we need to address. There are really powerful balancing functions in Vertica that you can use for that, with under-sampling, over-sampling, or hybrid sampling, and they can be really useful here. Another thing we can look at is summary statistics of these columns, so on the unique flows table that we created we just use the SUMMARIZE_NUMCOL function in Vertica, and it gives us a lot of really useful summary and percentile information. Now, if we look at the duration, which is the last record here, we can see that the mean is about 4.6 seconds, but when we look at the percentile information, we see that the median is about 0.27. So there are a lot of short flows with a duration of less than 0.27 seconds; yes, there are longer ones that pull the mean up to the 4.6 value, but the number of short flows is pretty high. We can ask some other questions of the data about the features. We can look at the protocols and their counts, and we see that most of the traffic is TCP and UDP, which is sort of expected for a data set like this. Then we want to look at the most popular network services, and again it's a simple query: select the destination port and count. We can see that most of the traffic is web traffic, HTTP and HTTPS, followed by domain name resolution. Let's explore some more. We can look at the label distribution. This is data for which we already know whether each record was an anomaly or not, so we could later train an algorithm on it. We see there's a background label with a lot of records, and then anomaly spam is really high; there are anomaly UDP scans and SSH scans as well. Another question we can ask is how the labels are distributed among the SMTP flows, and we see that anomaly spam is highest, followed by background spam. So can we say that SMTP flows are spam, and maybe build a model that actually answers that question for us? That could be one machine learning model you build out of this data set. Again, we can also verify the destination port of the flows that were labeled as spam. You'd expect port 25 for the SMTP service, and we can see that SMTP with destination port 25 has a lot of counts, but there are some other destination ports for which the count is really low, and when we're doing an analysis at this scale, those data points might not really be needed. So, as part of the data prep and data cleaning, we might want to get rid of those records.
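A small sketch of the exploration and balancing calls mentioned above; the column and label names are assumptions, and the exact BALANCE parameters should be checked against the documentation:

    -- Summary statistics (count, mean, percentiles, ...) for a numeric column.
    SELECT SUMMARIZE_NUMCOL(duration) OVER() FROM flows_unique;

    -- Most popular services: a simple aggregate as described in the talk.
    SELECT dst_port, COUNT(*) AS cnt
    FROM flows_unique
    GROUP BY dst_port
    ORDER BY cnt DESC;

    -- Rebalance a skewed label column (under-, over-, or hybrid sampling).
    SELECT BALANCE('flows_balanced', 'flows_unique', 'label', 'hybrid_sampling');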
So now, going back to the graph that I showed earlier, we can try to plot the daily trends by aggregating them. We take the unique flows and convert them into flow counts, a manageable number that we can then feed into one of the algorithms. Now, PCA, principal component analysis, is a really powerful algorithm in Vertica. What it essentially does is this: a lot of times you have a high number of columns, which might be highly correlated with each other; you can feed them into the PCA algorithm and it gives you a list of principal components that are linearly independent of each other. Each of these components explains a certain amount of the variance of the overall data set. You can see here that component one explains about 73.9% of the variance and component two explains about 16%, so those two components combined get you to around 90% of the variance. Now, you can use PCA for a lot of different purposes, but in this specific example we want to see what sort of information we get if we aggregate all the data points we have by day of the week. Is there any insight this provides? Because once you're down to two values per day, it's really easy to plot them. So we fit the PCA model first and then apply it to our data set, and this is the graph we get as a result. Component one is on the x-axis, component two is on the y-axis, and each of these points represents a day of the week. With just two components it's easy to plot, and compare this to the graph we saw earlier, which had a lot of lines; the more weeks or days we added, the more lines we'd have. In this graph you can clearly tell that the five days of traffic from Monday through Friday are closely clustered together, so they're probably pretty similar to each other, while Saturday's traffic sits well apart from all of those days and is also further away from Sunday. So these two days of traffic are different from the other days, and we can always dive deeper into this, look at exactly what's happening, and see how this traffic is actually different. But with just a few functions and some pretty simple SQL queries, we were already able to get a pretty good insight from the data set that we had.
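A sketch of the PCA step described above. The input relation here is an assumed aggregate of hourly flow counts keyed by day of the week; the names are illustrative, not taken from the demo:

    -- Fit a PCA model on the aggregated counts (all columns except the key).
    SELECT PCA('flows_pca', 'daily_hourly_counts', '*'
               USING PARAMETERS exclude_columns='day_of_week');

    -- Project each day onto the first two principal components for plotting.
    SELECT APPLY_PCA(* USING PARAMETERS model_name='flows_pca',
                                        key_columns='day_of_week',
                                        exclude_columns='day_of_week',
                                        num_components=2) OVER()
    FROM daily_hourly_counts;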
Now, let's move on to the next part of this talk, importing and exporting PMML models to and from Vertica. The current common practice is that when you're putting your machine learning models into production, you have a dev or test environment in which you might be using a lot of different tools, Scikit-learn, Spark, R, and once you want to deploy these models into production, you put them into containers. There's a pool of containers in the production environment talking to your database, which could be your analytical database, and all of the new incoming data lands in the database itself. So, as I mentioned on one of the earlier slides, there's a lot of data transfer happening between that pool of containers hosting your trained machine learning models and the database, from which you pull data for scoring and then send the scores back. So, why would you really need to transfer your models? The thing is that no machine learning platform provides everything. One tool might have some really cool algorithms, but then Spark might have its own benefits in terms of additional algorithms or other things you're looking for, and that's why a lot of these tools might be used in the same company at the same time. Then there might be some functional considerations as well. You might want to isolate your data science environment from your production environment, or you might want to score pre-trained models on some edge nodes where you probably cannot host a big solution. So there's a whole set of use cases where model movement, transferring a model from one tool to another, makes sense. Now, one of the common methods for transferring models from one tool to another is the PMML standard. It's an XML-based model exchange format, a standard way to define statistical and data mining models, and it helps you share models between different applications that are PMML compliant. It's a really popular standard, and it's our format of choice for moving models to and from Vertica. Along with this model movement capability, there are a lot of model management capabilities that Vertica offers. Models are essentially first-class citizens of Vertica. What that means is that each model is associated with a DB schema; the user that initially creates a model is its owner, but they can transfer ownership to other users and work with the ownership rights in the same way you would work with any other relation in the database. The same kinds of commands you use for granting access to a relation, changing its owner, changing its name, or dropping it, you can use for a model. There are a lot of functions for exploring the contents of models, which really helps in putting these models into production; the metadata of these models is also available for model management and governance; and finally, the import/export capability lets you apply all of these operations to models that you have imported or that you might want to export while they're in the database.
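A brief sketch of the kind of model-management statements described above; the model and user names are placeholders, and the exact grant syntax may vary by version:

    -- Models behave like other first-class database objects.
    ALTER MODEL myModel OWNER TO analyst_user;      -- transfer ownership
    ALTER MODEL myModel RENAME TO churn_model_v1;   -- rename
    GRANT USAGE ON MODEL churn_model_v1 TO report_user;
    DROP MODEL IF EXISTS old_churn_model;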
I think it would be nice to actually go through an example to showcase some of these model management capabilities, including PMML model import and export. The workflow for export is that we train a logistic regression model and save it as an in-DB Vertica model; then we explore the summary and attributes of the model, look at what's inside it, what the training parameters and coefficients are, and then we export the model as PMML so that an external tool can import it. Similarly, we'll go through an example for import: we'll have an external PMML model trained outside of Vertica, we'll import that PMML model, and from there on we'll essentially treat it as an in-DB PMML model. We'll explore the summary and attributes of the model in much the same way as an in-DB model, apply the model for in-DB scoring on some test data, and get the prediction results. So first, we want to create a connection with the database. In this case, we're using a Python Jupyter notebook. We have the Vertica Python connector here, a really powerful connector that lets you do a lot with the database from the Jupyter front end, but you can use any other SQL front-end tool or, for that matter, any other Python IDE that lets you connect to the database. So, exporting a model. First, we create a logistic regression model: select LOGISTIC_REG, we give it a model name, then the input relation, which might be a table, temp table, or view, the response column, and the predictor columns. So we get a logistic regression model. Now we look at the models table and see that the model has been created; this is a table in Vertica that contains a list of all the models in the database. We can see that myModel, the one we just created, has Vertica models as its category, the model type is logistic regression, and there's some other metadata around this model as well. Now we can look at some of the summary statistics of the model. We can look at the details, which give us the predictors, coefficients, standard error, z value, and p value. We can look at the regularization parameters; we didn't use any, so that shows a value of one, but if you had used them, they would show up here. There's the call string and also additional information such as iteration count, rejected row count, and accepted row count. We can also look at the list of attributes of the model: select GET_MODEL_ATTRIBUTE using parameter model name myModel, and for this particular model it gives us the names of all the attributes. Similarly, you can look at the coefficients of the model in column format: the same call, but in this case we add attribute name equals details because we want all the details for that particular model, and we get the predictor name, coefficient, standard error, z value, and p value. So now we can export this model. We use select EXPORT_MODELS and give it a path where we want the model to be exported, the name of the model to export, because you might have a lot of models, and the category, which in our example is PMML, and we get a status message that the export was successful. Now let's move on to the importing models example. In much the same way that we created a model in Vertica and exported it, you might want to create a model outside of Vertica in another tool and then bring it into Vertica for scoring, because Vertica holds all the data and scoring happens a lot more quickly there than model training does elsewhere. So in this particular case we do a select IMPORT_MODELS, and we're importing a logistic regression model that was created in Spark. The category again is PMML, and we get the status message that the import was successful.
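A condensed sketch of the calls walked through above; the paths, relation, and column names are placeholders rather than the ones used in the demo:

    -- Train and inspect a native logistic regression model.
    SELECT LOGISTIC_REG('myModel', 'training_data', 'response_col', 'pred1, pred2, pred3');
    SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='myModel');
    SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='myModel', attr_name='details');

    -- Export the native model as PMML, and import a PMML model trained elsewhere.
    SELECT EXPORT_MODELS('/tmp/models', 'myModel' USING PARAMETERS category='PMML');
    SELECT IMPORT_MODELS('/tmp/models/spark_logistic_reg' USING PARAMETERS category='PMML');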
Now, let's look at the models table again and see that the model is really present there. Previously when we ran this query we had only myModel, so that was the only entry you saw, but now that this model is imported you can see it as line item number two, the Spark logistic regression, in the public schema. The category here, however, is different, because it's not a natively trained model but an imported one, so you see PMML here, along with other metadata about the model. Now, let's do some of the same operations that we did with the in-DB model. We can look at the summary of the imported PMML model: you can see the function name, data fields, predictors, and some additional information. Moving on, let's look at the attributes of the PMML model with GET_MODEL_ATTRIBUTE, essentially the same query we ran earlier, only the model name is different, and you get the attribute names, attribute fields, and number of rows. We can also look at the coefficients of the PMML model: name, exponent, and coefficient. So yeah, pretty much the same things you can do with an in-DB model; you can perform all of these operations on an imported model too. One additional thing we want to do here is use this imported model for prediction. In this case, we do a select PREDICT_PMML and give it some values, using parameters model name set to the Spark logistic regression model and match by position set to true, which is a really cool feature. If you have a model imported from another platform in which, let's say, you have 50 columns, the names of the columns in the environment where you trained the model might be slightly different from the names of the columns you have set up in Vertica, but as long as the order is the same, Vertica can match those columns by position and you don't need the exact same names. So in this case we set that to true, and we see that PREDICT_PMML gives us a result of one. Now, here we used the imported model on a single set of values, but you can also use it on a table; in that case you get the predictions for the table, and you can look at the evaluation metrics to see how well you did.
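A minimal sketch of scoring with the imported PMML model; the table and column names are placeholders:

    -- Score new rows in-database with the imported PMML model.
    SELECT PREDICT_PMML(col1, col2, col3
                        USING PARAMETERS model_name='spark_logistic_reg',
                                         match_by_pos='true') AS prediction
    FROM test_data;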
Now, just to wrap this up, it's really important to understand the distinction between using your models in a single-node tool you might already be using, like Python or R, versus Vertica. Let's say you build a model in Python; it might be a single-node solution. After building that model, you want to do prediction on really large amounts of data, and you don't want the overhead of moving that data out of the database every time you want to predict. What you can do is import that model into Vertica, and what Vertica does differently than Python is that the PMML model is actually distributed across each node in the cluster, so it's applied to the data segments on each of those nodes, with multiple threads running for that prediction. The speed you get for prediction is therefore much, much faster. Similarly, once you build a machine learning model in Vertica, the objective is usually to use all of your data and build a model that's accurate, not one trained on just a sample. The model building process works the same way: it's distributed across all nodes in the cluster, using all the threads and processes available on those nodes. So, really fast model training; but let's say you wanted to deploy it on an edge node and do prediction closer to where the data is being generated. You can export that model in PMML format and deploy it on the edge node, so it's really helpful for a lot of use cases. And just some closing takeaways from our discussion today. Vertica is a really powerful tool for machine learning, for data preparation, model training, prediction, and deployment. You might want to use Vertica for all of these steps or only some of them; either way, Vertica supports both approaches. In the upcoming releases, we're planning to have more import and export capability through PMML models. Initially we're supporting k-means, linear regression, and logistic regression, but we keep adding more algorithms, and the plan is to move toward supporting custom models. If you want to do that with the upcoming release, our TensorFlow integration is there for you to use, but with PMML this is the starting point for us, and we'll keep improving it. Vertica models can be exported in PMML format for scoring on other platforms, and similarly, models built in other tools can be imported for in-DB machine learning and in-DB scoring within Vertica. There are a lot of important model management capabilities provided in Vertica, with more on the roadmap that we'll keep developing, and many ML functions and algorithms are already part of the in-DB library, with more being added as well. So, thank you so much for joining the discussion today, and if you have any questions, we'd love to take them now. Back to you, Sue.
Scott Howser, Hadapt - MIT Information Quality 2013 - #MIT #CDOIQ #theCUBE
>> Okay, we're back. We are in Cambridge, Massachusetts. This is Dave Vellante. I'm here with Jeff Kelly of Wikibon. This is theCUBE, SiliconANGLE's production. We're here at the MIT Information Quality Symposium, in the heart of database design and development. We've had some great guests on. Scott Howser is here; he's the head of marketing at Hadapt, a company that we introduced to our community quite some time ago, really bringing multiple channels into the Hadoop ecosystem and helping make sense of all this data, bringing insights to the data. Scott, welcome back to theCUBE. >> Thanks for having me. It's good to be here. >> So, this notion of data quality. The reason we asked you to be on here today is, first of all, you're a practitioner. You've been in the data warehousing world for a long, long time, so you've struggled with this issue. People here today really come from the world of, hey, we've been doing big data for a long time, this whole big data theme is nothing new to us. Sure, but there's a lot that is new. So take us back to your days as a data practitioner, data warehousing, business intelligence. What were some of the data quality issues that you faced, and how did you deal with them? >> I think a couple of points to raise in that area: one of the things we liked to do was try and triangulate on a user to engage them, and every channel we wanted to bring into the fold created a unique dimension of how do we validate that this is the same person, right? Because each channel that you engage with has potentially different requirements for user accreditation or a guarantee of, you know, a single user view. I think the Holy Grail used to be, in a lot of ways, single sign-on, a way to triangulate across these disparate systems on one common identity or person to make that world simple. I don't think that's a reality, in the sense that when you look at a product or solution provider and a customer that's external, those two worlds are very disparate, and there are a lot of channels, potentially even third-party means, that I might want to engage this individual by. And every time I want to bring another one of those channels online, it further complicates validating who that person is. >> Okay, so when you were doing your data warehouse thing as an IT practitioner, you tried to expand the channels, but every time you did that, you added complexity to the data sources. So how did you deal with that problem? Just create another database and silo everything? >> Unfortunately, it absolutely creates this notion of islands of information throughout the enterprise, because, as you mentioned, we define a schema, effectively a new place to put data elements, of how you identify that person, how you engage them, and how you rate that person's behaviors or engagement, etcetera. And what you'd see is that as you'd bring on new sources, the time to actually merge those things together wasn't on the order of days or weeks, it was months and years. So with every new channel that became interesting, you'd further complicate the problem, and effectively what you do is create these pools of information, then take extracts and try to munge the data and put it in a place where you can give access to an analyst and say, okay, here's
another sample set of data to try and figure out if these things align, and you try to create, effectively, a new schema that includes all the additional data we just added. >> So it's interesting, because again, one of the themes we've been hearing a lot at this conference, and you hear it a lot at many conferences, is that it's not the technology, it's the people and process around the technology. Certainly any person would agree with that. But at the same time, the technology historically has been problematic; data warehouse technology in particular has been challenging. You've had to keep databases relatively small and disparate, and you had to build business processes around those disparate databases. So you've not only got, you know, deficient technology, if you will, no offense to our data warehousing friends, but you've also got process creep that's occurred. >> That's fair, and I think what's happening is that it's one of the things that's led to the revolution occurring in the market right now, whether it's the Hadoop ecosystem or all the tangential technologies around it. Because what has bound up a lot of the technology issues in the past has been the schema, right? As important as that is, because it gives people a very easy way to interact with the data, it also creates significant challenges when you want to bring on these unique sources of information. Because, as you look at what's happened over the last decade, the engagement process for a consumer, a prospect, or a customer has changed pretty dramatically, and they don't all have the same stringent requirements about providing information to become engaged that way. So I think where the schema has value, obviously, in the enterprise, it also has a lot of historical challenges that it brings along with it. >> So this Hadoop movement is very disruptive to the traditional market spaces. Many of the traditional guys say it isn't, but it clearly is, particularly as you go omni-channel. I threw that word out earlier in the channels discussion we had at Hadoop Summit with myself, John Furrier, and others, and this is something that you guys are doing, bringing in data to allow your customers to go omni-channel. As you do that, you again increase the complexity of the corpus of data. At the same time, with Hadoop you hear a lot about schema-lite, or less schema. So how do you reconcile the omni-channel, the schema-less or schema-lite approach, and the data quality problems? >> Yes, I think, speaking particularly about Hadapt, one of the things that we do is give customers the ability to effectively dump all that data into one common repository, that is HDFS, and leverage some of those open source tools, and even their own inventions, if you will, with MapReduce code, Pig, whatever, and allow them to effectively normalize the data through iterations, and then push that into tables that we can give access to through the SQL interface. Right? So I think for us, you're absolutely right: the more channels you can give access to, right, this concept of omni-channel, where irrespective of the way we engage with a customer or the way they touch us,
being able to provide those dimensions of data in one common repository gives the marketeer, if you will, incredible flexibility and insights that were previously undiscoverable. >> Assuming that data quality is the same, right, across all of these. So that was going to be my question: what are the data quality implications of using something like HDFS? You're essentially schema-less; you're just dumping data in, essentially, a raw format. So now you've got to reconcile all these different types of data from different sources and build out that kind of single view of a customer, of a product, whatever it is for you. >> You're right.
And there's got to be some sort of balance between that actor Jerseyan time Um, some of the things that you know I've seen a lot of customers being interested in is it is a sort of market emerging around providing tools for authenticity of engagement. So it's an example. You know, I may be a large brand, and I have very, um, open channels that I engage somebody with my B e mail might be some Web portal, etcetera, and there's a lot of fishing that goes on out there, right? And so people fishing for whether it's brands and misrepresenting themselves etcetera. And there's a lot of, you know, desire to try and triangulate on data quality of who is effectively positioned themselves as me, who's really not me and being able to sort of, you know, take a cybersecurity spin and started to block those things down and alleviate those sort of nefarious activities. So We've seen a lot of people using our tool to effectively understand and be able to pinpoint those activities based upon behavior's based upon, um, out liars and looking at examples of where the engagement's coming from that aren't authentic if that >> makes you feel any somewhat nebulous but right. So using >> analytics essentially to determine the authenticity of a person of intensity, of an engagement rather than taking more rather than kind of looking at the data itself using pattern detection to determine. But it also taking, you know, there's a bunch of, um, there's a bunch of raw data that exists out there that needs you when you put it together again. Back to this notion of this sort of, you know, landing zone, if you will, or Data Lake or whatever you wanna call it. You know, putting all of this this data into one repository where now I can start to do you know, analytics against it without any sort of pre determined schema. And start to understand, you know, are these people who are purporting to be, you know, firm X y Z are there really from X y Z? And if they're not, where these things originating and how, when we start to put filters or things in place to alleviate those sort of and that could apply, it sounds like to certainly private industry. But, I mean, >> it sounds like >> something you know, government would be very interested in terms ofthe, you know, in the news about different foreign countries potentially being the source of attacks on U. S. Corporations are part of the, uh, part of our infrastructure and trying to determine where that's coming from and who these people are. And >> of course, people were trying to get >> complicated because they're trying to cover up their tracks, right? Certainly. But I think that the most important thing in this context is it's not necessarily about being able to look at it after the fact, but it's being able to look at a set of conditions that occur before these things happen and identify those conditions and put controls in place to alleviate the action from taking place. I think that's where when you look at what is happening from now an acceleration of these models and from an acceleration of the quality of the data gathering being able to put those things into place and put effective controls in place beforehand is changing. You know the loss prevention side of the business and in this one example. But you're absolutely right. From from what I see and from what our customers were doing, it is, you know, it's multi dimensional in that you know this cyber security. That's one example. There's pricing that could be another example. 
There's engagements from, ah, final analysis or conversion ratio that could be yet another example. So I think you're right in it and that it is ubiquitous. >> So when you think about the historical role of the well historical we had Stewart on earlier, he was saying, the first known chief data officer we could find was two thousand three. So I guess that gives us a decade of history. But if you look back at the hole, I mean data quality. We've been talking about that for many, many decades. So if you think about the traditional or role of an organization, trying tio achieved data quality, single version of the truth, information, quality, information value and you inject it with this destruction of a dupe that to me anyway, that whole notion of data quality is changing because in certain use, cases inference just fine. Um, in false positives are great. Who cares? That's right. Now analyzing Twitter data from some cases and others like healthcare and financial services. It's it's critical. But so how do you see the notion of data quality evolving and adapting to this >> new world? Well, I think one of these you mentioned about this, you know, this single version of the truth was something that was, you know, when I was on the other side of the table, >> they were beating you over the head waken Do this, We >> can do this, and it's It's something that it sounds great on paper. But when you look at the practical implications of trying to do it in a very finite or stringent controlled way, it's not practical for the business >> because you're saying that the portions of your data that you can give a single version of the truth on our so small because of the elapsed time That's right. I think there's that >> dimension. But there's also this element of time, right and the time that it takes to define something that could be that rigid and the structure months. It's months, and by that time a lot of the innovations that business is trying to >> accomplish. The eyes have changed. The initiatives has changed. Yeah, you lost the sale. Hey, but we got the data. It would look here. Yeah, I think that's your >> right. And I think that's what's evolving. I think there's this idea that you know what Let's fail fast and let's do a lot of it. Orations and the flexibility it's being provided out in that ecosystem today gives people an opportunity. Teo iterated failed fast, and you write that you set some sort of, you know confidence in that for this particular application. We're happy with you in a percent confidence. Go fish. You are something a little >> bit, but it's good enough. So having said that now, what can we learn from the traditional date? A quality, you know, chief data officer, practitioners, those who've been very dogmatic, particularly in certain it is what can we learn from them and take into this >> new war? I think from my point of view on what my experience has always been is that those individuals have an unparalleled command of the business and have an appreciation for the end goal that the business is trying to accomplish. And it's taking that instinct that knowledge and applying that to the emergence of what's happening in the technology world and bringing those two things together. I think it's It's not so much as you know, there's a practical application in that sense of Okay, here's the technology options that we have to do these, you know, these desired you engaged father again. It's the pricing engagement, the cyber security or whatever. It's more. 
How can we accelerate what the business is trying to accomplish by applying this, you know, this technology that's out there to the business problem? I think in a lot of ways, you know, in the past it's always been, here's this really neat technology, how can I make it fit somewhere? And now I think those folks bring a lot of relevance to the technology, to say, hey, here's a problem we're trying to solve; legacy methodologies haven't been effective, haven't been timely, haven't been, uh, scalable, whatever. Okay, let me apply what's happening in the market today to these problems. >> Um, you guys, Hadapt in particular, are to me anyway a good signal of the maturity model, and with the maturity of Hadoop, it's starting to grow up pretty rapidly, you know, see Hadoop 2.0. And so where are we at? What do you see as the progression, um, and where are we going? >> So, you know, I mentioned it on the Cube the last time I was on, and I said I believe that, you know, Hadoop will be the operating system of big data. And I believe that, you know, there's a huge transition taking place. There were some interesting responses to that on Twitter and all the other channels, but I stand behind it. I think that's really what's happening. Look at, you know, what people are engaging us to do: it's really to start to transition away from the legacy methodologies, and they're looking at not just lower-cost alternatives, but also more flexibility. And we talked about, you know, at the summit, the notion of that revenue curve, right? Cost takeout is great, that's one side of the coin, or one side of the fence here. But I think equally and even more important is the change in the revenue curve and the insights that people are finding, because these unique channels, the omni-channel you describe, being able to look at all these dimensions of data in one unified place, are really changing the way that they can go to market, that they can engage consumers, and that they can provide access to the analysts. >> Yeah. I mean, ultimately, that's the most important thing. We had Stuart Madnick on, who's, you know, written textbooks on operating systems. We probably used them; I know I did. Maybe they were gone by the time you got there, you're younger. But the point being, you know, Hadoop as an operating system, the notion of a platform, is really changing dramatically. So, um, I think you're right on that. Okay, so what's next for you guys? Uh, we talked about, you know, customer traction and proof points you're working on, all right, I know. Um, you guys have got great tech, an amazing team. Um, what's next for you? >> So I think it's continuing to look at the market and being flexible with the market as the use cases develop. So, you know, obviously as a startup we're focused in a couple of key areas where we see a lot of early adoption and a lot of pain around a problem that we can solve. But I think it's really about continuing to develop those use cases, um, and expanding the market to become more of, you know, a holistic provider of analytic solutions on top of Hadoop. >> Uh, how's Cambridge working out for you? Right? I mean, the company moved up... the founders moved up from New Haven and chose the East Coast, chose Cambridge, which as East Coast people we're obviously really happy about. I don't live there full time, but I might as well. So how's that working out? The talent pool?
You know, the vibrancy of the community, the, you know, the young people that you're able to tap? >> So I see there's a bunch of dimensions around that. One, it's hot. It's really, really hot and humid, yes. But it's been actually fantastic. And if you look at not just the talent inside the team, but I think around the team... So if you look at our board, right, Jit Saxena, Chris Lynch, people who have been very successful in the database community over decades of experience, you know, getting folks like that onto the board; Felda Hardiman has been, you know, in this space as well for a long time. Having folks like that as, you know, advisors and providing guidance to the team is absolutely incredible. Hack/Reduce is a great facility where we do things like hackathons and meetups and get the community together. So I think there's been a lot of positive inertia around the company just being here in Cambridge. But, you know, from a development resource or recruiting point of view it's also been great, because you've got some really exceptional database companies in this area, and history will show you there's been a lot of success here, not only in incubating technology, but in building real database companies. And, you know, we're a startup on the block that people are very interested in, and I think we show a lot of, you know, dynamics that are changing in the market and the way the market's moving. So the ability for us to recruit talent is exceptional, right? We've got a lot of great people to pick from. We've had a lot of people join from other previously very successful database companies. The team's growing, you know, significantly in the engineering space right now. Um, but I just, you know, I can't say enough good things about the community, Hack/Reduce, and all the resources that we get access to because we're here in Cambridge. >> And Hack/Reduce is cool. So you guys are obviously leveraging that; you do how-tos to bring people in. So Hack/Reduce is essentially... it's not an incubator, it's really more of an idea cloud, a resource cloud, really, started by Fred Lalonde and Chris Lynch, and essentially people come in, they share ideas. You guys, I know, have hosted a number of how-tos, and it's basically open. You know, we've done some stuff there. It's very cool. >> Yeah, you know, I think, you know, even for us it's also a great place to recruit, right? We've met a lot of talented people there, and, you know, with the university participation as well, we get a lot of talent coming in to participate in these activities. And we do things that aren't just Hadapt related; we've had people teach Hadoop sessions and just sort of evangelize what's happening in the ecosystem around us. And like I said, it's just been a great resource pool to engage with. And, uh, I think it's been as beneficial to the community as it has been to us. So we're very grateful for that. >> All right, Scott, as always, awesome. See, I knew you'd have some good practitioner perspectives on data quality. Really appreciate you stopping by. >> My pleasure. Thanks for having me. >> Take care. All right, keep it right there, everybody; we'll be right back with our next guest. This is Dave Vellante with Jeff Kelly. This is the Cube. We're live here at the MIT Information Quality Symposium. We'll be right back.
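As a postscript to the "all the dimensions of data in one unified place" and schema-on-read themes in the conversation above, here is a minimal PySpark sketch: raw events from several channels are landed as-is and the structure is inferred at read time. The channel names, event shapes, and the choice of PySpark are illustrative assumptions on my part, not a description of Hadapt's product.

```python
# Minimal sketch of "all the channels in one place, no predetermined schema".
# Event shapes and channel names are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("omni_channel_sketch").getOrCreate()

# Raw events from different channels, each with its own shape -- no upfront schema.
raw_events = [
    '{"channel": "email",  "user": "a", "opened": true}',
    '{"channel": "web",    "user": "a", "page": "/pricing", "seconds": 42}',
    '{"channel": "mobile", "user": "b", "screen": "checkout"}',
    '{"channel": "email",  "user": "b", "opened": false}',
]

# Schema-on-read: Spark infers a merged schema when the JSON is loaded,
# so a new channel or a new field doesn't require redefining anything first.
events = spark.read.json(spark.sparkContext.parallelize(raw_events))

events.printSchema()
events.groupBy("channel").agg(
    F.count("*").alias("events"),
    F.countDistinct("user").alias("users"),
).show()

spark.stop()
```

The point of the sketch is that adding a channel or a field only changes the data that lands, not a schema that has to be renegotiated up front before any analysis can start.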
SUMMARY :
Dave Vellante and Jeff Kelly talk with Scott Howser of Hadapt, live at the MIT Information Quality Symposium, about how Hadoop and its ecosystem are changing data quality practice. Topics include using behavior and outlier analysis on raw data in a common repository to spot phishing and other inauthentic engagement, the trade-off between a rigid single version of the truth and fast, good-enough answers delivered at an agreed confidence level, what practitioners can learn from traditional chief data officers, Hadoop's maturation into an operating system for big data, and how being based in Cambridge and involved with Hack/Reduce has helped Hadapt recruit and grow.