Autonomous Log Monitoring

>> Sue: Hi everybody, thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Autonomous Monitoring Using Machine Learning." My name is Sue LeClaire, director of marketing at Vertica, and I'll be your host for this session. Joining me is Larry Lancaster, founder and CTO at Zebrium. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can also visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, just a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available for you to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Larry, over to you.

>> Larry: Hey, thanks so much. So hi, my name's Larry Lancaster, and I'm here to talk to you today about something that I think whose time has come, and that's autonomous monitoring. So, with that, let's get into it.

So, machine data is my life. I know that's a sad life, but it's true. I've spent most of my career taking telemetry data from products, either "in the field," as we used to call it, or, nowadays, "deployed," and bringing that data back, like log files and stats, and then building stuff on top of it: tools to run the business, or services to sell back to users and customers. And after doing that a few times, it got to the point where I was really sick of building the same kind of thing from scratch every time, so I figured, why not go start a company and do it so that we never have to do it manually again. It's interesting to note, I've put a little sentence here saying "companies where I got to use Vertica." I've actually been working with Vertica for a long time now, pretty much since they came out of alpha, and I've really been enjoying their technology ever since.

So, our vision is basically that I want a system that will characterize incidents before I notice. An incident is what we used to call a ticket in IT, or a support case in support. Nowadays, you may have a DevOps team, or a set of SREs who are monitoring a production deployment, and they'll call it an incident. So I'm looking for something that will notice and characterize an incident before I notice and have to go digging into log files and stats to figure out what happened. That's a pretty heady goal, and I'm going to talk a little bit today about how we do that.

So, let's look at logs in particular. Monitoring is the umbrella term we use to talk about how we monitor systems in the field that we've shipped, or how we monitor production deployments in a more modern stack. And there are log monitoring tools, but they have a number of drawbacks.
For one thing, they're slow, in the sense that if something breaks and I need to go to a log file, chances are really good that if you have a new issue, an unknown-unknown problem, you're going to end up in a log file. The problem then becomes that you're searching around looking for the root cause of the incident, right? And that's time-consuming.

They're also fragile, and this is largely because log data is completely unstructured; there's no formal grammar for a log file. So you have this situation where, if I write a parser today, and that parser is going to execute some automation, open or update a ticket, maybe restart a service, whatever it is that I want to happen, what'll happen later is that someone upstream who's writing the code that produces that log message might do something really useful for me, or for users, like fix a spelling mistake in that log message. And then the next thing you know, all the automation breaks. So it's a very fragile source for automation. And finally, because of that, people will set alerts on, "Oh, tell me how many thousands of errors are happening every hour," or some horrible metric like that, and that becomes the only visibility you have into the data. So because of all this, it's a very human-driven, slow, fragile process. So basically, we've set out to up-level that a bit.

I touched on this already: the truth is, if you do have an incident, you're going to end up in log files to do root cause. It's almost always the case. And so you have to wonder, if that's the case, why do most people use metrics only for monitoring? The reason is related to the problems I just described. Metrics are already structured, right? With logs, you've got this mess of stuff, so you only want to dig in there when you absolutely have to. But ironically, that's where a lot of the information you need actually is.

So we have a model today, and this model used to work pretty well. That model is called "index and search." It basically means you treat log files like text documents: you index them, and when there's some issue you have to drill into, you go searching, right? So let's look at that model. Twenty years ago, we had a shrink-wrap software delivery model. When you had an incident, maybe you had one customer, a monolithic application, and a handful of log files. So it was perfectly natural; in fact, usually you could just vi the log file and search that way, or if there were a lot of them, you could index them and search them that way. And that all worked very well, because the developer or the support engineer had to be an expert in those few log files and understand what they meant.

But today, everything has changed completely. We live in a software-as-a-service world. What that means is, for a given incident, first of all, you're going to be affecting thousands of users. You're going to have, potentially, 100 services deployed in your environment. You're going to have 1,000 log streams to sift through. And yet, you're still stuck in the situation where, to go find out what's the matter, you're going to have to search through the log files. So this is the unacceptable position we're in today. So for us, the future will not be index and search. And that's simply because it cannot scale.
And the reason I say it can't scale is because it's all bottlenecked by a person and their eyeball. You can keep driving up the amount of data that has to be sifted through and the complexity of the stack that has to be understood, and at the end of the day, for MTTR purposes, you still have the same bottleneck, which is the eyeball. So this model, I believe, is fundamentally broken. And that's why I believe that in five years, most monitoring of unknown-unknown problems is going to be done autonomously, and those issues will be characterized autonomously, because there's no other way it can happen.

So now I'm going to talk a little bit about autonomous monitoring itself. Autonomous monitoring basically means: imagine a monitoring platform, and you watch the monitoring platform, maybe you watch the alerts coming from it or, more importantly, you watch the dashboards and try to see if something looks weird. Autonomous monitoring is the notion that the platform should do the watching for you, only let you know when something is going wrong, and give you a window into what happened.

So look at this example I have on screen, just to take it really slowly and absorb the concept of autonomous monitoring. Here in this example, we've stopped the database, and as a result, down below, you can see there was a bunch of fallout. This is an Atlassian stack, so you can imagine you've got a Postgres database, and then you've got Bitbucket, and Confluence, and Jira, and these various other components that need the database operating in order to function. So what this is doing is calling out, "Hey, the root cause is the database stopped, and here are the symptoms."

Now, you might be wondering, so what? I could go write a script to do this sort of thing. Here's what's interesting about this particular example, and I'll show a couple more examples that are a little more involved. In the software that came up with this incident, opened this incident, and put this root cause and these symptoms in there, there's no code that knows anything about timestamp formats, severities, Atlassian, Postgres, databases, Bitbucket, or Confluence. There are no regexes that look for "starting," "stopped," "RDBMS," "swallowed exception," and so on and so forth. So you might wonder how it's possible that something completely ignorant of the stack could come up with this description, which is exactly what a human would have had to produce to figure out what happened. I'm going to get into how we do that. But that's what autonomous monitoring is about: getting a set of telemetry from a stack with no prior information, and understanding when something breaks. And I can give you the punchline right now, which is that there are fundamental ways that software behaves when it's breaking. And by looking at hundreds of data sets containing incidents that people have generously allowed us to use, we've been able to characterize that behavior and now generalize it to apply to any new data set and stack.

So here's an interesting one right here. There's a fella, David Gill, who's just a genius in the monitoring space. He's been working with us for the last couple of months. And he said, "You know what I'm going to do? I'm going to run some chaos experiments." For those of you who don't know what chaos engineering is, here's the idea.
So basically, let's say I'm running a Kubernetes cluster. What I'll do is use a chaos injection tool, something like Litmus, and it will inject issues, it'll break things in my application randomly, to see if my monitoring picks it up. That's what chaos engineering is built around: generating lots of random problems and seeing how the stack responds.

In this particular case, David went in, and one of the tests presented through Litmus performed a pod delete. That's going to take out some containers that are part of the service layer, and then you'll see all kinds of things break. And what you're seeing here is interesting, and this is why I like to use this example, because it's actually kind of eye-opening. The chaos tool itself generates logs. And of course, through Kubernetes, all the log file locations on the host, and the container logs, are known, and those are all pulled back to us automatically. So one of the log files we have is actually from the chaos tool that's doing the breaking, right? And what the tool said here, when it went to determine the root cause, was that it noticed there was this process that had these messages: initializing deletion lists, selecting a pod to kill, and so on. It's saying that the root cause is the chaos test. And it's absolutely right, that is the root cause. But usually chaos tests don't get picked up themselves; you're supposed to be picking up just the symptoms. This is what happens when you're able to tease out root cause from symptoms autonomously: you end up getting a much more meaningful answer, right?

So here's another example. Essentially, we collect the log files, but we also have a Prometheus scraper. If you export Prometheus metrics, we'll scrape those and collect those as well, and we'll use them for our autonomous monitoring too. What you're seeing here is an issue where, I believe, we ran something out of disk space. So it opened an incident, but what's also interesting here is that it pulled in a metric to say that the spike in this metric was a symptom of running out of space. So again, there's nothing that knows anything about file system usage, memory, CPU, any of that stuff. There's no hard-coded logic anywhere to explain any of this.

The concept of autonomous monitoring is looking at a stack the way a human being would. If you can imagine how you would walk in and monitor something, how you would think about it: you'd go looking around for rare things, things that are not normal, and you would look for indicators of breakage, and you would ask, do those seem to be correlated in some dimension? That is how the system works.

So as I mentioned a moment ago, metrics really do complete the picture for us. We end up with a one-stop shop for incident root cause. So, how does that work? Well, we ingest and structure the log files. If we're getting the logs, we'll ingest them and structure them, and I'm going to show in a little bit what that structure looks like and how it goes into the database. And then, of course, we ingest and structure the Prometheus metrics. But here, "structure" really should have an asterisk next to it, because metrics are mostly structured already. They have names.
If you have your own scraper, as opposed to going into the Prometheus time series database and pulling metrics from there, you can keep a lot more metadata about those metrics from the exporter's perspective. So we keep all of that too. Then we do our anomaly detection on both of those sets of data, we cross-correlate metric and log anomalies, and then we create incidents. So this is, at a high level, what's happening, without any stack-specific logic built in.

So we had some exciting recent validation. MayaData is a pretty big player in the Kubernetes space. Essentially, they do Kubernetes as a managed service; they have tens of thousands of customers whose Kubernetes clusters they manage. And they're also involved both in the OpenEBS project and in the Litmus project I mentioned a moment ago, their tool for chaos engineering. So essentially, they said, "Okay, let's see if this is real." What they did was set up our collectors, which took three minutes in Kubernetes. And then, using Litmus, they reproduced eight incidents that their actual, real-world customers had hit, trying to remember the ones that were the hardest to root-cause at the time. And we picked up and placed a correct root cause indicator in 100% of those incidents, with no training, configuration, or metadata required. So this is what autonomous monitoring is all about.

So now I'm going to talk a little bit about how it works. Like I said, there's no information included or required about the stack. Imagine a log file, for example. Commonly, on the left-hand side of every line, there will be some sort of prefix. What I mean by that is you'll see a timestamp, or a severity, and maybe there's a PID, and maybe there's a function name, and maybe there's some other stuff there. Basically, those are the common data elements for a large portion of the lines in a given log file, even though, of course, the contents change. Today, if you look at a typical log manager, it'll talk about connectors. A connector means that, for an application, it'll recognize the prefix format that application generates in a log: what's the format of the timestamp, and what else is in the prefix. And that lets the tool pick it up. So if you have an app that doesn't have a connector, you're out of luck.

Well, what we do is learn those prefixes dynamically with machine learning. You do not have to have a connector, right? And what that means is that if you come in with your own application, the system will just work for it from day one. You don't have to have connectors, you don't have to describe the prefix format. That's so yesterday, right? What we really want to be doing is up-leveling what the system does to the point where it works like a human would. You look at a log line, and you know what's a timestamp. You know what's a PID. You know what's a function name. You know where the prefix ends and where the variable parts begin. You know what's a parameter over there in the variable parts. And sometimes you may need to see a couple of examples to know what was a variable, but you'll figure it out as quickly as possible, and that's exactly how the system goes about it.
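To make that idea concrete, here is a toy sketch of what learning a prefix from samples could look like. Zebrium's actual learner is not public, so the field patterns, the dominance threshold, and the sample lines below are all illustrative assumptions, not the product's code:

```python
import re
from collections import Counter

# Toy classifiers for common prefix fields. These patterns are
# illustrative assumptions; a real learner is far more general.
PATTERNS = {
    "timestamp": re.compile(r"^\d{4}-\d{2}-\d{2}[T ]?\d{0,2}.*"),
    "severity":  re.compile(r"^(TRACE|DEBUG|INFO|WARN(ING)?|ERROR|FATAL)$", re.I),
    "pid":       re.compile(r"^\[?\d{2,7}\]?$"),
}

def classify(token):
    for name, pat in PATTERNS.items():
        if pat.match(token):
            return name
    return "text"

def learn_prefix(sample_lines):
    """Vote, column by column, on what each leading token position holds.
    A position counts as prefix while one typed field clearly dominates."""
    votes = {}
    for line in sample_lines:
        for i, tok in enumerate(line.split()[:8]):   # leading tokens only
            votes.setdefault(i, Counter())[classify(tok)] += 1
    prefix = []
    for i in sorted(votes):
        field, count = votes[i].most_common(1)[0]
        if field == "text" or count / sum(votes[i].values()) < 0.9:
            break                                    # variable part begins here
        prefix.append(field)
    return prefix

lines = [
    "2020-03-30T12:00:01 INFO 4242 checkpoint started",
    "2020-03-30T12:00:05 WARN 4242 checkpoint slow: 3 scrubbers pending",
]
print(learn_prefix(lines))   # ['timestamp', 'severity', 'pid']
```

The shape of the problem is the same at any scale: vote on what each leading column holds, and stop where the variable part begins.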
As a result, we embrace free-text logs, right? If you look at a typical stack, most of the logs generated are free-text. Even structured logging typically has a message attribute, which inside of it contains the free-text message. For us, that's not a bad thing. That's okay. The purpose of a log is to inform people, so there's no need to go rewrite the whole logging stack just because you want a machine to handle it. The machines will figure it out for themselves, right? You give us the logs, and we'll figure out the grammar, not only for the prefix but also for the variable message part.

So I already went into this, but there's more that's usually required when configuring a log manager with alerts. You have to give it keywords. You have to give it application behaviors. You have to tell it some prior knowledge. And of course, the problem with all of that is that the most important events you'll ever see in a log file are the rarest. Those are the ones that are one in a billion. And so you may not know in advance what the right keyword is to pick up the next breakage, right? So we don't want that information from you. We'll figure it out for ourselves.

As the data comes in, we parse it and categorize it, as I've mentioned. And when I say categorize, what I mean is this: if you look at a given log file, you'll notice that some of the lines are kind of the same thing. So one will say "X happened five times," and then maybe a few lines below it'll say "X happened six times," but that's basically the same event type. It's just a different instance of that event type, with a different value for one of the parameters, right? So when I say categorization, what I mean is figuring out those unique event types, and I'll show an example of that next.

We do anomaly detection on top of that. Anomaly detection on metrics, in a time series by time series manner with lots of tunables, is a well-understood problem. We also do it on the occurrences of event types. You can think of each event type occurring in time as a point process; then you can develop statistics and distributions on that, and you can do anomaly detection on those. Once we have all of that, we have extracted features, essentially, from metrics and from logs. We do pattern recognition on the correlations across different channels of information, so different event types, different log types, different hosts, different containers, and then of course across to the metrics. Based on all of this cross-correlation, we end up with a root cause identification. So that's essentially, at a high level, how it works.
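To illustrate the point-process idea, here is a minimal sketch of one way to score a burst of an event type against its historical rate. The Poisson model and the numbers are illustrative assumptions; the production statistics are surely more involved:

```python
import math

def surprise(rate_per_min, count, window_min):
    """Poisson tail surprise: -log10 P[X >= count] for an event type with
    historical rate rate_per_min, seen 'count' times in window_min minutes.
    A crude stand-in for real per-event-type anomaly statistics."""
    lam = rate_per_min * window_min
    # P[X >= count] = 1 - CDF(count - 1); sum the head of the Poisson pmf
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(count))
    tail = max(1.0 - cdf, 1e-300)
    return -math.log10(tail)

# A type normally seen once every ~100 minutes fires 8 times in 5 minutes:
print(surprise(0.01, 8, 5))   # large score -> strongly anomalous
# The same burst from a chatty event type is unremarkable:
print(surprise(2.0, 8, 5))    # near zero
```

Correlating spikes like these across event types, hosts, containers, and metrics is what turns isolated anomalies into an incident.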
What's interesting, from the perspective of this talk particularly, is that incident detection needs relationally structured data. It really does. You need all the instances of a certain event type that you've ever seen to be easily accessible. You need the values for a given parameter quickly available, so you can figure out the distribution of that parameter over time and how often the event type happens. You can run analytical queries against that information so that you can quickly, in real time, do anomaly detection against new data.

So here's an example of what that looks like, and this is part of the work that we've done. At the top you see some examples of log lines, right? That's a snippet, three lines out of a log file, and you see the one in the middle there highlighted with colors, right? I mean, it's a little messy, but it's not atypical of the log files that you'll see pretty much anywhere. So there, you've got a timestamp, and a severity, and a function name. And then you've got some other information. And then finally, you have the variable part. And that's going to have this "checkpoint for memory scrubbers" message, probably something written in English just so that the person reading the log file can understand it, and then there are some parameters put in, right?

Now, if you look at how we structure that, there are going to be three tables that correspond to the three event types we see above, and we're going to look at the one that corresponds to the event type in the middle. If we look at that table, you'll see a table with columns: one for severity, for function name, for time zone, and so on, and date, and PID. And then, over to the right, the colored columns are the parameters that were pulled out of the variable part of that message. They're typed, and they're in integer columns. This is the way structuring needs to work with logs to be able to do efficient and effective anomaly detection. And as far as I know, we're the first people to do this inline.

All right, so let's talk now about Vertica, and why we take those tables and put them in Vertica. Vertica really is an MPP column store, but it's more than that, because nowadays when you say "column store," people think, for example, that Cassandra is a column store, but it's not a column store in the sense that Vertica is. Vertica was built from the ground up to be the original column store. Back in the C-Store research project that Stonebraker was involved in, he said, let's explore what kinds of efficiencies we can get out of a real columnar database. And what he and his grad students, who went on to start Vertica, found was that they could build a database that gives orders of magnitude better query performance for the kinds of analytics I'm talking about here today, with orders of magnitude less data storage underneath.

So building on top of machine data, as I mentioned, is hard, because it doesn't have any defined schemas. But we can use an RDBMS like Vertica, once we've structured the data, to do the analytics that we need to do. I talked a little bit about this already, but if you think about machine data in general, it's perfectly suited to a columnar store. Imagine laying out all the attributes of an event type. There may be, say, three or four function names that occur across all the instances of a given event type, and if you were to sort all of those event instances by function name, you'd find long, million-long runs of the same function name over and over. So what you have, in general, in machine data is lots and lots of slowly varying attributes, lots of low-cardinality data, and it's almost completely compressed out when you use a real column store. So you end up with a massive footprint reduction on disk. And that propagates through the analytical pipeline, because Vertica does late materialization, which means it tries to carry that data through memory with that same efficiency, right?
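As a hedged sketch of what one of those event-type tables could look like, here is a vertica-python snippet. The table name, column names, and connection details are hypothetical; the point is the typed parameter columns, and a low-cardinality-first sort order that compresses extremely well in a column store:

```python
import vertica_python

# Hypothetical shape of one event type's table: fixed prefix columns plus
# typed columns for parameters mined from the variable message part.
DDL = """
CREATE TABLE etype_checkpoint_scrubber (
    ts        TIMESTAMP,
    severity  VARCHAR(16),
    pid       INTEGER,
    func      VARCHAR(128),
    p0_count  INTEGER,   -- "checkpoint for memory scrubbers: <p0> pending"
    p1_ms     INTEGER
)
ORDER BY func, severity, ts   -- low-cardinality columns first compress best
"""

conn_info = {'host': 'localhost', 'port': 5433, 'user': 'dbadmin',
             'password': 'secret', 'database': 'logs'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(DDL)
    cur.execute("""INSERT INTO etype_checkpoint_scrubber
                   VALUES ('2020-03-30 12:00:01', 'INFO', 4242,
                           'ckpt_scrub', 3, 118)""")
    # Analytics stay cheap: e.g. the distribution of p0_count over time
    cur.execute("""SELECT DATE_TRUNC('hour', ts), AVG(p0_count)
                   FROM etype_checkpoint_scrubber
                   GROUP BY DATE_TRUNC('hour', ts)
                   ORDER BY 1""")
    print(cur.fetchall())
```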
So the scale-out architecture, of course, is really suitable for petascale workloads. Also, I should point out, I was going to mention it in another slide or two, but we use the Vertica Eon architecture, and we have had no problems scaling it in the cloud. It's a beautiful rewrite of the entire data layer of Vertica. The performance and flexibility of Eon are just unbelievable, and I've really been enjoying using it. I was skeptical that you could get a real column store to run in the cloud effectively, but I was completely wrong. Finally, I should mention that if you look at column stores, to me, Vertica is the one that has the full SQL support, the ODBC drivers, and the ACID compliance, which means I don't need to worry about these things as an application developer. So those are the reasons that I like to use Vertica.

I touched on this already, but essentially, what's amazing is that Vertica Eon is basically using S3 as an object store. And of course, there are other offerings, like the one that Vertica does with Pure Storage, that don't use S3. But what I find amazing is how well the system performs using S3 as an object store, and how they manage to keep an actually consistent database. And they do. We've had issues where we've gone and shut down hosts, or hosts have been shut down on us, and we have to restart the database, and we don't have any consistency issues. It's unbelievable, the work that they've done. Another thing that's great about the way it works is that you can use S3 as a shared object store. You can have query nodes querying from that set of files largely independently of the nodes that are writing to them, so you avoid the bottleneck where you've got contention over who's writing what and who's reading what, and so on. I've found the performance using separate subclusters for our UI and for ingest has been amazing.

Another couple of things they have is a lot of in-database machine learning libraries; there's actually some cool stuff on their GitHub that we've used. One thing we make a lot of use of is the sequence and time series analytics. For example, in our product, even though we do all of this stuff autonomously, you can also create alerts for yourself. One of the kinds of alerts you can define is, "Okay, if this kind of event happens within so much time, and then this kind of event happens, but not this one, then alert me." So you can define these sequences of events that would indicate a problem, and we use their sequence analytics for that. It gives you really good performance on queries where you're pulling sequences of events out of a fact table. And the time series analytics is really useful if you want to do analytics on the metrics and do gap-filling interpolation on them. It's actually really fast, and it's easy to use through SQL. So those are a couple of Vertica extensions that we use.
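As one concrete illustration of the time series analytics Larry mentions, here is a sketch of a gap-filling query using Vertica's TIMESERIES clause through vertica-python. The disk_metrics table, its columns, and the connection details are hypothetical placeholders (the sequence alerts he describes would use Vertica's event series pattern matching, the MATCH clause, instead):

```python
import vertica_python

# Gap-fill a sparse metric into regular 1-minute slices, linearly
# interpolating between observed points.
QUERY = """
SELECT slice_time,
       TS_FIRST_VALUE(usage_pct, 'LINEAR') AS usage_pct
FROM   disk_metrics
WHERE  host = 'node0001'
TIMESERIES slice_time AS '1 minute' OVER (ORDER BY ts)
"""

conn_info = {'host': 'localhost', 'port': 5433, 'user': 'dbadmin',
             'password': 'secret', 'database': 'logs'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(QUERY)
    for slice_time, usage in cur.iterate():
        print(slice_time, usage)
```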
So finally, I would like to encourage everybody: hey, come try us out. You should be up and running in a few minutes if you're using Kubernetes; if not, it's however long it takes you to run an installer. You can just come to our website, pick it up, and try out autonomous monitoring. And I want to thank everybody for your time. And we can open it up for Q&A.
Published Date: Mar 30, 2020


Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives

>> Sue: Hello everybody. Thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives." My name is Sue LeClaire, director of marketing at Vertica, and I'll be your host for this webinar. Joining me is Tom Wall, a member of the Vertica engineering team. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q&A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't get to, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So let's get started. Tom, over to you.

>> Tom: Hello everyone, and thanks for joining us today for this talk. My name is Tom Wall, and I am the leader of Vertica's ecosystem engineering team. We are the team that focuses on building out all the developer tools and third-party integrations that enable the software ecosystem that surrounds Vertica to thrive. So today, we'll be talking about some of our new open source initiatives and how they can be really effective for you and make it easier for you to build and integrate Vertica with the rest of your technology stack. We've got several new libraries, integration projects, and examples, all open source, to share, all being built out in the open on our GitHub page. Whether you use these open source projects or not, this is a very exciting new effort that will really help grow the developer community and enable lots of exciting new use cases.

So, every developer out there has probably had to deal with a problem like this. You have some business requirements, maybe to build some new Vertica-powered application, maybe to build some new system to visualize some data that's managed by Vertica. In various circumstances, lots of choices might be made for you that constrain your approach to solving a particular problem. These requirements can come from all different places. Maybe your solution has to work with a specific visualization tool or web framework, because the business has already invested in the licensing and the tooling to use it. Maybe it has to be implemented in a specific programming language, since that's what all the developers on the team know how to write code with. While Vertica has many different integrations with lots of different programming languages and systems, there are a lot of them out there, and we don't have integrations for all of them. So how do you make ends meet when you don't have all the tools you need? Well, you have to get creative, using tools like pyodbc, for example, to bridge between programming languages and frameworks to solve the problems you need to solve. Most languages do have an ODBC-based database interface. ODBC is our C library, and most programming languages know how to call C code, somehow.
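For example, bridging from Python through ODBC might look something like this sketch. It assumes you've already installed Vertica's ODBC driver and configured a DSN named VerticaDSN in odbc.ini (both placeholders here); the decoding calls hint at the kind of Unicode fiddling Tom alludes to next:

```python
import pyodbc

# Workable, but you first have to install the ODBC driver and wire up
# odbc.ini/odbcinst.ini. 'VerticaDSN' is whatever DSN you configured there.
conn = pyodbc.connect("DSN=VerticaDSN;UID=dbadmin;PWD=secret", autocommit=True)
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')    # encoding tweaks are
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')   # often needed per driver

cur = conn.cursor()
for row in cur.execute("SELECT version()"):
    print(row[0])
conn.close()
```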
So that's doable, but it often requires lots of configuration and troubleshooting to make all those moving parts work well together. It's enough to get the job done, but native integrations are usually a lot smoother and easier. So rather than, for example, fighting with pyodbc in Python, configuring things, getting Unicode working, and compiling all the different pieces the right way to make it all work smoothly, it would be much better if you could just pip install a library and get to work. And with vertica-python, a new Python client library, you can actually do that.

So that story probably sounds pretty familiar to a lot of the audience here, because we're all using Vertica. And our challenge, as big data practitioners, is to make sense of all this stuff, despite those technical and non-technical hurdles. Vertica powers lots of different businesses and use cases across all kinds of different industries and verticals. While there's a lot different about us, we're all here together right now for this talk because we do have some things in common. We're all using Vertica, and we're probably also using Vertica with other systems and tools too, because it's important to use the right tool for the right job. That's a founding principle of Vertica, and it's true today too. In this constantly changing technology landscape, we need lots of good tools and well-established patterns, approaches, and advice on how to combine them so that we can be successful doing our jobs.

Luckily for us, Vertica has been designed to be easy to build with and extend in this fashion. Databases as a whole have had this goal from the very beginning. They solve the hard problems of managing data so that you don't have to worry about them. Instead of worrying about those hard problems, you can focus on what matters most to you and your domain: implementing that business logic, solving that problem, without having to worry about all of the intense details of what it takes to manage a database at scale. With the declarative syntax of SQL, you tell Vertica what answer you want; you don't tell Vertica how to get it. Vertica will figure out the right way to do it for you, so that you don't have to worry about it. This SQL abstraction is very nice because it's a well-defined boundary where lots of developers know SQL, and it allows you to express what you need without having to worry about those details. So we can be the experts in data management while you worry about your problems.

This goes beyond, though, what's accessible through SQL to Vertica. We've got well-defined extension and integration points across the product that allow you to customize this experience even further. So if you want to do things like write your own SQL functions, or extend the database software with UDxs, you can do so. If you have a custom data format, maybe a proprietary format, or some source system that Vertica doesn't natively support, we have extension points that allow you to use those, and to make it very easy to do massive, parallel data movement, loading into Vertica but also exporting from Vertica to send data to other systems. And with these new features, in time, we can also do the same kinds of things with machine learning models, importing and exporting to tools like TensorFlow.
And it's these integration points that have enabled Vertica to build out this open architecture and a rich ecosystem of tools, both open source and closed source, of different varieties, that solve all the different problems that are common in this big data processing world. Whether it's open source streaming systems like Kafka or Spark, or more traditional ETL tools on the loading side, but also BI tools and visualizers and things like that to view and use the data that you keep in your database on the right side. And then of course, Vertica needs to be flexible enough to be able to run anywhere. So you can really take Vertica and use it the way you want, to solve the problems that you need to solve.

So Vertica has always employed open standards and integrated with all kinds of different open source systems. What we're really excited to talk about now is that we are taking our new integration projects and making those open source too. In particular, we've got two new open source client libraries that allow you to build Vertica applications for Python and Go. These libraries act as a foundation for all kinds of interesting applications and tools. Upon those libraries, we've also built some integrations ourselves, and we're using these new libraries to power some new integrations with third-party products. Finally, we've got lots of new examples and reference implementations out on our GitHub page that can show you how to combine all these moving parts in exciting ways to solve new problems. And the code for all these things is available now on our GitHub page, so you can use it however you like, and even help us make it better too.

So the first such project that we have is called vertica-python. Vertica-python began at our customer, Uber. Then, in late 2018, we collaborated with them, took it over, and made vertica-python the first official open source client for Vertica. You can use this to build your own Python applications, or you can use it via tools that were written in Python. Python has grown a lot in recent years, and it's a very common language for solving lots of different problems and use cases in the big data space, from things like DevOps automation and data science or machine learning, to just homegrown applications. We use Python a lot internally for our own QA testing and automation needs. And with the Python 2 end of life that happened at the end of 2019, it was important that we had a robust Python solution to help migrate our internal stuff off of Python 2, and also to provide a nice migration path for all of you, our users, who might be worried about the same problems with your own Python code.

So vertica-python is used already by lots of different tools, including Vertica's admintools, starting with 9.3.1. It was also used by Datadog to build a Vertica-Datadog integration that allows you to monitor your Vertica infrastructure within Datadog. So here's a little example of how you might use the Python client to do some work. We open a connection, we run a query to find out what node we've connected to, and then we do a little data load by running a COPY statement. And this is designed to have a familiar look and feel if you've ever used a Python database client before. We implement the DB API 2.0 standard, and it feels like a Python package. That includes things like being part of the centralized package manager, so you can just pip install it right now and go start using it.
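The code on the slide isn't reproduced in this transcript, but a sketch of the example Tom describes would look roughly like this; the connection details and table name are placeholders:

```python
import vertica_python

conn_info = {'host': 'localhost', 'port': 5433, 'user': 'dbadmin',
             'password': 'secret', 'database': 'VMart'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Which node did we land on?
    cur.execute("SELECT node_name FROM current_session")
    print(cur.fetchone()[0])

    # A little data load via COPY ... FROM STDIN
    cur.copy("COPY test_table (a, b) FROM STDIN DELIMITER ','",
             "1,foo\n2,bar\n")
```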
We also have our client for Golang, called vertica-sql-go. And this is a very similar story, just in a different context, for a different programming language. vertica-sql-go began as a collaboration with the Micro Focus SecOps group, who build Micro Focus's security products, some of which use Vertica internally to provide some of those analytics. So you can use this to build your own apps in the Go programming language, but you can also use it via tools that are written in Go. Most notably, we have our Grafana integration, which we'll talk a little bit more about later, that leverages this new client to provide Grafana visualizations for Vertica data. And Go is another programming language rising in popularity, because it offers an interesting balance of different programming design trade-offs. It's got good performance, good concurrency, and memory safety. We liked all those things, and we're using it to power some internal monitoring stuff of our own.

And here's an example of the code you can write with this client. This is Go code that does a similar thing: it opens a connection, runs a little test query, and then iterates over those rows, processing them using Go data types. You get that native look and feel just like you do in Python, except this time in the Go language. And you can go get it the way you usually package things with Go, by running that command there to acquire this package.

And it's important to note here that, for these projects, we're really doing open source development. We're not just putting code out on our GitHub page. So if you go out there and look, you can see that you can ask questions, you can report bugs, you can submit pull requests yourself, and you can collaborate directly with our engineering team and the other Vertica users out on our GitHub page. Because it's out on our GitHub page, it allows us to be a little bit faster with the way we ship and deliver functionality compared to the core Vertica release cycle. So in 2019, for example, as we were building features to prepare for the Python 3 migration, we shipped 11 different releases, addressing 40 customer-reported issues filed on GitHub. That was done over 78 different pull requests, and with lots of community engagement as we did so. So lots of people are using this already, as our GitHub badges show, with about 5,000 downloads a day of people using it in their software.

And again, we want to make this easy not just to use, but also to contribute to, understand, and collaborate on with us. So all these projects are built using the Apache 2.0 license. The master branch is always available and stable, with the latest functionality. And you can always build it and test it the way we do, so that it's easy for you to understand how it works and to submit contributions or bug fixes or even features. It uses automated testing, both locally and with pull requests. And for vertica-python, it's fully automated with Travis CI. So we're really excited about doing this, and we're really excited about where it can go in the future, because this offers some exciting opportunities for us to collaborate with you more directly than we ever have before. You can contribute improvements and help us guide the direction of these projects, but you can also work with each other to share knowledge, implementation details, and various best practices. And so maybe you think, "Well, I don't use Python, I don't use Go, so maybe it doesn't matter to me."
But I would argue it really does matter. Because even if you don't use these tools and languages, there are lots of amazing Vertica developers out there who do. And these clients act as low-level building blocks for all kinds of different interesting tools, both in the Python and Go worlds, but also well beyond that, because these implementations and examples really generalize to lots of different use cases. And we're going to do a deeper dive now into some of these, to understand exactly how that's the case and what you can do with these things.

So let's take a deeper look at some of the details of what it takes to build one of these open source client libraries. These database client interfaces, what are they exactly? Well, we all know SQL, but if you look at what SQL specifies, it really only talks about how to manipulate the data within the database. So once you're connected and in, you can run commands with SQL. But these database client interfaces address the rest of those needs: what does the programmer need to do to actually process those SQL queries? These interfaces are specific to a particular language or technology stack, but the use cases and the architectures and design patterns are largely the same between different languages. They all have a need to do some networking, and connect, and authenticate, and create a session. They all need to be able to run queries, and load some data, and deal with problems and errors. And then they also have a lot of metadata and type mapping, because you want to use these clients the way you use those programming languages, which might be different than the way that Vertica's data types and semantics work.

Some of these client interfaces are truly standards, and they are robust enough, in terms of what they design and call for, to support a truly pluggable driver model, where you might write an application that codes directly against the standard interface, and you can then plug in a different database driver, like a JDBC driver, to have that application work with any database that has a JDBC driver. Most of these interfaces aren't as robust as JDBC or ODBC, but that's okay, because as good as a standard is, every database is unique for a reason, and you can't really expose all of those unique properties of a database through these standard interfaces. Vertica is unique in that it can scale to the petabytes and beyond, and you can run it anywhere, in any environment, whether on-prem or on clouds. So surely there's something about Vertica that's unique, and we want to be able to take advantage of that fact in our solutions. So even though these standards might not cover everything, there's often a need, and common patterns arise to solve these problems in similar ways. When there isn't enough of a standard to define those common semantics that different databases might have, what you often see is that tools will invent plugin layers or glue code to compensate, defining application-wide standards to cover some of those same semantics. Later on, we'll get into some of those details and show off what exactly that means.

So if you connect to a Vertica database, what's actually happening under the covers? You have an application, you have a need to run some queries, so what does that actually look like? Well, probably as you would imagine, your application is going to invoke some API calls in some client library or tool.
This library takes those API calls and implements them, usually by issuing some networking protocol operations, communicating over the network to ask Vertica to do the heavy lifting required for that particular API call. These APIs usually do the same kinds of things, although some of the details might differ between the different interfaces. You do things like establish a connection, run a query, iterate over your rows, manage your transactions, that sort of thing.

Here's an example from vertica-python, which goes into some of the details of what actually happens during the Connect API call. And you can see all these details in our GitHub implementation of this. There are actually a lot of moving parts in what happens during a connection, so let's walk through some of that and see what actually goes on. I might have my API call like this, where I say connect and I give it a DNS name, which is my entire cluster, and I give it my connection details, my username and password. And I tell the Python client to get me a session, give me a connection, so I can start doing some work.

Well, in order to implement this, what needs to happen? First, we need to do some TCP networking to establish our connection. So we need to understand what the request is, where you're going to connect to and why, by parsing the connection string. And Vertica being a distributed system, we want to provide high availability, so we might need to do some DNS lookups to resolve that DNS name, which might be an entire cluster and not just a single machine, so that you don't have to change your connection string every time you add or remove nodes from the database. So we do some high availability and DNS lookup stuff. And then, once we connect, we might do load balancing too, to balance the connections across the different initiator nodes in the cluster, or in a subcluster, as needed.

Once we land on the node we want to be at, we might do some TLS to secure our connection. And Vertica supports the industry-standard TLS protocols, so this looks pretty familiar to everyone who's used TLS anywhere before. You're going to do a certificate exchange, and the client might send the server a certificate too, and then you're going to verify that the server is who it says it is, so that you can know that you trust it. Once you've established that connection and secured it, then you can start actually requesting a session within Vertica. You're going to send over your user information, like, "Here's my username, here's the database I want to connect to." You might send some information about your application, like a session label, so that you can differentiate on the database, with monitoring queries, what the different connections are and what their purpose is. And then you might also send over some session settings, to do things like autocommit, to change the state of your session for the duration of this connection, so that you don't have to remember to do that with every query that you run.

Once you've asked Vertica for a session, before Vertica will give you one, it has to authenticate you. And Vertica has lots of different authentication mechanisms, so there's a negotiation that happens there to decide how to authenticate you. Vertica decides based on who you are and where you're coming from on the network, and then you'll do an auth-specific exchange, depending on what the auth mechanism calls for, until you are authenticated. Finally, Vertica trusts you and lets you in, so you're going to establish a session in Vertica, and you might do some note-keeping on the client side just to know what happened. You might log some information, you might record what the version of the database is, you might do some protocol feature negotiation; if you connect to a version of the database that doesn't support all these protocols, you might decide to turn some functionality off, and that sort of thing. But finally, after all that, you can return from this API call, and then your connection is good to go.
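Many of the steps in that walkthrough surface as connection options in vertica-python. Here's a sketch; the hostnames, credentials, and certificate path are placeholders:

```python
import ssl
import vertica_python

# TLS with server verification (path is a placeholder)
ctx = ssl.create_default_context(cafile='/path/to/root.crt')

conn_info = {
    'host': 'vertica.example.com',          # may resolve to several nodes
    'backup_server_node': ['node2.example.com', 'node3.example.com'],
    'connection_load_balance': True,        # let the cluster spread sessions
    'port': 5433, 'user': 'dbadmin', 'password': 'secret',
    'database': 'logs',
    'session_label': 'ingest-worker-7',     # visible in monitoring queries
    'ssl': ctx,                             # secure the connection
    'autocommit': True,                     # session setting sent at startup
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])
```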
So that connection is just one example of many different APIs. And we're excited here because, with vertica-python, we're really opening up the Vertica client wire protocol for the first time. If you're a low-level Vertica developer and you might have used Postgres before, you might know that some of Vertica's client protocol is derived from Postgres. But they do differ in many significant ways, and this is the first time we've ever revealed those details about how it works and why. Not all Postgres protocol features work with Vertica, because Vertica doesn't support all the features that Postgres does. Postgres, for example, has a large object interface that allows you to stream very wide data values over. Vertica doesn't really have very wide data values; you have LONG VARCHARs, but that's about as wide as you can get. Similarly, the Vertica protocol supports lots of features not present in Postgres. Load balancing, for example, which we just went through an example of: Postgres is a single-node system, so it doesn't really make sense for Postgres to have load balancing. But load balancing is really important for Vertica, because it is a distributed system.

Vertica-python serves as an open reference implementation of this protocol, with all kinds of new details and extension points that we haven't revealed before. So if you look at these boxes below, all these different things are new protocol features that we've implemented since August 2019, out in the open on our GitHub page for Python. Now, the vertica-sql-go implementation of these things is still in progress, but the core protocols are there for basic query operations. There's more to do there, but we'll get there soon.

So this is really cool, because not only do you now have a Python client implementation and a Go client implementation of this, but you can use this protocol reference to do lots of other things too. The obvious thing you could do is build more clients for other languages. So if you have a need for a client in some other language that Vertica doesn't support yet, now you have everything available to solve that problem and to go about doing so if you need to. But beyond clients, it's also useful for other things. You might use it for mocking and testing: rather than connecting to a real Vertica database, you can simulate some of that. You can also use it to do things like query routing and proxies. Uber, for example, in the blog post at this link, tells a great story of how they route different queries to different Vertica clusters by intercepting these protocol messages, parsing the queries in them, and deciding which clusters to send them to. So a lot of these things are just ideas today, but now that you have the source code, there's no limit in sight to what you can do with this thing.
And so we're very interested in hearing your ideas and requests, and we're happy to offer advice and collaborate on building some of these things together.

So let's take a look now at some of the things we've already built that do these things. Here's a picture of Vertica's Grafana connector, with some data, powered from an example that we have in this blog link here. This has an internet of things use case to it, where we have lots of different sensors recording flight data, feeding into Kafka, which then gets loaded into Vertica. And then finally, it gets visualized nicely here with Grafana. Grafana's visualizations make it really easy to analyze the data with your eyes and see when something happens. So in these highlighted sections here, you notice a drop in some of the activity; that's probably a problem worth looking into. It might be a lot harder to see that just by staring at a large table yourself.

So how does a picture like that get generated with a tool like Grafana? Well, Grafana specializes in visualizing time series data. And time can be really tricky for computers to handle correctly. You've got time zones, daylight savings, leap seconds, negative-infinity timestamps; please don't ever use those. And as if it weren't hard enough just with those problems, what makes it harder is that every system does it slightly differently. So if you're querying some time data, how do we deal with these semantic differences as we cross these domain boundaries, from Vertica, to Grafana's back end architecture, which is implemented in Go, to its front end, which is implemented with JavaScript?

Well, you read this from the bottom up in terms of the processing. First, you select the timestamp, and Vertica's timestamp has to be converted to a Go time object, and we have to reconcile the differences that there might be as we translate it. Go time has a different time zone specifier format, and it also supports nanosecond precision, while Vertica only supports microsecond precision. That's not too big of a deal when you're querying data, because you just see some extra zeros in the fractional seconds. But on the way in, if we're loading data, we have to find a way to resolve those things. Once it's in the Go process, it has to be converted further to render in the JavaScript UI. There, the Go time object has to be converted to a JavaScript AngularJS Date object, and there too, we have to reconcile those differences. A lot of these differences might just be presentation, and not so much the actual data changing, but you might want to choose to render the date in a more human-readable format, like we've done in this example here.

Here's another picture. This is another picture of some time series data, and this one shows that you can actually write your own queries with Grafana to provide answers. If you look closely here, you can see there are actually some functions that might not look too familiar if you know Vertica's functions. Vertica doesn't have a $__time function or a $__timeFilter function. So what's actually happening there? How does this actually provide an answer if it's not really real Vertica syntax? Well, it's not sufficient to just know how to manipulate data; it's also really important that you know how to operate with metadata, information about how the data works in the data source, Vertica in this case.
So Grafana needs to know how time works in detail for each data source, beyond doing that basic I/O that we just saw in the previous example. So it needs to know: how do you connect to the data source to get some time data? How do you know what time data types and functions there are and how they behave? How do you generate a query that references a time literal? And finally, once you've figured out how to do all that, how do you find the time in the database? How do you know which tables have time columns that might be worth rendering in this kind of UI? So Go's standard database package doesn't actually offer many metadata interfaces. Nevertheless, Grafana needs to know those answers. And so it has its own plugin layer that provides a standardizing layer, whereby every data source can implement the hints and metadata customization needed to have an extensible data source back end. So we have another open source project, the Vertica-Grafana data source, which is a plugin that uses Grafana's extension points, with JavaScript in the front-end plugin and Go in the back-end plugin, to provide Vertica connectivity inside Grafana. So the way this works is that the plugin framework defines those standardizing functions like $__time and $__timeFilter, and it's our plugin that's going to rewrite them in terms of Vertica syntax. So in this example, $__time gets rewritten to a Vertica cast, and $__timeFilter becomes a BETWEEN predicate. So that's one example of how you can use Grafana, but also how you might build any arbitrary visualization tool that works with data in Vertica. So let's now look at some other examples and reference architectures that we have out on our GitHub page. For some advanced integrations, there's clearly a need to go beyond these standards. So SQL and the surrounding standards, like JDBC and ODBC, were really critical in the early days of Vertica, because they enabled a lot of generic database tools. And those will always continue to play a really important role, but the Big Data technology space moves a lot faster than these old database standards can keep up with. So there's all kinds of new advanced analytics and query pushdown logic that was never possible 10 or 20 years ago that Vertica can do natively. There are also all kinds of data-oriented application workflows doing things like streaming data, or parallel loading, or machine learning. And all of these things we need to build software with, but we don't really have standards to go by. So what do we do there? Well, open source implementations make for easier integrations and applications all over the place. So even if you're not using Grafana, for example, other tools have similar challenges that you need to overcome, and it helps to have an example there to show you how to do it. Take machine learning, for example. There have been many excellent machine learning tools that have arisen over the years to make data science and the task of machine learning a lot easier. And a lot of those have basic database connectivity, but they generally only treat the database as a source of data. So they do lots of data I/O to extract data from a database like Vertica for processing in some other engine. We all know that's not the most efficient way to do it. It's much better if you can leverage Vertica's scale and bring the processing to the data. So a lot of these tools don't take full advantage of Vertica, because there's not really a uniform way to go do so with these standards.
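To make the macro rewriting concrete, here is a hypothetical sketch of the idea in Python; the real plugin does this in Go inside Grafana's back end, and the exact rewrite rules below are illustrative, not the plugin's actual source.

    import re

    def rewrite_macros(sql, t_from, t_to):
        # $__time(col) becomes a cast so results come back as timestamps.
        sql = re.sub(r'\$__time\((\w+)\)', r'\1::TIMESTAMP AS time', sql)
        # $__timeFilter(col) becomes a BETWEEN over the dashboard's range.
        sql = re.sub(r'\$__timeFilter\((\w+)\)',
                     r"\1 BETWEEN '%s' AND '%s'" % (t_from, t_to), sql)
        return sql

    print(rewrite_macros(
        "SELECT $__time(ts), avg(value) FROM metrics "
        "WHERE $__timeFilter(ts) GROUP BY 1",
        '2020-03-01 00:00:00', '2020-03-02 00:00:00'))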
So instead, we have a project called vertica-ml-python. And this serves as a reference architecture of how you can do scalable machine learning with Vertica. So this project establishes a familiar machine learning workflow that scales with Vertica. So it feels similar to, like, a scikit-learn project, except all the processing, aggregation, and heavy lifting happen in Vertica. So this makes for a much more lightweight, scalable approach than you might otherwise be used to. So with vertica-ml-python, you can probably use this yourself. But you could also see how it works, so if it doesn't meet all your needs, you could still see the code and customize it to build your own approach. We've also got lots of examples of our UDX framework. And so this is an older GitHub project; we've actually had this for a couple of years, but it is really useful and important, so I wanted to plug it here. Our User Defined eXtensions framework, or UDXs, allows you to extend the operators that Vertica executes when it does a database load or a database query. So with UDXs, you can write your own domain logic in C++, Java, Python, or R, and you can call it within the context of a SQL query. And Vertica brings your logic to the data, and makes it fast and scalable and fault tolerant and correct for you. So you don't have to worry about all those hard problems. So our UDX examples demonstrate how you can use our SDK to solve interesting problems. And some of these examples might be complete, totally usable packages or libraries. So for example, we have a curl source that allows you to extract data from any curlable endpoint and load it into Vertica. We've got things like an ODBC connector that allows you to access data in an external database via an ODBC driver within the context of a Vertica query, all kinds of parsers and string processors and things like that. We also have more exciting and interesting things where you might not really think of Vertica as being able to do that, like a heat map generator, which takes some XY coordinates and renders them on top of an image to show you the hotspots in it. So the image on the right was actually generated from one of our intern gaming sessions a few years back. So all these things are great examples that show you not just how you can solve problems, but also how you can use this SDK to solve neat things that maybe no one else has to solve, or maybe that are unique to your business and your needs. Another exciting benefit is with testing. So the test automation strategy that we have in vertica-python and these clients really generalizes well beyond the needs of a database client. Anyone who's ever built a Vertica integration or an application probably has a need to write some integration tests. And that can be hard to do with all the moving parts in a big data solution. But with our code being open source, you can see in vertica-python, in particular, how we've structured our tests to facilitate smooth testing that's fast, deterministic, and easy to use. So we've automated the download process and the installation and deployment process of a Vertica Community Edition. And with a single click, you can run through the tests locally, and as part of the PR workflow via Travis CI. We also do this for multiple different Python environments. So for all Python versions from 2.7 up to 3.8, for different Python interpreters, and for different Linux distros, we're running through all of them very quickly with ease, thanks to all this automation.
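To ground the earlier point about bringing the processing to the data, here is a hedged sketch of what that looks like with Vertica's in-database machine learning functions called through vertica-python; the table and column names are invented for illustration, while LOGISTIC_REG and PREDICT_LOGISTIC_REG are the in-database training and scoring functions.

    import vertica_python

    with vertica_python.connect(host='127.0.0.1', port=5433,
                                user='dbadmin', database='VMart') as conn:
        cur = conn.cursor()
        # Train inside the database; no training data leaves Vertica.
        cur.execute("""
            SELECT LOGISTIC_REG('spam_model', 'flows_train', 'is_spam',
                                'duration, packets, bytes')
        """)
        # Score inside the database too; only predictions come back.
        cur.execute("""
            SELECT PREDICT_LOGISTIC_REG(duration, packets, bytes
                       USING PARAMETERS model_name='spam_model')
            FROM flows_test LIMIT 10
        """)
        print(cur.fetchall())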
So today, you can see how we do it in vertica-python; in the future, we might want to spin that out into its own stand-alone testbed starter project, so that if you're starting any new Vertica integration, this might be a good starting point for you to get going quickly. So that brings us to some of the future work we want to do here in the open source space. Well, there's a lot of it. So in terms of the client stuff, for Python, we are marching towards our 1.0 release, which is when we aim to be protocol complete, to support all of Vertica's unique protocols, including COPY LOCAL and some new protocols invented to support complex types, which is a new feature in Vertica 10. We have some cursor enhancements to do things like better streaming and improved performance. Beyond that, we want to take it where you want to bring it, so send us your requests. On the Go client front, it's just about a year behind Python in terms of its protocol implementation, but the basic operations are there. We still have more work to do to implement things like load balancing, some of the advanced auth methods, and other things. But there too, we want to work with you, and we want to focus on what's important to you, so that we can continue to grow and be more useful and more powerful over time. Finally, there's this question of, "Well, what about beyond database clients? What else might we want to do with open source?" If you're building a very deep or robust Vertica integration, you probably need to do a lot more exciting things than just run SQL queries and process the answers. Especially if you're an OEM, or you're a vendor that resells Vertica packaged as a black box piece of a larger solution, you might have to manage the whole operational lifecycle of Vertica. There are even fewer standards for doing all these different things compared to the SQL clients. So we started with the SQL clients, 'cause that's a well established pattern and there's lots of downstream work that can enable. But there's also clearly a need for lots of other open source protocols, architectures, and examples to show you how to do these things and to have real standards. So we talked a little bit about how you could do UDXs or testing or machine learning, but there are all sorts of other use cases too. That's why we're excited to announce here our Awesome Vertica list, which is a new collection of open source resources available on our GitHub page. So if you haven't heard of the awesome manifesto before, I highly recommend you check out this GitHub page on the right. We're not unique here; there are lots of awesome lists for all kinds of different tools and systems out there. And it's a great way to establish a community and share different resources, whether they're open source projects, blogs, examples, references, community resources, and all that. And this tool is an open source project itself. So it's an open source wiki, and you can contribute to it by submitting a PR yourself. So we've seeded it with some of our favorite tools and projects out there, but there's plenty more out there, and we hope to see it grow over time. So definitely check this out and help us make it better. So with that, I'm going to wrap up. I wanted to thank you all. Special thanks to Siting Ren and Roger Huebner, who are the project leads for the Python and Go clients respectively. And also, thanks to all the customers out there who've already been contributing stuff.
This has already been going on for a long time and we hope to keep it going and keep it growing with your help. So if you want to talk to us, you can find us at this email address here. But of course, you can also find us on the Vertica forums, or you could talk to us on GitHub too. And there you can find links to all the different projects I talked about today. And so with that, I think we're going to wrap up and now we're going to hand it off for some Q&A.


UNLIST TILL 4/2 - Vertica @ Uber Scale


 

>> Sue: Hi, everybody. Thank you for joining us today, for the Virtual Vertica BDC 2020. This breakout session is entitled "Vertica @ Uber Scale". My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Girish Baliga, Engineering Manager of Big Data at Uber. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can also visit the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. And as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded, and you'll be able to view it on demand this week. We'll send you a notification as soon as it's ready. So let's get started. Girish, over to you. >> Girish: Thanks a lot, Sue. Good afternoon, everyone. Thanks a lot for joining this session. My name is Girish Baliga. And as Sue mentioned, I manage interactive and real time analytics teams at Uber. Vertica is one of the main platforms that we support, and Vertica powers a lot of core business use cases. In today's talk, I wanted to cover two main things. First, how Vertica is powering critical business use cases across a variety of orgs in the company. And second, how we are able to do this at scale and with reliability, using some of the additional functionalities and systems that we have built into the Vertica ecosystem at Uber. And towards the end, I also have a little extra bonus for all of you. I will be sharing an easy way for you to take advantage of many of the ideas and solutions that I'm going to present today, that you can apply to your own Vertica deployments in your companies. So stick around and put on your seat belts, and let's go start on the ride. At Uber, our mission is to ignite opportunity by setting the world in motion. So we are focused on solving mobility problems, and enabling people all over the world to solve their local problems, their local needs, their local issues, in a manner that's efficient, fast, and reliable. As our CEO Dara has said, we want to become the mobile operating system of local cities and communities throughout the world. As of today, Uber is operational in over 10,000 cities around the world. So, across our various business lines, we have over 110 million monthly users who use our Rides services, or Eats services, and a whole bunch of other services that we provide through Uber. And just to give you a scale of our daily operations, we in the Rides business have over 20 million trips per day. And the Eats business is also catching up, particularly during the recent times that we've been having. And so, I hope these numbers give you a sense of the scale of the amount of data that we process each and every day, to support our users in their analytical and business reporting needs. So who are these users at Uber? Let's take a quick look. So, Uber, to describe it very briefly, is a lot like Amazon. We are largely an operations and logistics company, and our employee base reflects that.
So over 70% of our employees work in teams which come under the umbrella of Community Operations and Centers of Excellence. So these are all folks working in various cities and towns that we operate in around the world, and they run the Uber businesses as somewhat local businesses, responding to local needs, local market conditions, local regulation, and so forth. And Vertica is one of the most important tools that these folks use in their day to day business activities. So they use Vertica to get insights into how their businesses are going, to dig deeply into any issues that they want to triage, to generate reports, to plan for the future, a whole lot of use cases. The second big class of users are in our marketplace team. So marketplace is the engineering team that backs our ride sharing business. And as part of running this business, a key problem that they have to solve is how to determine what prices to set for particular rides, so that we have a good match between supply and demand. So obviously the real time pricing decisions are made by serving systems, with very detailed and well crafted machine learning models. However, the training data that goes into these models, the historical trends, the insights that go into building these models, a lot of these things are powered by the data that we store and serve out of Vertica. Similarly, in the Eats business, we have use cases spanning all the way from engineering and back-end systems, to support operations, incentives, growth, and a whole bunch of other domains. So the big class of applications that we support across a lot of these business lines is dashboards and reporting. So we have a lot of dashboards, which are built by core data analyst teams and shared with a whole bunch of our operations and other teams. So these are dashboards and reports that run periodically, say once a week, or once a day even, depending on the frequency of data that they need. And many of these are powered by the data and the analytics support that we provide on our Vertica platform. Another big category of use cases is growth marketing. So this is to understand historical trends, figure out what various business lines, various customer segments, various geographical areas are doing in terms of growth, where it is necessary for us to reinvest or provide some additional incentives, or marketing support, and so forth. So the analysis that backs a lot of these decisions is powered by queries running on Vertica. And finally, the heart and soul of Uber is data science. So data science is how we provide best in class algorithms, pricing, and matching. And a lot of the analysis that goes into figuring out how to build these systems, how to build the models, how to build the various coefficients and parameters that go into making real time decisions, is based on analysis that data scientists run on Vertica systems. So as you can see, Vertica usage spans a whole bunch of organizations and users, all across the different Uber teams and ecosystems. Just to give you some quick numbers, we have over 5,000 weekly active users, people who run queries at least once a week to solve some critical business problem that they have in their day to day operations. So next, let's see how Vertica fits into the Uber data ecosystem. So when users open up their apps and request a ride, or order food delivery on the Eats platform, the apps are talking to our serving systems.
And the serving systems use online storage systems to store the data as the trips and Eats orders are getting processed in real time. So for this, we primarily use an in-house built key value storage system called Schemaless, and an open source system called Cassandra. We also have other systems like MySQL and Redis, which we use for storing various bits of data to support the serving systems. So all of this operation generates a lot of data that we then want to process and analyze, and use for our operational improvements. So, we have ingestion systems that periodically pull in data from our serving systems and land it in our data lake. So at Uber, the data lake is powered by Hadoop, with files stored on HDFS clusters. So once the raw data lands on the data lake, we then have ETL jobs that process these raw datasets and generate modeled and customized datasets, which we then use for further analysis. So once these modeled datasets are available, we load them into our data warehouse, which is entirely powered by Vertica. So then we have a business intelligence layer, with internal tools like QueryBuilder, which is a UI interface to write queries and look at results, and Dashbuilder, which is a dashboard building tool and report management tool. So these are all various tools that we have built within Uber, and these can talk to Vertica and run SQL queries to power whatever dashboards and reports they are supporting. So this is what the data ecosystem looks like at Uber. So why Vertica, and what does it really do for us? So it powers the insights that we show on dashboards as folks use them, and it also powers the reports that we run periodically. But more importantly, we have some core properties and core feature sets that Vertica provides, which allow us to support many of these use cases very well and at scale. So let me take a brief tour of what these are. So as I mentioned, Vertica powers Uber's data warehouse. So what this means is that we load our core fact and dimension tables onto Vertica. The core fact tables are all the trips, all the Eats orders, and all the other line items for the various businesses from Uber, stored as partitioned tables. So think of having one partition per day, as well as dimension tables like cities, users, riders, driver partners, and so forth. So we have both these two kinds of datasets, which we load into Vertica. And we have full historical data, all the way since we launched these businesses, to today. So folks can do deeper longitudinal analysis, so they can look at patterns, like how the business has grown from month to month, year to year, the same month over a year, over multiple years, and so forth. And the really powerful thing about Vertica is that most of these queries, even the deep longitudinal queries, run very, very fast. And that's really why we love Vertica. Because we see query latency P90s, that is, the 90th percentile of all queries that we run on our platform, typically finish in under a minute. So that's very important for us, because Vertica is used primarily for interactive analytics use cases, and providing SQL query execution times under a minute is critical for our users and business owners to get the most out of analytics and Big Data platforms. Vertica also provides a few advanced features that we use very heavily. So as you might imagine, at Uber, one of the most important sets of use cases we have is around geospatial analytics.
In particular, we have some critical internal dashboards that rely very heavily on being able to restrict datasets by geographic areas, cities, source destination pairs, heat maps, and so forth. And Vertica has a rich array of geospatial functions that we use very heavily. We also have support for custom projections in Vertica, and this really helps us have very good performance for critical datasets. So for instance, on some of our core fact tables, we have done a lot of query analysis to figure out how users run their queries, what kind of columns they use, what combinations of columns they use, and what joins they do for typical queries. And then we have laid out our custom projections to maximize performance on these particular dimensions. And the ability to do that through Vertica is very valuable for us. So we've also had some very successful collaborations with the Vertica engineering team. About a year and a half back, we had open-sourced a Python client that we had built in house to talk to Vertica. We were using this Python client in our business intelligence layer that I'd shown on the previous slide. And we had open-sourced it after working closely with the Eng team. And now Vertica formally supports the Python client as an open-source project, which you can download and integrate into your systems. Another more recent example of collaboration is the Vertica Eon mode on GCP. So as most, or at least some, of you know, Vertica Eon mode is formally supported on AWS. And at Uber, we were also looking to see if we could run our data infrastructure on GCP. So the Vertica team hustled on this and provided us an early preview version, which we've been testing out to see how performance is impacted by running on the cloud, and on GCP. And so far, I think things are going pretty well, but we should have some numbers about this very soon. So here I have a visualization of an internal dashboard that is powered solely by data and queries running on Vertica. So this GIF has a sequence of different visualizations supported by this tool. So for instance, here you see a heat map of source traffic demand for ride shares. And then you will see a bunch of arrows here about source destination pairs and the trip lines. And then you can see how demand moves around. So, as it cycles through the various animations, you can basically see all the different kinds of insights and query shapes that we send to Vertica, which powers this critical business dashboard for our operations teams. All right, so now how do we do all of this at scale? So, we started off with a single Vertica cluster a few years back. So we had our data lake, and the data would land into Vertica. So these are the core fact and dimension tables that I just spoke about. And then Vertica powers queries at our business intelligence layer, right? So this is a very simple and effective architecture for most use cases. But at Uber scale, we ran into a few problems. So the first issue that we have is that Uber is a pretty big company at this point, with a lot of users sending almost millions of queries every week. And at that scale, what we began to see was that a single cluster was not able to handle all the query traffic. So for those of you who have done an introductory course on queueing theory, you will realize that basically, even though you could have all the queries processed through a single serving system, you will tend to see larger and larger queue wait times as the number of queries piles up.
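As a rough illustration of the custom projection idea mentioned above, here is a sketch of the DDL involved; the table, columns, and sort order are hypothetical stand-ins for the query patterns an analysis like Uber's would reveal.

    import vertica_python

    ddl = """
        CREATE PROJECTION trips_by_city
        AS SELECT trip_id, city_id, trip_date, fare
           FROM trips
           ORDER BY city_id, trip_date   -- match common filter/join columns
           SEGMENTED BY HASH(trip_id) ALL NODES
    """
    with vertica_python.connect(host='127.0.0.1', port=5433,
                                user='dbadmin', database='uber') as conn:
        conn.cursor().execute(ddl)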
And what this means in practice for end users is that they are basically just seeing longer and longer query latencies. But even though the actual query execution time on Vertica itself is probably less than a minute, their query is sitting in the queue for a bunch of minutes, and that's the end-user-perceived latency. So this was a huge problem for us. The second problem we had was that the cluster becomes a single point of failure. Now Vertica can handle single node failures very gracefully, and it can probably also handle two or three node failures, depending on your cluster size and your application. But very soon you will see that, when you have beyond a certain number of failures or nodes in maintenance, your cluster will probably need to be restarted, or you will start seeing some downtime due to other issues. So another example of why you would have to have a downtime is when you're upgrading software in your clusters. So, essentially we're a global company, and we have users all around the world; we really cannot afford to have downtime, even for a one-hour slot. So that turned out to be a big problem for us. And as I mentioned, we could have hardware issues. So we might need to upgrade our machines, or we might need to replace storage or memory due to issues with the hardware in there, due to normal wear and tear, or due to abnormal issues. And so because of all of these things, having a single point of failure, having a single cluster, was not really practical for us. So the next thing we did was we set up multiple clusters, right? So we had a bunch of identical clusters, all of which have the same datasets. So then we would basically load data using ingestion pipelines from our data lake onto each of these clusters. And then the business intelligence layer would be able to query any of these clusters. So this actually solved most of the issues that I pointed out in the previous slide. So we no longer had a single point of failure. Anytime we had to do version upgrades, we would just take one cluster offline and upgrade the software on it. If we had node failures, we would probably just take out one cluster, if we had to, or we would just have some spare nodes, which would rotate into our production clusters, and so forth. However, having multiple clusters led to a new set of issues. So the first problem was that since we have multiple clusters, you would end up with inconsistent schemas. So one of the things to understand about our platform is that we are an infrastructure team. So we don't actually own or manage any of the data that is served on the Vertica clusters. So we have dataset owners and publishers who manage their own datasets. Now, exposing multiple clusters to these dataset owners turns out to be not a great idea, right? Because they are not really aware of the importance of having consistency of schemas and datasets across different clusters. So over time, what we saw was that the schema for the same tables would basically get out of sync, because all the updates were not consistently applied on all clusters. Or maybe they were just experimenting with some new columns or some new tables in one cluster, but they forgot to delete them, whatever the case might be. We basically ended up in a situation where we saw a lot of inconsistent schemas, even across some of our core tables, in our different clusters.
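The queueing effect described here can be sketched with the textbook M/M/1 result, where the expected time in the system is 1 / (service rate minus arrival rate); the rates below are made up, but they show how latency blows up as a single cluster approaches saturation.

    # Hypothetical single-cluster capacity: 100 queries/minute.
    service_rate = 100.0

    for arrival_rate in (50, 80, 90, 95, 99):
        # M/M/1 expected time in system, in minutes: W = 1 / (mu - lambda)
        wait = 1.0 / (service_rate - arrival_rate)
        print(f"load {arrival_rate / service_rate:.0%}: "
              f"avg time in system {wait * 60:.1f} seconds")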
A second issue was, since we had ingestion pipelines that were ingesting data independently into all these clusters, these pipelines could fail independently as well. So what this meant is that if, for instance, the ingestion pipeline into cluster B failed, then the data there would be older than clusters A and C. So, when a query comes in from the BI layer, and if it happens to hit B, you would probably see different results than you would if you went to A or C. And this was obviously not an ideal situation for our end users, because they would end up seeing slightly inconsistent, slightly different counts. And that would lead to a bad situation for them, where they would not be able to fully trust the data, and the results and insights that were being returned by the SQL queries and Vertica systems. And then the third problem was, we had a lot of extra replication. So the 20/80 Rule, or maybe even the 90/10 Rule, applies to datasets on our clusters as well. So less than 10% of our datasets serve, for instance, 90% of the queries, right? And so it doesn't really make sense for us to replicate all of our data on all the clusters. And so having a setup where we had to do that was obviously very suboptimal for us. So then what we did was we basically built some additional systems to solve these problems. So this brings us to the Vertica ecosystem that we have in production today. So on the ingestion side, we built a system called Vertica Data Manager, which basically manages all the ingestion into the various clusters. So at this point, people who are managing datasets, or dataset owners and publishers, no longer have to be aware of individual clusters. They just set up their ingestion pipelines with an endpoint in Vertica Data Manager. And the Vertica Data Manager ensures that all the schemas and data are consistent across all our clusters. And on the query side, we built a proxy layer. So what this ensures is that, when queries come in from the BI layer, the query is forwarded smartly, with knowledge and data about which clusters are up, which clusters are down, which clusters are available, which clusters are loaded, and so forth. So with these two layers of abstraction between our ingestion and our query, we were able to have a very consistent, almost single-system view of our entire Vertica deployment. And the third bit we had put in place was the data manifest, which was the communication mechanism between ingestion and proxy. So the data manifest basically is a listing of which tables are available on which clusters, which clusters are up to date, and so forth. So with this ecosystem in place, we were also able to solve the extra replication problem. So now we basically have some big clusters, where all the core tables, and all the tables, in fact, are served. So any query that hits the less queried 90% of tables goes to the big clusters. And most of the queries, which hit the 10% of heavily queried, important tables, can also be served by many other small clusters, so it's a much more efficient use of resources. So this basically is the view that we have today of Vertica within Uber. So external to our team, folks just have an endpoint where they basically set up their ingestion jobs, and another endpoint where they can forward their Vertica SQL queries, which go to the proxy layer. So let's get a little more into the details about each of these layers. So, on the data management side, as I mentioned, we have two kinds of tables. So we have dimension tables.
So these tables are updated every cycle: the list of cities, the list of drivers, the list of users, and so forth. So these change not so frequently, maybe once a day or so. And since these datasets are not very big, we basically swap them out on every single cycle. Whereas the fact tables, these are tables which have information about our trips or Eats orders and so forth. So these are partitioned. So we have one partition roughly per day for the last couple of years, and then we have more of a hierarchical partition setup for older data. So what we do is we load the partitions for the last three days on every cycle. The reason we do that is because not all our data comes in at the same time. So we have updates for trips going over the past two or three days, for instance, where people add ratings to their trips, or provide feedback for drivers, and so forth. So we want to capture them all in the row corresponding to that particular trip. And so we update partitions for the last few days to make sure we capture all those updates. And we also update older partitions, if for instance records were deleted for retention purposes, or GDPR purposes, for instance, or other regulatory reasons. So we do this less frequently, but these are also updated if necessary. So there are endpoints which allow dataset owners to specify what partitions they want to update. And as I mentioned, data is typically managed using a hierarchical partitioning scheme. So in this way, we are able to make sure that we take advantage of the data being clustered by day, so that we don't have to update all the data at once. So when we are recovering from a cluster event, like a version upgrade or software upgrade, or a hardware fix or failure handling, or even when we are adding a new cluster to the system, the data manager takes care of updating the tables and copying all the new partitions, making sure the schemas are all right. And then we check the data and schema consistency and make sure everything is up to date before we add this cluster to our serving pool, and the proxy starts sending traffic to it. The second thing that the data manager provides is consistency. So the main thing we do here is we do atomic updates of our tables and partitions for fact tables, using a two-phase commit scheme. So what we do is we load all the new data into temp tables, in all the clusters, in phase one. And then when all the clusters give us success signals, we basically promote them to primary and set them as the main serving tables for incoming queries. We also optimize the load using Vertica Data Copy. So what this means is, earlier, in the parallel pipelines scheme, we had to ingest data individually from HDFS clusters into each of the Vertica clusters. That took a lot of HDFS bandwidth. But using this nice feature that Vertica provides called Vertica Data Copy, we just load the data into one cluster and then much more efficiently copy it to the other clusters. So this has significantly reduced our ingestion overheads and sped up our load process. And as I mentioned, as the second phase of the commit, all data is promoted at the same time. Finally, we make sure that all the data is up to date by doing some checks around the number of rows and various other key signals for freshness and correctness, which we compare with the data in the data lake. So in terms of schema changes, VDM automatically applies these consistently across all the clusters.
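Here is a hedged sketch of that load-then-promote pattern; the table names and file path are hypothetical. Vertica's hierarchical partitioning clause and its multi-table ALTER TABLE ... RENAME (which swaps tables atomically) are the real mechanisms this leans on.

    import vertica_python

    with vertica_python.connect(host='127.0.0.1', port=5433,
                                user='dbadmin', database='uber') as conn:
        cur = conn.cursor()
        # Staging table with daily partitions, grouped hierarchically so
        # older data consolidates into monthly/yearly partition groups.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS trips_staging (
                trip_id INT, trip_date TIMESTAMP, city_id INT, fare FLOAT)
            PARTITION BY trip_date::DATE
            GROUP BY CALENDAR_HIERARCHY_DAY(trip_date::DATE, 2, 2)
        """)
        # Phase 1: load the new day's data into staging on every cluster.
        cur.execute("COPY trips_staging FROM '/data/trips_2020_03_30.csv' "
                    "DELIMITER ','")
        # Phase 2: promote atomically; readers never see a half-swapped state.
        cur.execute("ALTER TABLE trips, trips_staging "
                    "RENAME TO trips_old, trips")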
So first, what we do is we stage these changes to make sure that they are correct. So this catches errors like trying to do an incompatible update, changing a column type or something like that. So we make sure that schema changes are validated, and then we apply them to all clusters atomically, again for consistency, and provide an overall consistent view of our data to all our users. So on the proxy side, we have transparent support for replicated clusters for all our users. So the way we handle that is, as I mentioned, the cluster-to-table mapping is maintained in the manifest database. And when we have an incoming query, the proxy is able to see which cluster has all the tables in that query, and route the query to the appropriate cluster based on the manifest information. Also, the proxy is aware of the health of individual clusters. So if for some reason a cluster is down for maintenance or upgrades, the proxy is aware of this information, and it does its monitoring based on query response and execution times as well. And it uses this information to route queries to healthy clusters, and to do some load balancing to ensure that we avoid hotspots on various clusters. So the key takeaways that I have from this talk are primarily these. So we started off with single cluster mode on Vertica, and we ran into a bunch of issues around scaling and availability due to cluster downtime. We then set up a bunch of replicated clusters to handle the scaling and availability issues. Then we ran into issues around schema consistency, data staleness, and data replication. So we built an entire ecosystem around Vertica, with abstraction layers around data management and ingestion, and a proxy. And with this setup, we were able to enforce consistency and improve storage utilization. So, hopefully this gives you all a brief idea of how we have been able to scale Vertica usage at Uber, and power some of our most business critical and important use cases. So as I mentioned at the beginning, I have an interesting and simple extra update for you. So an easy way in which you all can take advantage of many of the features that we have built into our ecosystem is to use the Vertica Eon mode. So the Vertica Eon mode allows you to set up multiple clusters with consistent data updates, and set them up at various different sizes to handle different query loads. And it automatically handles many of these issues that I mentioned in our ecosystem. So do check it out. We've also been trying it out on GCP, and initial results look very, very promising. So thank you all for joining me on this talk today. I hope you guys learned something new, and hopefully you took away something that you can also apply to your systems. We have a bit more time for some questions, so I'll pause for now and take any questions.
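To make the proxy's routing decision concrete, here is a purely hypothetical sketch of the manifest lookup described in the talk; the structures and policy are invented for illustration, not Uber's actual implementation.

    # Manifest: which tables each cluster serves, and its health.
    manifest = {
        'big_cluster_1':   {'tables': {'trips', 'eats_orders', 'cities'},
                            'healthy': True},
        'small_cluster_1': {'tables': {'trips'}, 'healthy': True},
        'small_cluster_2': {'tables': {'trips'}, 'healthy': False},
    }

    def route(tables_in_query):
        candidates = [name for name, info in manifest.items()
                      if info['healthy'] and tables_in_query <= info['tables']]
        if not candidates:
            raise RuntimeError('no healthy cluster serves all tables')
        # A real proxy would also weigh current load; here we just
        # prefer the smallest cluster that can satisfy the query.
        return min(candidates, key=lambda n: len(manifest[n]['tables']))

    print(route({'trips'}))            # hits a small cluster
    print(route({'trips', 'cities'}))  # must go to the big cluster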


UNLIST TILL 4/2 - Model Management and Data Preparation


 

>> Sue: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled Machine Learning with Vertica, Data Preparation and Model Management. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Waqas Dhillon. He's part of the Vertica Product Management Team at Vertica. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides, and yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Waqas, over to you. >> Waqas: Thank you, Sue. Hi, everyone. My name is Waqas Dhillon and I'm a Product Manager here at Vertica. So today, we're going to go through data preparation and model management in Vertica. The session would essentially start with some introduction, and then go through some of the considerations when you're doing machine learning at scale. After that, we have two main sections here. The first one is on data preparation, so we'd go through what data preparation is and what the Vertica functions for data exploration and data preparation are, and then share an example with you. Similarly, in the second part of this talk we'll go through importing and exporting models using PMML and how that works with Vertica, and we'll share examples from that, as well. So yeah, let's dive right in. So, Vertica essentially is an open architecture with a rich ecosystem. So, you have a lot of options for data transformation and for ingesting data from different tools, and then you also have options for connecting through ODBC, JDBC, and some other connectors to BI and visualization tools. There are a lot of them that Vertica connects to, and in the middle sits Vertica, which you can have on external tables, or you can have in-place analytics, on cloud or on prem, so that choice is yours, but essentially what it does is it offers you a lot of options for performing your data analytics at scale, and within that, machine learning is also a core component, and then you have a lot of options and functions for that. Now, machine learning in Vertica is actually built on top of the architecture that distributed data analytics offers, so it offers a lot of those capabilities and builds on top of them: you eliminate the overhead of data transfer when you're working with Vertica machine learning, you keep your data secure, and storing and managing the models is really easy and much more efficient.
You can serve a lot of concurrent users all at the same time, and then it's really scalable and avoids the maintenance cost of a separate system, so essentially a lot of benefits here. But one important thing to mention here is that all the algorithms that you see, whether they're analytics functions, advanced analytics functions, or machine learning functions, are distributed, and not just across the cluster on different nodes. So, each node gets a share of the distributed workload, and on each node, too, there might be multiple threads and multiple processes that are running each of these functions. So, it's a highly distributed solution, and one of a kind in this space. So, when we talk about Vertica machine learning, it essentially covers the whole machine learning process, and we see it as something starting with data ingestion, then doing data analysis and understanding, going through the steps of data preparation, modeling, evaluation, and finally deployment, as well. So, when you're using Vertica for machine learning, it takes care of all these steps and you can do all of that inside of the Vertica database. But when we look at the three main pillars that Vertica machine learning aims to build on, the first one is to have Vertica as a platform for high performance machine learning. We have a lot of functions for data exploration and preparation, and we'll go through some of them here. We have distributed in-database algorithms for model training and prediction, we have scalable functions for model evaluation, and finally we have distributed scoring functions, as well. Doing all of this stuff in the database, that's a really good thing, but we don't want it isolated in this space. We understand that a lot of our customers, our users, they like to work with other tools and work with Vertica, as well. So, they might use Vertica for data prep and another tool for model training, or use Vertica for model training and take those models out to other tools and do prediction there. So, integration is a really important part of our overall offering. So, it's a pretty flexible system. We have been offering UDXs in four languages, and a lot of people have found value there over the past few years, but the new capability of importing PMML models for in-database scoring, and exporting Vertica native models for external scoring, is something that we have recently added, and another talk will actually go through the TensorFlow integration, a really exciting and important milestone that we have, where you can bring TensorFlow models into Vertica for in-database scoring. For this talk, we'll focus on data exploration and preparation, importing PMML and exporting PMML models, and finally, since Vertica is not just a query engine but also a data store, we have a lot of really good capability for model storage and management, as well. So, yeah. Let's dive into the first part on machine learning at scale. So, when we say machine learning at scale, there are actually a few really important considerations, and they have their own implications. The first one is that we want to have speed, but we also want it to come at a reasonable cost. So, it's really important for us to pick the right scaling architecture. Secondly, it's not easy to move big data around.
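As a quick sketch of the PMML capability mentioned here, this is roughly what importing an externally trained model and scoring with it in the database looks like; the file path, model name, and columns are placeholders.

    import vertica_python

    with vertica_python.connect(host='127.0.0.1', port=5433,
                                user='dbadmin', database='VMart') as conn:
        cur = conn.cursor()
        # Bring a model trained elsewhere (e.g. scikit-learn, Spark)
        # into Vertica as a PMML model.
        cur.execute("""
            SELECT IMPORT_MODELS('/models/my_model.pmml'
                                 USING PARAMETERS category='PMML')
        """)
        # Score new rows where they already live; no data export needed.
        cur.execute("""
            SELECT PREDICT_PMML(x1, x2
                       USING PARAMETERS model_name='my_model')
            FROM new_data LIMIT 5
        """)
        print(cur.fetchall())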
It might be easy to do that on a smaller data set, on an Excel sheet, or something of the like, but once you're talking about big data and data analytics at really big scale, it's really not easy to move that data around from one tool to another, so what you'd want to do is bring the models to the data instead of having to move this data to the tools, and the third thing here is that sub-sampling can actually compromise your accuracy. A lot of tools that are out there still force you to take smaller samples of your data because they can only handle so much data, but that can impact your accuracy, and the need here is that you should be able to work with all of your data. We'll just go through each of these really quickly. So, the first factor here is scalability. Now, if you want to scale your architecture, you have two main options. The first is vertical scaling. Let's say you have a machine, a server essentially, and you can keep on adding resources, like RAM and CPU, and keep increasing the performance as well as the capacity of that system, but there's a limit to what you can do here, and you can hit that limit in terms of cost, as well as in terms of technology: beyond a certain point, you will not be able to scale more. So, the right solution to follow here is actually horizontal scaling, in which you can keep on adding more instances to have more computing power and more capacity. So, essentially what you get with this architecture is a super computer, which stitches together several nodes, and the workload is distributed on each of those nodes for massively parallel processing and really fast speeds, as well. The second aspect, having big data and the difficulty around moving it around, can actually be clarified with this example. So, what usually happens is, and this is a simplified version, you have a lot of applications and tools from which you might be collecting the data, and this data then goes into an analytics database. That database then in turn might be connected to some BI tools, dashboards, and applications, and some ad-hoc queries being done on the database. Then, you want to do machine learning in this architecture. What usually happens is that you have your machine learning tools, and the data that is coming in to the analytics database is actually being exported out to the machine learning tools. You're training your models there, and afterwards, when you have new incoming data, that data again goes out to the machine learning tools for prediction. The results that you get from those tools usually end up back in the analytics database, because you want to put them on a dashboard or you want to power up some applications with them. So, there's essentially a lot of data overhead that's involved here. There are cons to that, including data governance, data movement, and other complications that you need to resolve here. One of the possible solutions to overcome that difficulty is that you have machine learning as part of the distributed analytical database as well, so you get the benefits of having it applied on all of the data that's inside of the database and not having to care about all of the data movement there. But if there are some use cases where it still makes sense to at least train the models outside, that's where you can do your data preparation inside of the database, then take the prepared data out, build your model, and then bring the model back to the analytics database. In this case, we'll talk about Vertica.
So, the model would be archived, hosted by Vertica, and then you can keep on applying predictions on the new data that's incoming into the database. So, the third consideration here for machine learning at scale is sampling versus the full data set. As I mentioned, a lot of tools cannot handle big data and you are forced to sub-sample, but what happens here, as you can see in the leftmost figure, figure A, is that if you have a single data point, essentially any model can explain it, but if you have more data points, as in figure B, there would be a smaller number of models that would be able to explain them, and in figure C, with even more data points, an even smaller number of models explains them. But fewer here also means that these models would probably be more accurate, and the objective of building machine learning models is mostly to have prediction capability and generalization capability, essentially, on unseen data, so if you build a model that's accurate on one data point, it would not have very good generalization capabilities. The conventional wisdom with machine learning is that the more data points you have for learning, the better and more accurate models you'll get out of your machine learning process. So, you need to pick a tool which can handle all of your data and does not force you to sub-sample it, and when you do that, even a simpler model might be much better than a more complex model here. So, yeah. Let's go to the data exploration and data preparation part. Vertica's a really powerful tool and it offers a lot of scalability in this space, and as I mentioned, it will support the whole process. You can define the problem, you can gather your data and construct your data set inside Vertica, and then carry it through data preparation, training and modeling, deployment, and managing the model. And this is a really critical step in the overall machine learning process: some estimates put data preparation at between 60 and 80% of the overall effort of a machine learning process. So, there are a lot of functions here. You can use them within Vertica to do data exploration, de-duplication, outlier detection, balancing, normalization, and potentially a lot more. You can actually go to our Vertica documentation and find them there. Within Vertica we divide them into two parts. Within data prep, one part is exploration functions, the second is transformation functions. Within exploration, you have a rich set of functions that you can use in the database, and then if you want to build your own, you can use the UDXs to do that. Similarly, for transformation, there are a lot of functions around time series, pattern matching, and outlier detection that you can use to transform the data, and this is just a snapshot of some of those functions that are available in Vertica right now. And again, the good thing about these functions is not just their presence in the database. The good thing is actually their ability to scale to really, really large data sets and be able to compute those results for you, on that data set, in an acceptable amount of time, which is what makes your machine learning process practical. So, let's go to an example and see how we can use some of these functions. As I mentioned, there's a whole lot of them and we'll not be able to go through all of them, but just for our understanding we can go through some of them and see how they work. So, we have here a sample data set of network flows. It's a simulated attack from some source nodes, and then there are some victim nodes on which these attacks are happening.
So yeah, let's just look at the data here real quick. We'll load the data, we'll browse the data, compute some statistics around it, ask some questions, make plots, and then clean the data. The objective here is not to make a prediction, per se, which is what we mostly do with machine learning algorithms, but to just go through the data prep process and see how easy it is to do that with Vertica, and what kind of options might be there to help you through that process. So, the first step is loading the data. Since in this case we know the structure of the data, we create a table with the different column names and data types. But let's say you have a data set for which you do not already know the structure: there's a really cool feature in Vertica called flex tables, and you can use that to initially import the data into the database, and then go through all of the variables and assign them types. You can also use that, if your data is dynamic and it's changing, to load the data first and then create these definitions. So once we've done that, we load the data into the database. It's one week of data out of the whole data set right now, but once we've done that, we'd like to look at the flows just to see how the data looks, and once we do a select star from flows, with just a limit here, we see that there's already some data duplication, and by duplication I mean rows which have the exact same data in each of the columns. So, as part of the cleaning process, the first thing we'd want to do is probably remove that duplication. So, we create a table with distinct flows, and you can see here we have about a million flows which are unique. So, moving on. The next step we want to do here, this is essentially time series data, and these timestamps span the days of the week, so we want to look at the trends in this data. So, the network traffic that's there, you can call it flows. So, based on the hours of the day, how does the traffic move, and how does it differ from one day to another? It's part of an exploration process. There might be a lot of further exploration that you want to do, but we can start with this one and see how it goes, and you can see in the graph here that we have seven days of data, and the weekend traffic, which is in pink and purple here, seems a little different from the rest of the days. Pretty close to each other, but yeah, definitely something we can look into and see if there's some real difference and if there's something we want to explore further here. But the thing is that this is just data for one week, as I mentioned. What if we load data for 70 days? You'd probably have a longer graph, but with a lot of lines, and you would not really be able to make sense out of that data. It would be a really crowded plot, so we have to come up with a better way to explore that, and we'll come back to that in a little bit. So, what are some other things that we can do? We can get some statistics; we can take one sample flow and look at some of the values here. We see that the forward column here and the ToS column here have zero values, and when we explore further, we see that there are a lot of records for which these columns are essentially zero, so they're probably not really helpful for our use case. Then, we can look at the flow end.
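For the flex table point above, here is a hedged sketch of loading schemaless data first and deriving the structure afterwards; the file path is a placeholder.

    import vertica_python

    with vertica_python.connect(host='127.0.0.1', port=5433,
                                user='dbadmin', database='VMart') as conn:
        cur = conn.cursor()
        # Land the data without declaring columns up front.
        cur.execute("CREATE FLEX TABLE flows_flex()")
        cur.execute("COPY flows_flex FROM '/data/flows_week1.csv' "
                    "PARSER fcsvparser()")
        # Infer the keys and build a relational view over them.
        cur.execute(
            "SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('flows_flex')")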
Then, we can look at the flow end. So, flow end is the end time when the last packet in a flow was sent, and you can do a select min flow end and max flow end to see the dates when the data started and ended, and you can see it's about one week of data, from the first till the eighth. Now, we also want to look at whether the data is balanced or not, because balanced data is really important for a lot of the classification use cases that we might want to try with this. When you look at source address, destination address, source port, and destination port, you see the data is highly imbalanced, the source address space versus the destination address space in particular, so that's probably something we need to address. There are really powerful balancing functions that you can use within Vertica, with under-sampling, over-sampling, or hybrid sampling, and those can be really useful here. Another thing we can look at is summary statistics of these columns, so on the unique flows table that we created we just use the SUMMARIZE_NUMCOL function in Vertica, and it gives us a lot of really cool statistics and percentile information. Now, if we look at the duration, which is the last record here, we can see that the mean is about 4.6 seconds, but when we look at the percentile information, we see that the median is about 0.27. So, there are a lot of short flows with a duration of less than 0.27 seconds. Yes, there are longer flows too, and they're what pull the mean up to the 4.6 value, but the number of short flows is clearly pretty high. We can ask some other questions of the data, about the features. We can look at the protocols here and look at the counts. We see that most of the traffic we have is TCP and UDP, which is sort of expected for a data set like this, and then we want to look at what the most popular network services are. So again, a simple query here: select destination port and count, and we get the destination port and the count for each. We can see that most of the traffic here is web traffic, HTTP and HTTPS, followed by domain name resolution. So, let's explore some more. We can look at the label distributions. This is essentially data for which we already know whether a record was an anomaly or not, so we can create our algorithm based on that. We see that there's this background label with a lot of records, and then anomaly spam seems to be really high. There are anomaly UDP scans and SSH scans, as well. So, another question we can ask is how the labels are distributed among the SMTP flows, and we can see that anomaly spam is highest, and then comes background spam. So, can we say from this that SMTP flows are spam, and maybe build a model that actually answers that question for us? That can be one machine learning model you could build out of this data set. Again, we can also verify the destination port of flows that were labeled as spam. You would expect port 25 for the SMTP service here, and we can see that for SMTP with destination port 25 you have a lot of counts, but there are some other destination ports for which the count is really low, and essentially, when we're doing an analysis at this scale, these data points might not really be needed. So, as part of the data prep slash data cleaning, we might want to get rid of those records.
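For concreteness, here is a hedged sketch of the balancing and summary steps just walked through. SUMMARIZE_NUMCOL and BALANCE are documented Vertica functions; the table and column names are again illustrative:

```sql
-- Column-wise summary statistics: count, mean, stddev, min, percentiles, max.
SELECT SUMMARIZE_NUMCOL(duration, packets, bytes) OVER () FROM unique_flows;

-- Rebalance a skewed label column into a view; 'hybrid_sampling' both
-- over-samples minority classes and under-samples majority classes.
SELECT BALANCE('balanced_flows', 'unique_flows', 'label', 'hybrid_sampling');

-- Most popular network services, as in the walkthrough above.
SELECT dst_port, COUNT(*) AS flow_count
FROM unique_flows
GROUP BY dst_port
ORDER BY flow_count DESC
LIMIT 10;
```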
So now, what we can do, going back to the graph that I showed earlier, is try to plot the daily trends by aggregating them. Again, we take the unique flows and convert them into flow counts, a manageable number of features that we can then feed into one of the algorithms. Now, PCA, principal component analysis, is a really powerful algorithm in Vertica, and what it essentially does is this: a lot of times you have a high number of columns which might be highly correlated with each other, and you can feed them into the PCA algorithm and it will give you a list of principal components which are linearly independent of each other. Each of these components explains a certain share of the variance of the overall data set that you have. So, you can see here that component one explains about 73.9% of the variance, and component two explains about 16% of the variance. If you combine those two components alone, that gets you to around 90% of the variance. Now, you can use PCA for a lot of different purposes, but in this specific example, we want to see, if we aggregate all the data points we have by day of the week, what sort of information we can get out of it. Is there any insight this provides? Because once you're down to two components, it's really easy to plot them. So, we just apply the PCA: we first train it, and then apply it on our data set, and this is the graph we get as a result. You can see component one is on the x axis here, component two on the y axis, and each of these points represents a day of the week. Now, with just two components it's easy to plot, and compare this to the graph that we saw earlier, which had a lot of lines, where the more weeks or days we added, the more lines we'd have. Versus this graph, in which you can clearly tell that the five days of traffic from Monday till Friday are closely clustered together, so probably pretty similar to each other, while Saturday's traffic sits well apart from all of those days and is also further away from Sunday's. So, these two days of traffic are different from the other days, and we can always dive deeper into this, look at exactly what's happening, and see how this traffic is actually different. But with just a few functions and some pretty simple SQL queries, we were already able to get a pretty good insight from the data set that we had.
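A hedged sketch of that PCA step, using Vertica's documented PCA and APPLY_PCA functions; the aggregated table `daily_flow_counts` and its `day_of_week` key column are assumptions for illustration:

```sql
-- Fit a PCA model on the per-day aggregate features.
SELECT PCA('pca_flows', 'daily_flow_counts', '*'
           USING PARAMETERS exclude_columns='day_of_week');

-- Project each day onto the first two principal components, keeping the
-- day label as a key so the result can be plotted directly.
SELECT APPLY_PCA(* USING PARAMETERS model_name='pca_flows',
                                    exclude_columns='day_of_week',
                                    key_columns='day_of_week',
                                    num_components=2) OVER ()
FROM daily_flow_counts;
```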
Now, let's move on to the next part of this talk, on importing and exporting PMML models to and from Vertica. So, current common practice is that when you're putting your machine learning models into production, you have a dev or test environment, and in that you might be using a lot of different tools, scikit-learn, Spark, R, and once you want to deploy these models into production, you put them into containers, and there's a pool of containers in the production environment talking to your database, which could be your analytical database, and all of the new incoming data lands in the database itself. So, as I mentioned on one of the earlier slides, there is a lot of data transfer happening between that pool of containers hosting your machine learning models and the database, from which you'd be pulling data for scoring and then sending the scores back. So, why would you really need to transfer your models? The thing is that no machine learning platform provides everything. One tool might have some really cool algorithms, but then Spark might have its own benefits in terms of some additional algorithms or other capabilities that you're looking for, and that's the reason a lot of these tools might be used in the same company at the same time. And then there might be some functional considerations, as well. You might want to isolate your data science team's environment from your production environment, or you might want to score your pre-trained models on some edge nodes where you probably cannot host a big solution. So there is a whole range of use cases where model movement, model transfer from one tool to another, makes sense. Now, one of the common methods for transferring models from one tool to another is the PMML standard. It's an XML-based model exchange format, sort of a standard way to define statistical and data mining models, and it helps you share models between the different applications that are PMML compliant. It's a really popular standard, and it's our standard of choice for moving models to and from Vertica. Now, alongside this model movement capability, there are a lot of model management capabilities that Vertica offers. Models are essentially first-class citizens of Vertica. What that means is that each model is associated with a DB schema, so the user that initially creates a model is its owner, but they can transfer ownership to other users and work with access rights in much the same way you would with any other relation in the database. So, the same kinds of commands that you use for granting access to a relation, changing its owner, changing its name, or dropping it, you can use on models as well, as the sketch below shows. There are also a lot of functions for exploring the contents of models, and that really helps in putting these models into production. The metadata of these models is also available for model management and governance, and finally, the import/export part enables you to apply all of these operations to models that you have imported or that you might want to export, while they're in the database. And I think it would be nice to actually go through an example to showcase some of these model management capabilities, including PMML model import and export. So, the workflow for export is that we'll take some data and train a logistic regression model on it, and save it as an in-DB Vertica model. Then, we'll explore the summary and attributes of the model, look at what's inside it, what the training parameters and coefficients are, and then we can export the model as PMML, and an external tool can import that model from PMML. And similarly, we'll go through an example for import. We'll have an external PMML model trained outside of Vertica, we'll import that PMML model, and from there on, essentially, we'll treat it as an in-DB PMML model. We'll explore the summary and attributes of the model in much the same way as for an in-DB model, we'll bring in some test data for which scoring needs to be done, and we'll apply the model for in-DB scoring and get the prediction results.
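Here is that sketch of the model management commands just described. ALTER MODEL and DROP MODEL are documented Vertica statements; the model, user, and role names are hypothetical, and the exact GRANT privilege set should be checked against the documentation for your version:

```sql
-- Models are schema objects, so they can be managed like other relations.
ALTER MODEL myModel OWNER TO data_scientist_1;  -- transfer ownership
ALTER MODEL myModel RENAME TO myModel_v2;       -- rename
GRANT USAGE ON MODEL myModel_v2 TO analysts;    -- share with a role
DROP MODEL myModel_v2;                          -- remove when obsolete
```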
So first, we want to create a connection with the database. In this case, we are using a Python Jupyter notebook. We have the Vertica Python connector here that you can use, a really powerful connector that allows you to do a lot of cool stuff with the database from the Jupyter front end, but essentially you can use any other SQL front-end tool or, for that matter, any other Python IDE which lets you connect to the database. So, exporting a model. First, we'll create a logistic regression model here: select logistic_reg, we give it a model name, then the input relation, which might be a table, temp table, or view, and then the response column and the predictor columns. So, we get a logistic regression model that we've built. Now, we look at the models table and see that the model has been created. This is a table in Vertica that contains a list of all the models that are in the database. We can see here that myModel, which we just created, has Vertica models as its category, its model type is logistic regression, and we have some other metadata around this model, as well. So now, we can look at some of the summary statistics of the model. We can look at the details: it gives us the predictors, coefficients, standard error, z value, and p value. We can look at the regularization parameters; we didn't use any, so lambda just shows a value of one, but if you had used them, they would show up here, along with the call string and additional information regarding iteration count, rejected row count, and accepted row count. Now, we can also look at the list of attributes of the model: select get_model_attribute using parameters model_name equals myModel. For this particular model that we just created, it gives us the names of all the attributes that are there. Similarly, you can look at the coefficients of the model in a column format, using parameters model_name equals myModel, and in this case we add attr_name equals details, because we want all the details for that particular model, and we get the predictor name, coefficient, standard error, z value, and p value here. So now, what we can do is export this model. We use select export_models and we give it a path to where we want the model to be exported, we give it the name of the model that needs to be exported, because you might essentially have a lot of models that you've created, and we give it the category, which in our example is PMML, and you get a status message here that the export has been successful.
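Pulled together, the export walkthrough above might look like this hedged sketch. LOGISTIC_REG, GET_MODEL_SUMMARY, GET_MODEL_ATTRIBUTE, and EXPORT_MODELS are documented Vertica functions; the training table, its columns, and the export path are illustrative assumptions:

```sql
-- Train an in-DB logistic regression model (schema names are assumed).
SELECT LOGISTIC_REG('myModel', 'train_flows', 'is_spam',
                    'duration, packets, bytes, dst_port');

-- Confirm it is registered in the catalog, then inspect it.
SELECT model_name, category, model_type FROM models;
SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='myModel');
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='myModel',
                           attr_name='details');

-- Export the model so any PMML-compliant tool can import it.
SELECT EXPORT_MODELS('/tmp/models', 'myModel'
                     USING PARAMETERS category='PMML');
```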
So now, let's move on to the importing models example. In much the same way that we created a model in Vertica and exported it out, you might want to create a model outside of Vertica in another tool and then bring it to Vertica for scoring, because Vertica holds all of the data, and scoring happens a lot more frequently than model training, so it makes sense to host that model in Vertica. In this particular case we do a select import_models, and we are importing a logistic regression model that was created in Spark. The category here, again, is PMML. We get the status message that the import was successful. Now, let's look at the models table and see that the model is really present there. Previously when we ran this query we had only myModel there, so that was the only entry you saw, but now that this model is imported, you can see it as line item number two here, the Spark logistic regression, in the public schema. The category here, however, is different, because it's not a model created in-DB but rather an imported one, so you get PMML here, and then other metadata regarding the model, as well. Now, let's do some of the same operations that we did with the in-DB model. We can look at the summary of the imported PMML model: you can see the function name, data fields, predictors, and some additional information here. Moving on, let's look at the attributes of the PMML model: select get_model_attribute, essentially the same query that we applied earlier, with only the model name being different. So, you get the attribute names, attribute fields, and number of rows. We can also look at the coefficients of the PMML model: name, exponent, and coefficient here. So yeah, pretty much similar to what you can do with an in-DB model. You can perform all of these operations on an imported model, and one additional thing we want to do here is use this imported model for prediction. So in this case, we do a select predict_pmml and give it some values, using parameters model_name, the Spark logistic regression, and match_by_pos, which is a really cool feature, set to true in this case. So, if you have a model being imported from another platform in which, let's say, you have 50 columns, the names of the columns in the environment in which you trained the model might be slightly different from the names of the columns that you have set up in Vertica, but as long as the order is the same, Vertica can match those columns by position, and you don't need to have the exact same names for them. So in this case, we have set that to true, and we see that predict_pmml gives us a result of one. Now, using the imported model, in this case we gave it a single set of values, but you can also use it on a table. In that case, you get the predictions here as well, and you can look at the accuracy metrics to see how well you did.
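A hedged sketch of that import-and-score path, using Vertica's documented IMPORT_MODELS and PREDICT_PMML functions; the path, model name, and scoring table are assumptions:

```sql
-- Import a PMML model that was trained elsewhere, e.g. exported from Spark.
SELECT IMPORT_MODELS('/tmp/models/spark_logistic_reg'
                     USING PARAMETERS category='PMML');

-- Score a whole table in-database; match_by_pos lets column order, rather
-- than column names, line up with the model's expected inputs.
SELECT PREDICT_PMML(duration, packets, bytes, dst_port
                    USING PARAMETERS model_name='spark_logistic_reg',
                                     match_by_pos=true) AS prediction
FROM test_flows;
```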
Now, just to wrap this part up, it's really important to understand the distinction between using your models in a single-node tool that you might already be using, like Python or R, versus Vertica. Let's say you build a model in Python; that might be a single-node solution. Now, after building that model, suppose you want to do prediction on really large amounts of data, and you don't want the overhead of having to move that data out of the database every time you want to predict. What you can do is import that model into Vertica, and what Vertica does differently than Python is that the PMML model is actually distributed across each node in the cluster, so it is applied to the data segments on each of those nodes, with multiple threads running for that prediction. So, the prediction speed that you get here would be much, much faster. Similarly, once you build a machine learning model in Vertica, the objective mostly is to use all of your data and build a model that's accurate, not one trained on just a sample of the data, but on all the data that's available to it. So, you can build that model, and the model building process goes the same way: it is distributed across all nodes in the cluster, using all the threads and processes available to it within those nodes. So, really fast model training. But let's say you wanted to deploy it on an edge node and maybe do prediction closer to where the data is being generated; you can export that model in PMML format and deploy it on the edge node. So, it's really helpful for a lot of use cases. And just some closing takeaways from our discussion today. Vertica's a really powerful tool for machine learning, for data preparation, model training, prediction, and deployment. You might want to use Vertica for all of these steps or just some of them; either way, Vertica supports both approaches. In the upcoming releases, we are planning to have more import and export capability for PMML models. Initially, we're supporting k-means, linear, and logistic regression, but we'll keep on adding more algorithms, and the plan is to actually move to supporting custom models. If you want to do that with the upcoming release, our TensorFlow integration is also there, which you can use, but PMML is the starting point for us and we'll keep on improving it. Vertica models can be exported in PMML format for scoring on other platforms, and similarly, models that get built in other tools can be imported for in-DB machine learning and in-DB scoring within Vertica. There are a lot of critical model management tools provided in Vertica, and there are more on the roadmap as well, which we'll keep on developing. Many ML functions and algorithms are already part of the in-DB library, and we keep adding to that, too. So, thank you so much for joining the discussion today, and if you have any questions we'd love to take them now. Back to you, Sue.

UNLIST TILL 4/2 - The Data-Driven Prognosis


 

>> Narrator: Hi, everyone, thanks for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled "Toward Zero Unplanned Downtime of Medical Imaging Systems Using Big Data". My name is Sue LeClaire, director of marketing at Vertica, and I'll be your host for this webinar. Joining me is Mauro Barbieri, lead architect of analytics at Philips. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't get to, we'll do our best to answer offline. Alternatively, you can also visit the Vertica forums to post your question there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. So let's get started. Mauro, over to you. >> Thank you, good day everyone. So medical imaging systems such as MRI scanners, interventional guided therapy machines, CT scanners, and X-ray systems need to provide hospitals optimal clinical performance but also a predictable cost of ownership. Clinicians understand the need for maintenance of these devices, but they just want it to be non-intrusive and scheduled. And whenever there is a problem with the system, the hospital expects Philips service to resolve it fast, and at the first interaction. In this presentation you will see how we are using big data to increase the uptime of our medical imaging systems. I'm sure you have heard of the company Philips. Philips is a company that was founded 129 years ago, in 1891, in Eindhoven in the Netherlands, and it started by manufacturing light bulbs and other electrical products. The two brothers Gerard and Anton took an investment from their father Frederik, and they set out to manufacture and sell light bulbs. And as you may know, a key technology for making light bulbs was glass and vacuum technology. So when you're good at making glass products, vacuum, and light bulbs, it's an easy step to start making radio valves, like they did, but also X-ray tubes. So Philips actually entered very early into the market of medical imaging and healthcare technology. And this is our core as a company, and it's also our future. So, healthcare. I mean, we are in a situation now in which everybody recognizes the importance of it. And we see incredible trends, a transition from what we call volume-based healthcare to value-based, where the clinical outcomes are driving improvements in the healthcare domain. Where it's not enough to respond to healthcare challenges, but we need to be involved in preventing illness and maintaining the population's wellness, and from a situation in which we are episodically in touch with healthcare, we need to continuously monitor and continuously take care of populations. And from healthcare facilities and technology available to a few select and rich countries, we want to make healthcare accessible to everybody throughout the world. And this, of course, poses incredible challenges.
And this is why we are transforming Philips to become a healthcare technology leader. Philips has been a conglomerate active in many sectors, realizing many kinds of technologies, and we have been refocusing on healthcare. And we have been transitioning from creating and selling products to making solutions that address clinical challenges, and from selling boxes to creating long-term relationships with our customers. And so, if you have known the Philips brand from shavers, from televisions, from light bulbs, you probably now also recognize the involvement of Philips in the healthcare domain: in diagnostic imaging, in ultrasound, in image-guided therapy systems, in digital pathology, non-invasive ventilation, as well as patient monitoring, intensive care, telemedicine, but also radiology, cardiology, and oncology informatics. Philips has become a powerhouse of healthcare technology. To give you an idea of this, these are the numbers from 2019: almost 20 billion in sales, 4% comparable sales growth with respect to the previous year, and about 10% of sales reinvested in R&D. This is also shown in the number of patent rights: last year we filed more than 1,000 patents in the healthcare domain. And the company has about 80,000 employees, active globally in over 100 countries. So, let me focus now on the type of products that are in the scope of this presentation. This is a Philips magnetic resonance imaging scanner, the Ingenia 3.0 Tesla. It's an incredible machine. Apart from being very beautiful, as you can see, it's a very powerful technology. It can make high-resolution images of the human body without harmful radiation. And it's a complex machine. First of all, it's massive: it weighs 4.6 thousand kilograms. And it has superconducting magnets cooled with liquid helium at -269 degrees Celsius. And it's actually full of software, millions and millions of lines of code. And it occupies three rooms: what you see in this picture, the examination room, but there is also a technical room which is full of equipment, custom hardware, and machinery that is needed to operate this complex device. This is another system, an interventional guided therapy system, where X-ray is used during interventions with the patient on the table. You see on the left what we call the C-arm, a robotic arm that moves and can take images of the patient while they are being operated on; it's used for cardiology interventions, neurological interventions, cardiovascular interventions. There's a table that moves in very complex ways, and again it occupies two rooms: the room that we see here, but also a room full of cabinets, hardware, and computers. Another characteristic of this machine is that it is used during medical interventions, and so it has to interact with all kinds of other equipment. This is another system, a computed tomography scanner, the IQon, which is unique due to its spectral detection technology. It has an image resolution up to 0.5 millimeters, making thousand-by-thousand-pixel images. And it is also a complex machine. This is a picture of the inside of a comparable device, not exactly an IQon, but it has, again, a rotating gantry, which weighs two and a half tons. So, it's a combination of an X-ray tube on top, high-voltage generators to power the X-ray tube, and an array of detectors to create the images.
And this rotates at 220 revolutions per minute, making 50 frames per second to do 3D reconstructions of the body. So, a lot of technology, complex technology, and this technology is made for this situation: we make it for clinicians, who are busy saving people's lives. And of course, they want optimal clinical performance; they want the best technology to treat their patients. But they also want a predictable cost of ownership and predictable system operations. They want their clinical schedules not interrupted. So, they understand these machines are complex and full of technology, and that these machines may require maintenance, may require software updates, and sometimes may even require some parts to be replaced, but they don't want it unplanned. They don't want unplanned downtime. They would hate having to send patients home and to have to reschedule visits. So they understand maintenance; they just want it scheduled, predictable, and non-intrusive. So already a number of years ago, we started a transition from what we call reactive maintenance service of these devices to proactive. Let me show you what we mean by this. Normally, if a system in the field has an issue, the traditional reactive workflow would be this: the customer calls a call center and reports the problem. The company servicing the device dispatches a field service engineer; the field service engineer goes on site and does troubleshooting, literally smelling, listening for noise, watching for blinking LEDs or other unusual signs, troubleshoots the issue, finds the root cause, and perhaps decides that a spare part needs to be replaced. He orders the spare part, and the part has to be delivered to the site. Either immediately, or, coming back another day when the part is available, the engineer performs the repair. That means replacing the part, doing all the needed tests and validations, and finally releasing the system for clinical use. So as you can see, there are a lot of steps, and also handovers of information, between different people, between different organizations even. Would it be better to actually keep monitoring the installed base, keep observing the machine, and, based on the information collected, detect or even predict when an issue is going to happen? And then, instead of reacting to a customer calling, proactively approach the customer, schedule preventive service, and therefore avoid the problem. This is what we call proactive service, and this is what we have been transitioning to using big data. And big data is just one ingredient; in fact, there are more things that are needed. The devices themselves need to be designed for reliability and predictability. If the device is a black box that does not communicate its status to the outside world, if it does not transmit data, then of course it is not possible to observe it and therefore predict issues. This, of course, requires a remote service infrastructure, or an IoT infrastructure as it is called nowadays: the capability to connect the medical device with a data center, an enterprise infrastructure, collect the data, and perform the remote troubleshooting and the predictions.
Also, the right processes and the right organization need to be in place, because an organization that is, you know, waiting for the customer to call, and then has a number of field service engineers available and a certain amount of spare parts in stock, is a different organization from one that continuously observes the installed base and schedules actions to prevent issues. Another pillar is knowledge management. In order to realize predictive models and to have predictive service actions, it's important to manage knowledge about failure modes and maintenance procedures very well, to have it standardized, digitalized, and available. And last but not least, of course, the predictive models themselves. So we talked about transmitting data from the installed base, from the medical devices, to an enterprise infrastructure that would analyze the data and generate predictions; the predictive models are exactly the last ingredient that is needed. So this is not something that, you know, I'm telling you for the first time; it is actually a strategic intent of Philips, where we aim for zero unplanned downtime, and we market it that way. It is also not a secret that we do it by using big data. And, of course, there could be other methods of achieving the same goal, but we started using big data already quite many years ago. And one of the reasons is that our medical devices are already wired to collect lots of data about their functioning. They collect events, error logs, and sensor data. To give you an idea, just as an order of magnitude of the size of the data, one MRI scanner can log more than 1 million events per day, hundreds of thousands of sensor readings, and tens of thousands of many other data elements. So this is truly big data. On the other hand, this data was actually not designed for predictive maintenance. You have to think that a medical device of this type stays in the field for about 10 years, some a little bit longer, some shorter. So these devices were designed 10 years ago, and not all components were designed with predictive maintenance in mind, with IoT, and with the latest technology; at that time, you know, things were not so forward-looking. So the key challenge is taking the data which is already available, which is already logged by the medical devices, integrating it, and creating predictive models. And if we dive a little bit more into the research challenges, this is one of them: how to integrate diverse data sources, and especially how to automate the costly process of data provisioning and cleaning? But also, once you have the data, how to create models that can predict failures and the degradation of performance of a single medical device? Once you have these models and alerts, another challenge is how to automatically recommend service actions based on the probabilistic information on these possible failures. And once you have the insights, even if you can recommend an action, recommending it should be done with the goal of planning maintenance for generating value. That means balancing costs and benefits, preventing unplanned downtime without, of course, scheduling unnecessary interventions, because every intervention is a disruption to the clinical schedule.
And there are many more applications that can be built, such as the optimal management of spare parts supplies. So how do you approach this problem? Our approach was to collect into one database, Vertica, a large amount of historical data: first of all, historical data coming from the medical devices, so event logs, parameter values, system configurations, sensor readings, all the data that we have at our disposal, and in the same database, records of failures, maintenance records, service work orders, part replacements, contracts, so basically the evidence of failures. And once you have data from the medical devices and data about the failures in the same database, it becomes possible to correlate event logs, errors, and sensor readings with records of failures, part replacements, and maintenance operations. And we did that with a specific approach. We created integrated teams, and every integrated team had three figures, not necessarily three people, there were actually multiple people. But there was at least one business owner from the service organization. This business owner is the person who knows what is relevant, which use cases are worth solving for a particular type of product or a particular market, what basically generates value or is worthwhile tackling as an organization. And we have data scientists; data scientists are the ones who actually can manipulate data. They can write the queries, they can write the models and robust statistics, they can create visualizations, and they are the ones who really manipulate the data. Last but not least, and very important, are subject matter experts. Subject matter experts are the people who know the failure modes, who know about the functioning of the medical devices. Perhaps they even designed them, they come from the design side, or they come from the service innovation side or even from the field, people who have been servicing the machines in real life for many, many years. So, they are familiar with the failure modes, but also familiar with the type of data that is logged, and the processes, and how the systems actually behave, if you allow me, in the wild, in the field. The combination of these three figures was key. Because data scientists alone, basically statisticians, people who can do machine learning, are not very effective on their own, because the data is too complicated, too complex, so they would spend a huge amount of time just trying to figure out the data. Or perhaps they would spend their time tackling things that are useless, because a subject matter expert knows much more quickly which data points are useful, and which phenomena can be found in the data and which probably cannot. So the combination of subject matter experts and data scientists is very powerful, and together, guided by a business owner, we could tackle the most useful use cases first. So, these teams set out to work, and they developed three things, mainly. First of all, they developed insights on the failure modes. By looking at the data and analyzing information about what happened in the field, they found out exactly how things fail, in a very pragmatic and quantitative way. They also, of course, set out to develop the predictive models with associated alerts and service actions. And a predictive model is not just an alert, not just a flag that turns on like a traffic light, you know; there's much more to it than that.
Such an alert is to be interpreted and used by a highly skilled and trained engineer, for example in a call center, who needs to evaluate the alert and plan a service action. A service action may involve ordering a replacement for an expensive part; it may involve calling up the customer hospital and scheduling a period of downtime to replace that part. So it has, or could have, an impact on the clinical practice. Therefore, it is important that the alert is coupled with sufficient evidence and information for such a highly skilled, trained engineer to plan the service action efficiently. So, it's a lot of work in terms of preparing data, preparing visualizations, and making sure that all information is represented correctly and in a compact form. Additionally, these teams gain insight into the failure modes, and so they can provide input to the R&D organization to improve the products. To summarize this graphically: we took a lot of historical data coming from the medical devices, but also data from relational databases with the service work orders, the part replacements, the contract information; we integrated it, and we set out to do data analytics. From there we don't have value yet; value only starts appearing when we use the insights of data analytics, the models, on live data. When we process live data with the models we can generate alerts, the alerts can be used to plan the maintenance, and the planned maintenance replacing downtime is what creates value. To give an idea of the type, I cannot show you the details of all of these predictive models, but this is just a picture of some of the components of our medical devices for which we have models, for which we know the failure modes: hard disks, clinical-grade monitors, X-ray tubes, and so forth. This is for MRI machines: a lot of custom hardware and other types of amplifiers and electronics. The alerts are then displayed in a dashboard, what we call a remote monitoring dashboard. We have a team of remote monitoring engineers that basically surveys the installed base, looks at this dashboard, and picks up these alerts. And an alert, as I said before, is not just one flag; it contains a lot of information about the failure and about the medical device. The remote monitoring engineers will pick up these alerts, review them, and create cases for the market organizations to handle. So, they see an alert coming in, they create a case, so that the particular call center in some country can call the customer and make an appointment to schedule a service action, or it can add a preventive action to the schedule of the field service engineer who's already supposed to visit the customer, for example. This is a high-level picture of the overall data platform architecture. On the bottom we have the installed base; the installed base is formed by all our medical devices that are connected to our Philips remote service network. Data is transmitted in a secure way to our enterprise infrastructure, where we have a so-called data lake, which is basically an archive where we store the data as it comes from the customers; it is scrubbed and protected.
From there, we have ETL processes, Extract, Transform and Load, that in parallel analyze this information, parse all these files and all this data, and extract the relevant parameters. The reason is that the data coming from the medical devices is very verbose and in legacy formats, sometimes in binary formats, in strange legacy structures. And therefore, we parse it and we structure it and we make it, magically, usable by data science teams. And the results are stored in a Vertica cluster, in a data warehouse. In the same data warehouse, we also store information from other enterprise systems, from all kinds of databases: Microsoft SQL Server, Teradata, SAP, Salesforce applications. So, the enterprise IT systems are also connected to Vertica and their data is inserted into Vertica. And then from Vertica, the data is pulled by our predictive models, which are Python and R scripts that run in our proprietary environment, HealthSuite Insights. From this environment we generate the alerts, which are then used by the remote monitoring application. And it's not the only application; this is the case of remote monitoring, but we also have applications for reactive remote service. So whenever we cannot predict or prevent an issue from happening, and we need to react to a customer call, we can still use the data to very quickly troubleshoot the system, find the root cause, and advise on the best service action. Additionally, there are reliability dashboards, because all this data can also be used to perform reliability studies and improve the design of the medical devices, and it is used by R&D. And access is possible with all kinds of tools: Vertica gives the flexibility to connect with JDBC, to create dashboards using Power BI or QlikView, or to just simply use R and Python directly to perform analytics. A little summary of the size of the data: for the moment we have integrated about 500 terabytes worth of data tables, about 30 trillion data points, from more than eighty different data sources, for our complete connected installed base, including our customer relationship management system, SAP. We have also integrated data from the factory and repair shops; this is very useful because having information from the factory allows us to characterize components and devices when they are new, when they are still unused, so we can model degradation and predict failures much better. Also, we have many years of historical data and, of course, 24/7 live feeds. To get all this going, we have chosen very simple designs from the very beginning; the first system was developed back in 2015. At that time, we went from scratch to production in eight months, and it is also a very stable system. To achieve that, we apply what we call exhaustive error handling. Most people attending this conference probably know that when you are dealing with big data, you face all kinds of corner cases you would think could never happen; just because of the sheer volume of the data, you find all kinds of strange things. And that's what you need to take care of if you want to have a stable platform, a stable data pipeline.
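The talk does not show the actual load jobs, but as an illustration of the best-effort versus all-or-nothing loading trade-off raised here (and again later, in the discussion of ETL design documents), a Vertica COPY statement can quarantine unparseable rows instead of failing the whole load. The table name and paths below are hypothetical:

```sql
-- Best-effort load: keep the rows that parse, quarantine the rest for
-- later inspection (illustrative table and paths).
COPY device_logs FROM '/data/incoming/device_logs_2020_03.gz' GZIP
     DELIMITER '|'
     REJECTED DATA '/data/rejects/device_logs_2020_03.rej'
     EXCEPTIONS '/data/rejects/device_logs_2020_03.exc';

-- All-or-nothing alternative: abort the entire load on the first bad row.
-- COPY device_logs FROM '/data/incoming/device_logs_2020_03.gz' GZIP
--      DELIMITER '|' ABORT ON ERROR;
```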
Another characteristic is that we need to handle live data, but we also need to be able to reprocess large historical data sets, because insights into the data are generated over time by the teams using the data, and very often they find not only defects but also have change requests: new data to be extracted, extracted in a different way, or aggregated in a different way. So basically, the platform is continuously crunching data. Also, components have built-in monitoring capabilities. Transparency builds trust: by showing how the platform behaves, people can actually trust that they have all the data which is available, or, if they don't see the data or something is not functioning, they can see why and where the processing has stopped. A very important point is documentation of data sources: every data point has so-called data provenance fields. That is, not only the medical device where it comes from, with all its identifiers, but also from which file, from which moment in time, from which row, from which byte offset that data point comes. And not only that, but also when this data point was created and by whom, by whom meaning which version of the platform and of the ETL created the data point. This allows us to identify issues, and when an issue is identified and fixed, it's possible to reprocess only the subset of the data that is impacted by that issue. Again, this creates trust in the data, which is essential for this type of application. We actually have different environments in our analytics solution. One, that we call the data science environment, is more or less what I've shown so far; it's deployed in our Philips private cloud, but it can also be deployed in a public cloud such as Amazon. It contains the years of historical data and allows interactive data exploration, human queries, so it is a highly variable load. It is used for the training of machine learning algorithms, and its design is meant to allow rapid prototyping on large data volumes. Another environment is the so-called production environment, where we actually score the models with live data for the generation of the alerts. This environment does not require years of data, just months, because a model does not necessarily need years of data to make a prediction; some models need maybe a couple of weeks or a few months, three months, six months, depending on the type of data and on the failure being predicted. And this environment has highly optimized queries, because the applications are stable; it only changes when we deploy new models or new versions of the models. And it is designed and optimized for low latency, high throughput, and reliability; there is no human intervention, no human queries. And of course, there are development and staging environments. Another characteristic of all this work is what we call data-driven service innovation: in all this work, we use data in every step of the process. The first is business case creation. So, basically, some people ask how did you manage to unlock the investment to create such a platform and to work on it for years, you know, how did you start? Basically, we started with a business case, and for the business case, again, we used data.
Of course, you need to start somewhere, you need to have some data, but basically, you can use data to make a quantitative analysis of the current situation and also to make an as-accurate-as-possible quantitative estimate of the value creation. If you have that, you can justify the investment and you can start building. Next to that, data is used to decide where to focus your efforts. In this case, we decided to focus on the use cases that had the maximum estimated business impact, with business impact meaning here customer value as well as value for the company. So we want to reduce unplanned downtime, we want to give value to our customers, but it would not be sustainable if, for creating value, we would start replacing, you know, parts without any consideration for the cost of it. So it needs to be sustainable. Then we use data to analyze the failure modes, to actually dig into the data and understand how things fail, for visualization, and to do reliability analysis. And of course, data is key for feature engineering, for the development of the predictive models, for training the models, and for validation with historical data. So data is all over the place. And last but not least, this architecture generates new data about the alerts: about how good the alerts are, how well they can predict failures, how much downtime is being saved, how many issues have been prevented. So this also is data that needs to be analyzed; it provides insights on the performance of these models and can be used to improve the models further. And once you have the performance of the models, you can use data to quantify, as much as possible, the value that is created. And then you go back to the first step: you created the first business case with estimates; can you actually show that you are creating value? The more you can close this feedback loop and quantify it, the better it is for having more and more impact. Among the key elements that are needed for realizing this, I want to mention one about data documentation, a practice that we started already six years ago and that has proven to be very valuable. We always document how data is extracted and how it is stored, in data model documents. Data model documents specify how data goes from one place to the other, in this case from device logs, for example, to a table in Vertica. And they include things such as the definition of duplicates, queries to check for duplicates, and of course the logical design of the tables, the physical design of the tables, and the rationale. Next to it, there is a data dictionary that explains, for each column in the data model, from a subject matter expert perspective, what it means: its definition and meaning; if it's a measurement, the unit of measure and the range; or, if it's some sort of label, the expected values; and whether the value is raw or calculated. This is essential for maximizing the value of data, for allowing people to use the data. Last but not least, there is an ETL design document that explains how the transformation happens from the source to the destination, including, very importantly, the failure handling strategy. For example, when you cannot parse part of a file, should you load only what you can parse, or drop the entire file completely?
So, do you import best-effort, or do you do all-or-nothing? How do you populate records for which there is no value, what are the default values, how is the data normalized or transformed, and how do you avoid duplicates? This again is very important for giving the users of the data a full picture of the data itself. And this is a formal process: the documents are reviewed and approved by all the stakeholders, the subject matter experts and also the data scientists, and by a function that we have introduced called the data architect. And of course, the documents are available to the end users of the data. We even have links to the documents from the data warehouse itself. So if you get access to the database, and you're doing your research and you see a table or a view, and you think, well, that could be interesting, it looks like something I could use for my research, well, the data itself has a link to the document. So from the database, while you're exploring data, you can retrieve a link to the place where the document is available. This is just a quick summary of some of the results that I'm allowed to share at this moment. This is about image-guided therapy: using our remote service infrastructure, for remotely connected systems with the right contracts, we have reduced downtime by 14%. More than one out of three cases are resolved remotely, without an engineer having to go on site. 82% is the first-time-right fix rate; that means that the issue is fixed either remotely, or, if a visit to the site is needed, only one visit is needed. At that moment, the engineer arrives with the right part and fixes it straight away. And this results, on average, in 135 hours more operational availability per year, and therefore the ability to treat more patients for the same costs. I'd like to conclude by citing some nice testimonials from some of our customers, showing that the value that we've created is really high impact, and this concludes my presentation. Thanks for your attention so far. >> Thank you Mauro, very interesting. And we've got a number of questions that have come in, so let's get to them. The first one: how many devices has Philips connected worldwide, and how do you determine which related sensor data workloads get analyzed with Vertica? >> Okay, so these are really two questions. The first question, how many devices are connected worldwide: well, actually, I'm not allowed to tell you the precise number of connected devices worldwide, but what I can tell is that we are in the order of tens of thousands of devices, and of all types, actually. And then, how would we determine which related sensor gets analyzed with Vertica? As I said in the presentation, it's a combination of two approaches, a data-driven approach and a knowledge-driven approach. A knowledge-driven approach because we make maximum use of our knowledge of the failure modes and of the behavior of the medical devices and their components to select what we think are promising data points and promising features. However, from that moment on, data science kicks in, and data science is used to look at the actual data and come up with quantitative information about what is really happening. So, it could be that an expert is convinced that a particular range of values of a sensor is indicative of a particular failure.
And it turns out that maybe he was too optimistic, or the other way around, that in practice there are many other situations he was not aware of. That can happen. So thanks to the data, we, you know, get a better understanding of the phenomenon and we get better modeling. I hope that answers it. Any other questions? >> Yeah, we have another question. Do you have plans to perform any analytics at the edge? >> Now that's a good question. So I can't disclose our plans on this right now, but edge devices are certainly one of the options we look at to help our customers toward zero unplanned downtime. Not only that, but also to facilitate the integration of our solution with existing and future hospital IT infrastructure. I mean, we're talking about advanced security and privacy, guaranteeing that the data is always safe, that patient data and clinical data do not go outside the perimeter of the hospital, of course, while we enhance our functionality and provide more value with our services. So yeah, edge is definitely a very interesting area of innovation. >> Another question: what are the most helpful Vertica features that you rely on? >> I would say the first that comes to mind at this moment is ease of integration. Basically, with Vertica we are able to load any data source in a very easy way, and it can also be interfaced very easily with all types of applications. And this, of course, is not unique to Vertica; nevertheless, the added value here is that this is coupled with incredible speed, incredible speed for loading and for querying. So it's basically a very versatile tool to innovate fast for data science. Another thing is multiple projections, advanced encoding and compression. This allows us to perform optimizations only when we need them, and without having to touch applications or queries. So if we want to achieve high performance, we basically spend a little effort on improving the projections, and we can very often achieve dramatic increases in performance. Another feature is Eon Mode. This is great for cloud deployment.
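As an illustration of the projection tuning just mentioned, here is a hedged sketch of a query-specific Vertica projection; the table, columns, and sort order are hypothetical, not Philips's actual schema:

```sql
-- A projection holding the same data as the base table, but sorted and
-- segmented to favor per-device time range scans.
CREATE PROJECTION sensor_readings_by_device AS
SELECT device_id, reading_time, sensor_id, value
FROM sensor_readings
ORDER BY device_id, reading_time
SEGMENTED BY HASH(device_id) ALL NODES;

-- Populate the new projection from data already in the table.
SELECT START_REFRESH();
```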
>> Okay, another question. What is the number one lesson learned that you can share?

>> I think my advice would be: document and control your entire data pipeline, end to end, and create positive feedback loops. What I hear often is that enterprises that are not digitally native, and Philips is one of them, I mean, Philips is 129 years old as a company, so you can imagine the legacy that we have; we were not born with the web, like web companies were, with everything online and everything digital. So enterprises that are not digitally native sometimes struggle to innovate with big data, or to do data-driven innovation, because the data is not available or is in silos, data is controlled by different parts of the organization with different processes, and there is no super-strong enterprise IT system providing all the data for everybody through APIs. So my advice is, from the very beginning, to create as soon as possible an end-to-end solution, from data creation to consumption, that creates value for all the stakeholders of the data pipeline. It is important that everyone in the data pipeline, from the producers of the data to the consumers, gets a piece of the value, a piece of the cake. When the value is proven to all stakeholders, everyone will naturally contribute to keeping the data pipeline running and to keeping the quality of the data high. That's the lesson there.

>> Yeah, thank you. And in the area of machine learning, what types of innovations do you plan to adopt to help with your data pipeline?

>> So, in the area of machine learning, we're looking at things like automatically detecting the deterioration of models to trigger improvement actions, combined with active learning, again focused on improving the accuracy of our predictive models. Active learning is when additional human intervention, the labeling of difficult cases, is triggered. The machine learning classifier may not be able to classify correctly all the time, and instead of just randomly picking some cases for a human to review, you want the costly humans to review only the most valuable cases from a machine learning point of view: the ones that would contribute the most to improving the classifier. Another area is deep learning, which we're also working on, but also applications of more generic anomaly detection algorithms. The challenge with anomaly detection is that we are not only interested in finding anomalies, but also in recommending the proper service actions, because without a proper service action, an alert generated because of an anomaly loses most of its value. So this is where I think we, you know...

>> Go ahead.

>> No, that's it, thanks.
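A minimal sketch of the case-selection step in active learning, assuming a classifier that outputs failure probabilities for an unlabeled pool of cases: uncertainty sampling sends the experts the cases the model is least sure about, the ones closest to a 0.5 probability.

```python
# A hedged illustration, not Philips' implementation: pick the cases
# the current classifier is least certain about for costly human review.

def select_for_review(probabilities, budget):
    """Return indices of the `budget` cases whose predicted failure
    probability is closest to 0.5, i.e. where the model is least sure."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]

# Hypothetical predicted failure probabilities for unlabeled cases.
pool_probs = [0.02, 0.47, 0.91, 0.55, 0.60, 0.08, 0.50]

print(select_for_review(pool_probs, budget=3))
# -> [6, 1, 3]: the most ambiguous cases go to the human experts first.
```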
>> Okay, all right. So that's all the time we have today for questions. I want to thank the audience for attending Mauro's presentation, and also for your questions. If we weren't able to answer your question today, we'll respond via email. And again, our engineers will be on the Vertica forums awaiting your other questions. It would help us greatly if you could give us some feedback and rate the session before you sign off; your rating will guide us when we're looking at content to provide for the next Vertica BDC. Also note that a replay of today's event and a PDF copy of the slides will be available on demand; we'll let you know when by email, hopefully later this week. And of course, we invite you to share the content with your colleagues. Again, thank you for your participation today. This concludes this breakout session, and we hope you have a wonderful day. Thank you.

>> Thank you.
Published Date: Mar 30, 2020
