Rich Gaston, Micro Focus | Virtual Vertica BDC 2020
(upbeat music) >> Announcer: It's theCUBE covering the virtual Vertica Big Data Conference 2020 brought to you by Vertica. >> Welcome back to the Vertica Virtual Big Data Conference, BDC 2020. You know, it was supposed to be a physical event in Boston at the Encore. Vertica pivoted to a digital event, and we're pleased that The Cube could participate because we've participated in every BDC since the inception. Rich Gaston this year is the global solutions architect for security risk and governance at Micro Focus. Rich, thanks for coming on, good to see you. >> Hey, thank you very much for having me. >> So you got a chewy title, man. You got a lot of stuff, a lot of hairy things in there. But maybe you can talk about your role as an architect in those spaces. >> Sure, absolutely. We handle a lot of different requests from the global 2000 type of organization that will try to move various business processes, various application systems, databases, into new realms. Whether they're looking at opening up new business opportunities, whether they're looking at sharing data with partners securely, they might be migrating it to cloud applications, and doing migration into a Hybrid IT architecture. So we will take those large organizations and their existing installed base of technical platforms and data, users, and try to chart a course to the future, using Micro Focus technologies, but also partnering with other third parties out there in the ecosystem. So we have large, solid relationships with the big cloud vendors, with also a lot of the big database spenders. Vertica's our in-house solution for big data and analytics, and we are one of the first integrated data security solutions with Vertica. We've had great success out in the customer base with Vertica as organizations have tried to add another layer of security around their data. So what we will try to emphasize is an enterprise wide data security approach, where you're taking a look at data as it flows throughout the enterprise from its inception, where it's created, where it's ingested, all the way through the utilization of that data. And then to the other uses where we might be doing shared analytics with third parties. How do we do that in a secure way that maintains regulatory compliance, and that also keeps our company safe against data breach. >> A lot has changed since the early days of big data, certainly since the inception of Vertica. You know, it used to be big data, everyone was rushing to figure it out. You had a lot of skunkworks going on, and it was just like, figure out data. And then as organizations began to figure it out, they realized, wow, who's governing this stuff? A lot of shadow IT was going on, and then the CIO was called to sort of reign that back in. As well, you know, with all kinds of whatever, fake news, the hacking of elections, and so forth, the sense of heightened security has gone up dramatically. So I wonder if you can talk about the changes that have occurred in the last several years, and how you guys are responding. >> You know, it's a great question, and it's been an amazing journey because I was walking down the street here in my hometown of San Francisco at Christmastime years ago and I got a call from my bank, and they said, we want to inform you your card has been breached by Target, a hack at Target Corporation and they got your card, and they also got your pin. And so you're going to need to get a new card, we're going to cancel this. Do you need some cash? 
I said, yeah, it's Christmastime so I need to do some shopping. And so they worked with me to make sure that I could get that cash, and then get the new card and the new pin. And being a professional in the inside of the industry, I really questioned, how did they get the pin? Tell me more about this. And they said, well, we don't know the details, but you know, I'm sure you'll find out. And in fact, we did find out a lot about that breach and what it did to Target. The impact that $250 million immediate impact, CIO gone, CEO gone. This was a big one in the industry, and it really woke a lot of people up to the different types of threats on the data that we're facing with our largest organizations. Not just financial data; medical data, personal data of all kinds. Flash forward to the Cambridge Analytica scandal that occurred where Facebook is handing off data, they're making a partnership agreement --think they can trust, and then that is misused. And who's going to end up paying the cost of that? Well, it's going to be Facebook at a tune of about five billion on that, plus some other finds that'll come along, and other costs that they're facing. So what we've seen over the course of the past several years has been an evolution from data breach making the headlines, and how do my customers come to us and say, help us neutralize the threat of this breach. Help us mitigate this risk, and manage this risk. What do we need to be doing, what are the best practices in the industry? Clearly what we're doing on the perimeter security, the application security and the platform security is not enough. We continue to have breaches, and we are the experts at that answer. The follow on fascinating piece has been the regulators jumping in now. First in Europe, but now we see California enacting a law just this year. They came into a place that is very stringent, and has a lot of deep protections that are really far-reaching around personal data of consumers. Look at jurisdictions like Australia, where fiduciary responsibility now goes to the Board of Directors. That's getting attention. For a regulated entity in Australia, if you're on the Board of Directors, you better have a plan for data security. And if there is a breach, you need to follow protocols, or you personally will be liable. And that is a sea change that we're seeing out in the industry. So we're getting a lot of attention on both, how do we neutralize the risk of breach, but also how can we use software tools to maintain and support our regulatory compliance efforts as we work with, say, the largest money center bank out of New York. I've watched their audit year after year, and it's gotten more and more stringent, more and more specific, tell me more about this aspect of data security, tell me more about encryption, tell me more about money management. The auditors are getting better. And we're supporting our customers in that journey to provide better security for the data, to provide a better operational environment for them to be able to roll new services out with confidence that they're not going to get breached. With that confidence, they're not going to have a regulatory compliance fine or a nightmare in the press. And these are the major drivers that help us with Vertica sell together into large organizations to say, let's add some defense in depth to your data. And that's really a key concept in the security field, this concept of defense in depth. 
We apply that to the data itself by changing the actual data element of Rich Gaston, I will change that name into Ciphertext, and that then yields a whole bunch of benefits throughout the organization as we deal with the lifecycle of that data. >> Okay, so a couple things I want to mention there. So first of all, totally board level topic, every board of directors should really have cyber and security as part of its agenda, and it does for the reasons that you mentioned. The other is, GDPR got it all started. I guess it was May 2018 that the penalties went into effect, and that just created a whole Domino effect. You mentioned California enacting its own laws, which, you know, in some cases are even more stringent. And you're seeing this all over the world. So I think one of the questions I have is, how do you approach all this variability? It seems to me, you can't just take a narrow approach. You have to have an end to end perspective on governance and risk and security, and the like. So are you able to do that? And if so, how so? >> Absolutely, I think one of the key areas in big data in particular, has been the concern that we have a schema, we have database tables, we have CALMS, and we have data, but we're not exactly sure what's in there. We have application developers that have been given sandbox space in our clusters, and what are they putting in there? So can we discover that data? We have those tools within Micro Focus to discover sensitive data within in your data stores, but we can also protect that data, and then we'll track it. And what we really find is that when you protect, let's say, five billion rows of a customer database, we can now know what is being done with that data on a very fine grain and granular basis, to say that this business process has a justified need to see the data in the clear, we're going to give them that authorization, they can decrypt the data. Secure data, my product, knows about that and tracks that, and can report on that and say at this date and time, Rich Gaston did the following thing to be able to pull data in the clear. And that could be then used to support the regulatory compliance responses and then audit to say, who really has access to this, and what really is that data? Then in GDPR, we're getting down into much more fine grained decisions around who can get access to the data, and who cannot. And organizations are scrambling. One of the funny conversations that I had a couple years ago as GDPR came into place was, it seemed a couple of customers were taking these sort of brute force approach of, we're going to move our analytics and all of our data to Europe, to European data centers because we believe that if we do this in the U.S., we're going to violate their law. But if we do it all in Europe, we'll be okay. And that simply was a short-term way of thinking about it. You really can't be moving your data around the globe to try to satisfy a particular jurisdiction. You have to apply the controls and the policies and put the software layers in place to make sure that anywhere that someone wants to get that data, that we have the ability to look at that transaction and say it is or is not authorized, and that we have a rock solid way of approaching that for audit and for compliance and risk management. And once you do that, then you really open up the organization to go back and use those tools the way they were meant to be used. 
We can use Vertica for AI, we can use Vertica for machine learning, and for all kinds of really cool use cases that are being done with IoT, with other kinds of cases that we're seeing that require data being managed at scale, but with security. And that's the challenge, I think, in the current era: how do we do this in an elegant way? How do we do it in a way that's future-proof when CCPA comes in? How can I lay this on as another layer of audit responsibility and control around my data so that I can satisfy those regulators as well as the folks over in Europe and Singapore and China and Turkey and Australia? It goes on and on. Each jurisdiction out there is now requiring audit. And like I mentioned, the audits are getting tougher. And if you read the news, the GDPR example I think is classic. They told us in 2016, it's coming. They told us in 2018, it's here. They're telling us in 2020, we're serious about this, and here are the fines, and you'd better be aware that we're coming to audit you. And when we audit you, we're going to be asking some tough questions. If you can't answer those in a timely manner, then you're going to be facing some serious consequences, and I think that's what's getting attention. >> Yeah, so the whole big data thing started with Hadoop, and Hadoop is open, it's distributed, and it just created a real governance challenge. I want to talk about your solutions in this space. Can you tell us more about Micro Focus Voltage? I want to understand what it is, and then get into sort of how it works, and then I really want to understand how it's applied to Vertica. >> Yeah, absolutely, that's a great question. First of all, we were the originators of format-preserving encryption. We developed some of the core basic research out of Stanford University that then became the company Voltage, a brand name that we still apply even though we're part of Micro Focus. So the lineage still goes back to Dr. Boneh down at Stanford, one of my buddies there, and he's still at it, doing amazing work in cryptography and keeping the industry, and the science of cryptography, moving forward. It's a very deep science, and we all want to have it peer-reviewed, we want it to be attacked, we want it to be proved secure, so that we're not selling something to a major money center bank that is potentially risky because it's obscure and we're keeping it private. So we have an open standard. For six years, we worked with the Department of Commerce to get our standard approved by NIST, the National Institute of Standards and Technology. They initially said, well, AES-256 is going to be fine. And we said, well, it's fine for certain use cases, but for your database, you don't want to change your schema, you don't want to have this increase in storage costs. What we want is format-preserving encryption. And what that does is turn my name, Rich, into a four-letter ciphertext. It can be reversed. The mathematics of that are fascinating, and really deep and amazing. But we really make that very simple for the end customer because we produce APIs. So these application programming interfaces can be accessed by applications in C or Java, C#, other languages. But they can also be accessed in a microservice manner via REST and web service APIs. And that's the core of our technical platform.
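To make the format-preserving idea concrete, here is a minimal sketch of what that round trip can look like when the Voltage APIs are surfaced inside Vertica as SQL functions. The function names (VoltageSecureProtect, VoltageSecureAccess) and the format parameter follow the integration described in this interview, but treat the exact signatures as assumptions to be checked against the SecureData documentation for your release.

```sql
-- Hypothetical sketch: format-preserving encryption round trip in Vertica SQL.
-- Function names and parameters are assumptions based on the integration
-- described above, not a definitive reference.

-- Protect a value: 'Rich' comes back as a same-length, same-type ciphertext,
-- so the column schema and storage footprint do not change.
SELECT VoltageSecureProtect('Rich' USING PARAMETERS format='first_name') AS protected_name;

-- Authorized callers reverse it with the matching access call;
-- the appliance checks the caller's identity before returning cleartext.
SELECT VoltageSecureAccess(protected_name USING PARAMETERS format='first_name') AS clear_name
FROM   (SELECT VoltageSecureProtect('Rich' USING PARAMETERS format='first_name') AS protected_name) t;
```

Because the ciphertext preserves the original format, joins, GROUP BYs, and referential lookups keep working on the protected values without ever decrypting them.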
We have an appliance-based approach, so we take a SecureData appliance and we'll put it on-prem; we'll make 50 of them if you're a big company like Verizon and you need to have these co-located around the globe, no problem; we can scale to the largest enterprise needs. But our typical customer will install several appliances and get going with a couple of environments, like QA and prod, to be able to start getting encryption going inside their organization. Once the appliances are set up and installed, it takes just a couple of days of work for a typical technical staff to get done. Then you're up and running and able to plug in the clients. Now what are the clients? Vertica's a huge one. Vertica's one of our most powerful client endpoints because you're able to now take that API and put it inside Vertica, and it's all open on the internet; you can go and look at the SecureData integration pages on vertica.com and get all of our documentation on it. You understand how to use it very quickly. The APIs are super simple; they require three parameter inputs. It's a really basic approach to being able to protect and access data. And then it gets very deep from there, because you have data like credit card numbers, very different from a street address, and we want to take a different approach to that. We have data like birthdate, and we want to be able to do analytics on dates. We have deep approaches to managing analytics on protected data like dates without having to put it in the clear. So we've maintained a lead in the industry in terms of being an innovator of the FF1 standard; what we call FF1 is format-preserving encryption. We license that to others in the industry, per our NIST agreement. So we're the owner, we're the operator of it, and others use our technology. And we're the original founders of that, and so we continue to sort of lead the industry by adding additional capabilities on top of FF1 that really differentiate us from our competitors. Then you look at our API presence. We can definitely run in Hadoop, but we also run in open systems. We run on mainframe, we run on mobile. So anywhere in the enterprise or in the cloud, anywhere you want to be able to put SecureData and be able to access the protected data, we're going to be there and be able to support you there. >> Okay so, I've talked to a lot of customers this week, and let's say I'm running in Eon mode. And I've got some workload running in AWS, I've got some on-prem. I'm going to take an appliance or multiple appliances, I'm going to put it on-prem, but that will also secure my cloud workloads as part of a sort of shared responsibility model, for example? Or how does that work? >> No, that's absolutely correct. We're really flexible in that we can run on-prem or in the cloud as far as our crypto engine goes. The key management is really hard stuff, cryptography is really hard stuff, and we take care of all that, so we've baked all that in, and we can run that for you as a service either in the cloud or on-prem on your small VMs. So it's a really lightweight footprint for you running your infrastructure. When I look at an organization like you just described, it's a classic example of where we fit, because we will be able to protect that data. Let's say you're ingesting it from a third party, or from an operational system; you have a website that collects customer data. Someone has now registered as a new customer, and they're going to do e-commerce with you. We'll take that data, and we'll protect it right at the point of capture. And we can now flow that through the organization and decrypt it at will on any platform that you have that you need us to be able to operate on. So let's say you wanted to pick up that customer data from the operational transaction system, let's throw it into Eon, let's throw it into the cloud, let's do analytics there on that data, and we may need some decryption. We can place SecureData wherever you want to be able to service that use case. In most cases, what you're doing is a simple, tiny little atomic fetch across a protected tunnel, your typical TLS tunnel. And once that key is then cached within our client, we maintain all that technology for you. You don't have to know about key management or caching. We're good at that; that's our job. And then you'll be able to make those API calls to access or protect the data, and apply the authorization and authentication controls that you need to be able to service your security requirements. So you might have third parties having access to your Vertica clusters. That is a special need, and we have that ability to say employees can get X, and the third party can get Y, and that's a really interesting use case we're seeing for shared analytics on the internet now. >> Yeah, for sure, so you can set the policy how you want. You know, I have to ask you, in a perfect world, I would encrypt everything. But part of the reason why people don't is because of performance concerns. Can you talk about, and you touched upon it I think recently with your sort of atomic access, but can you talk about, and I know it's Vertica, it's a Ferrari, etc., but anything that slows it down is going to be a concern. Are customers concerned about that? What are the performance implications of running encryption on Vertica? >> Great question there as well, and what we see is that we want to be able to apply scale where it's needed. And so if you look at the ingest platforms that we find, Vertica is commonly connected up to something like Kafka, maybe StreamSets, maybe NiFi; there are a variety of different technologies that can route that data, pipe that data into Vertica at scale. SecureData is architected to go along with that architecture at the node, at the executor, or at the lowest-level operator level. And what I mean by that is that we don't have a bottleneck where everything has to go through one process or one box or one channel to be able to operate. We don't put an interceptor in between your data coming and going. That's not our approach, because those approaches are fragile and they're slow. So we typically want to focus on integrating our APIs natively within those pipeline processes that come into Vertica. Within the Vertica ingestion process itself, you can simply apply our protection when you do the copy command in Vertica. So it's a really basic, simple use case that everybody is typically familiar with in Vertica land: be able to copy the data and put it into Vertica, and you simply say protect as part of that. So my first name is coming in as part of this ingestion; I'll simply put the protect keyword in the syntax right in SQL. It's nothing other than just an extension of SQL. Very, very simple for the developer, easy to read, easy to write. And then you're going to provide the parameters that you need to say, oh, the name is protected with this kind of a format, to differentiate between a credit card number and an alphanumeric string, for example. So once you do that, you then have the ability to decrypt.
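Gaston's copy-command description maps naturally onto Vertica's ability to apply an expression per column during COPY. The sketch below shows the pattern: raw values land as FILLER fields and only the protected form gets stored, and an authorized consumer later decrypts with the access call in an ordinary SELECT. The function names, format labels, table, and columns are assumptions for illustration, not the exact SecureData syntax.

```sql
-- Hypothetical sketch: protect on ingest, access on read.
-- UDx function names and format labels are assumptions based on the
-- integration described in the interview.

-- Ingest: the clear value is read as a FILLER column and only the
-- protected (format-preserving) ciphertext is written to the table.
COPY customers (
    ssn_raw    FILLER VARCHAR(11),
    first_raw  FILLER VARCHAR(50),
    ssn        AS VoltageSecureProtect(ssn_raw   USING PARAMETERS format='ssn'),
    first_name AS VoltageSecureProtect(first_raw USING PARAMETERS format='first_name')
)
FROM 's3://ingest-bucket/customers/*.csv' DELIMITER ',';

-- Read: analysts query the protected values directly; a business process
-- with a justified need decrypts inline, and that access is logged.
SELECT VoltageSecureAccess(ssn USING PARAMETERS format='ssn') AS ssn_clear
FROM   customers
WHERE  account_status = 'open';
```

Authorization then reduces to standard grants: the role that may decrypt gets execute rights on the access function and a matching identity on the appliance, while everyone else sees only ciphertext.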
Now, on decrypt, let's look at a couple of different use cases. First, within Vertica, we might be doing select statements, we might be doing all kinds of jobs within Vertica that just operate at the SQL layer. Again, just insert the word "access" into the Vertica select string and provide us with the data that you want to access; that's our word for decryption, that's our lingo. And we will then, at the Vertica level, harness the power of its CPU, its RAM, its horsepower at the node to be able to operate on that operation, the decryption request, if you will. So that gives us the speed and the ability to scale out. So if you start with two nodes of Vertica, we're going to operate at X number of hundreds of thousands of transactions a second, depending on what you're doing. Long strings are a little bit more intensive in terms of performance, but short strings like a Social Security number are our sweet spot. So we operate at very, very high speed on that, and you won't notice the overhead with Vertica, per se, at the node level. When you scale Vertica up and you have 50 nodes, and you have large clusters of Vertica resources, then we scale with you. And we're not a bottleneck at any particular point. Everybody's operating independently, but they're all copies of each other, all doing the same operation: fetch a key, do the work, go to sleep. >> Yeah, you know, I think this is, a lot of the customers have said to us this week that one of the reasons why they like Vertica is it's very mature, it's been around, it's got a lot of functionality, and of course, you know, look, security, I understand it's kind of table stakes, but it also can be a differentiator. You know, big enterprises that you sell to, they're asking for security assessments, SOC 2 reports, penetration testing, and I think I'm hearing, with the partnership here, you're sort of passing those with flying colors. Are you able to make security a differentiator, or is it just sort of everybody's kind of got to have good security? What are your thoughts on that? >> Well, there's good security, and then there's great security. And what I found with one of my money center bank customers, which is based here in San Francisco, was the concern around insider access when they had a large data store. And the concern that a DBA, a database administrator who has privilege to everything, could potentially exfiltrate data out of the organization, and in one fell swoop create havoc for them because of the amount of data that was present in that data store, and the sensitivity of that data in the data store. So when you put Voltage encryption on top of Vertica, what you're doing now is putting a layer in place that would prevent that kind of a breach. So you're looking at insider threats, you're looking at external threats, and you're looking at also being able to pass your audit with flying colors. The audits are getting tougher. And when they say, tell me about your encryption, tell me about your authentication scheme, show me the access control list that says that this person can or cannot get access to something, they're asking tougher questions. That's where SecureData can come in and give you that quick answer: it's encrypted at rest, it's encrypted and protected while it's in use, and we can show you exactly who's had access to that data because it's tracked via a different layer, a different appliance. And I would even draw the analogy, many of our customers use a device called a hardware security module, an HSM. Now, these are fairly expensive devices that were invented for military applications and adopted by banks. And now they're really spreading out, and people say, do I need an HSM? Well, with SecureData, we certainly protect your crypto very, very well. We have very, very solid engineering. I'll stand on that any day of the week, but your auditor is going to want to ask a checkbox question: do you have an HSM? Yes or no. Because the auditor understands it's another layer of protection, and it provides another tamper-evident layer of protection around your key management and your crypto. And we, as professionals in the industry, nod and say, that is worth it. That's an expensive option that you're going to add on, but your auditor's going to want it. If you're in financial services, you're dealing with PCI data, you're going to enjoy the checkbox that says, yes, I have HSMs, and not get into some arcane conversation around, well, no, but it's good enough. That's kind of the argument and conversation we get into when folks want to say, Vertica has great security, Vertica's fantastic on security, why would I want SecureData as well? It's another layer of protection, and it's defense in depth for your data. When you believe in that, when you take security really seriously, and you're really paranoid, like a person like myself, then you're going to invest in those kinds of solutions that get you best-in-class results. >> So I'm hearing a data-centric approach to security. Security experts will tell you, you've got to layer it. I often say, we live in a new world. The king used to just build a moat around the queen, but the queen, she's leaving her castle in this world of distributed data. Rich, incredibly knowledgeable guest, and I really appreciate you being on the front lines and sharing with us your knowledge about this important topic. So thanks for coming on theCUBE. >> Hey, thank you very much. >> You're welcome, and thanks for watching everybody. This is Dave Vellante for theCUBE, with wall-to-wall coverage of the Virtual Vertica BDC, the Big Data Conference. Remotely, digitally, thanks for watching. Keep it right there. We'll be right back right after this short break. (intense music)
ENTITIES
Entity | Category | Confidence |
---|---|---|
Australia | LOCATION | 0.99+ |
Europe | LOCATION | 0.99+ |
Target | ORGANIZATION | 0.99+ |
Verizon | ORGANIZATION | 0.99+ |
Vertica | ORGANIZATION | 0.99+ |
ORGANIZATION | 0.99+ | |
Dave Vellante | PERSON | 0.99+ |
May 2018 | DATE | 0.99+ |
NIST | ORGANIZATION | 0.99+ |
2016 | DATE | 0.99+ |
Boston | LOCATION | 0.99+ |
2018 | DATE | 0.99+ |
San Francisco | LOCATION | 0.99+ |
New York | LOCATION | 0.99+ |
Target Corporation | ORGANIZATION | 0.99+ |
$250 million | QUANTITY | 0.99+ |
50 | QUANTITY | 0.99+ |
Rich Gaston | PERSON | 0.99+ |
Singapore | LOCATION | 0.99+ |
Turkey | LOCATION | 0.99+ |
Ferrari | ORGANIZATION | 0.99+ |
six years | QUANTITY | 0.99+ |
2020 | DATE | 0.99+ |
one box | QUANTITY | 0.99+ |
China | LOCATION | 0.99+ |
C | TITLE | 0.99+ |
Stanford University | ORGANIZATION | 0.99+ |
Java | TITLE | 0.99+ |
First | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
U.S. | LOCATION | 0.99+ |
this week | DATE | 0.99+ |
National Institute of Standards and Technology | ORGANIZATION | 0.99+ |
Each jurisdiction | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
Vertica | TITLE | 0.99+ |
Rich | PERSON | 0.99+ |
this year | DATE | 0.98+ |
Vertica Virtual Big Data Conference | EVENT | 0.98+ |
one channel | QUANTITY | 0.98+ |
one process | QUANTITY | 0.98+ |
GDPR | TITLE | 0.98+ |
SQL | TITLE | 0.98+ |
five billion rows | QUANTITY | 0.98+ |
about five billion | QUANTITY | 0.97+ |
One | QUANTITY | 0.97+ |
C sharp | TITLE | 0.97+ |
Boneh | PERSON | 0.97+ |
first | QUANTITY | 0.96+ |
four-letter | QUANTITY | 0.96+ |
Vertica Big Data Conference 2020 | EVENT | 0.95+ |
Hadoop | TITLE | 0.94+ |
Kafka | TITLE | 0.94+ |
Micro Focus | ORGANIZATION | 0.94+ |
Colin Mahony, Vertica at Micro Focus | Virtual Vertica BDC 2020
>> It's theCUBE covering the Virtual Vertica Big Data Conference 2020, brought to you by Vertica. >> Hello, everybody. Welcome to the new normal. You're watching theCUBE and its remote coverage of the Vertica big data event, gone digital, gone virtual. My name is Dave Volante, and I'm here with Colin Mahony, who's a senior vice president at Micro Focus and the GM of Vertica. Colin, well, strange times, but the show goes on. Great to see you again. >> Good to see you too, Dave. Yeah, strange times indeed. Obviously, safety first for everyone, so we made a decision to go virtual. I think it was absolutely the right call, made in advance of how things have transpired, but we're making the best of it, and I appreciate your time here, going virtual with us. >> Well, you know, we're super excited to be here. As you know, theCUBE has been at every single BDC since its inception. It's a great event. You just presented the keynote to your audience. You know, it was remote, you didn't have that live vibe, and you have a lot of fans in the Vertica community, but could you feel the love? >> Yeah, you know, it's hard to feel the love virtually, but I'll tell you what, the silver lining in all this is that the reach we have for this event now is much broader than it would have been, as, you know, we brought this event back. It's been a few years since we've done it. We were super excited to do it, obviously, in Boston, where it was supposed to be on location, but there wouldn't have been as many people that could participate. So the silver lining in all of this is that I think there's a lot of love out there, and we're getting to a lot of participants who otherwise would not have been able to participate in this, both live as well as with a lot of these assets that we're going to have available. So, you know, it's out there. We've got amazing customers and practitioners with Vertica. We've got so many that have been with us for a long time, and we of course have a lot of new customers as well that we're welcoming, so it's exciting. >> Well, it's been a while since you've had the BDC event, and a lot has transpired. You're now part of Micro Focus, but I know you and I know the Vertica team: you guys have not stopped. You've kept the innovation going. We've been following the announcements, but bridge the gap between the last time we had coverage of this event and where we are today. A lot has changed. >> Oh, yeah, a lot. A lot has changed. I mean, you know, it's the software industry, right? So nothing stays the same. We constantly have to keep going. Probably the only thing that stays the same is the name Vertica. And, you know, we're now shipping Vertica 10, which is just a phenomenal release for us. So, you know, overall, the organization continues to grow. The dedication and commitment to this great platform of Vertica continues with every single release we do, as you know, and this hasn't changed: it's always about performance and scale and adding a whole bunch of new capabilities on that front. But it's also about our main roadmap and the direction that we're going towards. And I think one of the things that has been great about it is that we've stayed true to that from day one. We haven't tried to deviate too much and get into things that are too far outside of our box. But we've really done, I think, a great job of extending Vertica into places where people need a lot of help. And with Vertica 10, and we're going to talk more about that, we've done a lot of that. It's super exciting for our customers, and all of this, of course, is driven by our customers. But back to the Big Data Conference. You know, everybody has been saying this for years: it was one of the best conferences we've been to, because really it's developers giving tech talks, it's customers giving talks. And we had more customers that wanted to give talks than we had slots to fill this year at the event, which is another benefit, a little bit, of going virtual: we can accommodate a little bit more, though obviously it's still a tight schedule. But it really was an opportunity for our community to come together and talk about not just Vertica, but how to deal with data. You know, we know the volumes aren't slowing down, we know the complexity isn't slowing down, and the things that people want to do with AI and machine learning are moving forward at a rapid pace as well. There's a lot to talk about and share, and that's really a huge part of what we try to do with it. >> Well, let's get into some of that. Your customers are making bets. Micro Focus is actually making a bet on Vertica. I want to get your perspective on the waves that you're riding and where you are placing your bets. >> Yeah, no, it's great. So, you know, I think that one of the waves that we've been riding for a long time, obviously, Vertica started out as a SQL platform for analytics, as a SQL database engine, a relational engine. But we always knew that was just sort of the start of what we wanted to do. People were going to trust us to put enormous amounts of data in our platform, and what we owe everyone else is lots of analytics to take advantage of that data, and lots of tools and capabilities to shape that data, to get it into the right format, for operational reporting but also, in this day and age, for machine learning and for some pretty advanced regressions and other techniques. So a huge part of Vertica 10 is just doubling down on that commitment to what we call in-database machine learning and AI. And to do that, you know, we know that we're not going to come up with the world's best algorithms, nor is that our focus. Our advantage is we have this massively parallel platform to ingest, store, manage and analyze the data. So we made some announcements about incorporating PMML models into the product. We continue to deepen our Python integration, building off of a new open source project we started with Uber, which has been a great customer and partner on this; it's one of our great talks here at the event. So, you know, we're continuing to do that, and it turns out that when it comes to anything analytics or machine learning, so much of what you have to do is actually prepare the data, shape the data, get the data in the right format, apply the model, fit the model, test the model, operationalize the model, and Vertica is a great platform to do that. So that's a huge bet that we're continuing to ride on and take advantage of. And then some of the other things that we've just been seeing continue: I'll take object storage as an example. I think Hadoop, and what Hadoop went through, ultimately was a huge part of this, but there's just a massive disruption going on in the world around object storage. You know, we made several bets on S3 early: we created Vertica Eon Mode, which separates compute and storage. And so for us that separation is not just about being able to take advantage of cloud economics, as we do, or the economics of object storage. It's also about being able to truly isolate workloads and start to set up the sort of platform to be able to do very autonomous things in the database; the database could actually start self-analyzing without impacting the operational workloads. And so that continues with our partnership with Pure Storage on premise. We just announced that we're supporting Eon on Google Cloud now, in addition to Amazon, and we've got HDFS now being supported by our Eon mode. So we continue to ride on that mega trend as well, just the clouds in general, whether it's a public cloud or a private cloud on premise. Giving our customers the flexibility and choice to run wherever it makes sense for them is something that we are very committed to from a flexibility standpoint. There are a lot of lock-in products out there, there are a lot of cloud-only products, now more than ever, and we're hearing from our customers that they want the flexibility to be able to run anywhere. They want the ease of use and simplicity of native cloud experiences, which we're giving them as well. >> I want to stay on that architectural component for a minute. Talk about separating compute from storage: it's not just about economics. I mean, part of it is that you can granularly scale compute separate from storage, as opposed to in chunks. It's more efficient, but you're saying there are other advantages, to operational and workload specificity. What is unique about Vertica in this regard? Many others separate compute from storage; what's different about Vertica? >> Yeah, I think, you know, there are a lot of differences in how we do it. It's one thing if you're a cloud-native company: you do it and you have a shared catalog, a key-value store that all of your customers are using and are on the same one. Frankly, it's probably more of a security concern than anything. But it's another thing when you give that capability to each customer on their own, so they're fully protected, they're not sharing it with any other customers. And that's something where we hear a lot of insights from our customers: they want to be able to separate compute and storage, but they want to be able to do this in their own environment, so that they know that in their data catalog there's no one else sharing that catalog and there's no single point of failure. So that's one huge advantage that we have, and frankly, I think it just comes from being a company that operates on premise and up in the cloud. I think another huge advantage for us is we don't know which object storage platform is going to win, nor do we necessarily have to. We designed Eon Mode so that it's an SDK. We started with S3, but it could be anything: HDFS, who knows what object storage formats are going to be there. And then finally, beyond just the object storage, we're really one of the only database companies that actually allows our customers to natively operate on data in very different formats, like Parquet and ORC, if you're familiar with those in the Hadoop community. So we not only embrace this kind of object storage disruption, but we really embrace the different data formats. And what that means is, our customers that have data pipelines, you know, fully automated, putting this information in different places, they don't have to completely reload everything to take advantage of the Vertica analytics. We can go where the data is, connect into it, and we offer them a lot of different ways to take advantage of those analytics. So there are a couple of unique differences with Vertica, and again, I think our real advantage, in many ways, from not being a cloud-only platform, is that we're very good at operating in different environments, with different formats, with changing formats over time. And I don't think a lot of the other companies out there can do that; I think many, particularly many of the SaaS companies, are scrambling. They even have challenges moving from, say, an Amazon environment to a Microsoft Azure environment with their offering, because they've got so much unique Band-Aid, excuse me, in the background just holding the system up that is native to any one of those. >> Good, I'm going to summarize. I'm hearing from you: you're the Ferrari of databases, as we've always known; you're object-store agnostic; it's the cloud experience that you can bring on-prem or to virtually any cloud, all the popular clouds, hybrid, you know, AWS, Azure, now Google, or on-prem, and in a variety of different data formats. And that combination, I think, is unique in the marketplace. Before we get into the news, I want to ask you about data silos. You mentioned HDFS; where you and I met was back in the early days of big data. You know, in some respects Hadoop helped break down the silos by distributing the data and leaving it in place, and in other respects it created data lakes, which became silos. And so we have yet all these other silos, and people are trying to get to a digital transformation, meaning putting data at their core, virtually, obviously, and leaving it in place. What are your thoughts on that, in terms of data being a silo buster? How does Vertica play there? >> Yeah, and you're absolutely right. I think even if you look at Hadoop, for all the new data that got into Hadoop, in many ways it's created yet another large island of data that many organizations are struggling with, because it's separate from their core traditional data warehouse, it's separate from some of the operational systems that they have. So there might be a lot of data in there, but they're still struggling with how do I break it out of that large silo, or combine it. Again, I think some of the things that Vertica does, and part of the announcements in 10, is migration tools to make it really easy if you do want to move data from one platform to another, into Vertica. But you don't have to move it: you can actually take advantage of a lot of the data where it resides with Vertica, especially in the Hadoop realm, with our external table storage and our built-in ORC and Parquet readers, natively. So we're very pragmatic about how our customers go about this. Very few customers, and many of them tried it with Hadoop and realized it didn't work, want to do a wholesale "we're going to throw everything out, we're going to get rid of our data warehouse, we're going to hit the pause button and go from there." It's just not possible to do that. So we've spent a lot of time investing in the product to really work with them to go where the data is, and then seamlessly migrate when it makes sense to migrate. You mentioned the performance of Vertica, and you talked about the variety. It definitely is.
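As a rough illustration of the "go where the data is" point, this is the shape of an external table definition over Parquet files sitting in object storage; the same pattern works for ORC or for HDFS paths. The bucket name and columns are made up, and the exact options worth setting (credentials, partition columns) depend on your environment, so treat this as a sketch rather than a reference.

```sql
-- Hypothetical sketch: query Parquet in place, no reload into Vertica.
CREATE EXTERNAL TABLE web_events (
    event_time   TIMESTAMP,
    user_id      INT,
    event_type   VARCHAR(64),
    payload      VARCHAR(4096)
)
AS COPY FROM 's3://analytics-lake/web_events/*.parquet' PARQUET;

-- Analysts query it like any other table; Vertica reads the files at query time.
SELECT event_type, COUNT(*) AS events
FROM   web_events
WHERE  event_time >= '2020-01-01'
GROUP  BY event_type
ORDER  BY events DESC;
```

If the same data later needs Vertica-native performance, a plain INSERT ... SELECT from the external table into a managed table is the migration path, which matches the pragmatic "migrate when it makes sense" approach described above.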
And one other thing that we're really proud of is that it actually is not a gas guzzler, either. One of the things that we're seeing with a lot of the other cloud databases: pound for pound, on a tenth of the hardware, with Vertica running up there you get over 10x the performance. We're seeing that a lot. So it's not just about the performance, it's about the efficiency as well. And I think that efficiency is really important when it comes to silos, because there's just only so much horsepower out there. And it's easy for companies to play tricks and throw lots of servers at the environment when they start out, but so many organizations in the cloud, frankly, looking at the bills they're getting from these cloud workloads that are running, they're really conscious of that. >> Yeah, the big, big energy companies love the gas guzzlers. A lot of cloud compute, too. But let's get into the news. Vertica 10.0, you shared it with your audience in your keynote, one of the highlights of the day. What do we need to know? >> Yeah, so, you know, again doubling down on these mega trends, I'll start with machine learning and AI. We've done a lot of work to integrate so that you can take native PMML models, bring them into Vertica, run them massively parallel, and help shape your data and prepare it, do all the work that we know is required for true machine learning. And for all the hype that there is around it, this is real: people want to do a lot of unsupervised machine learning, whether it's for healthcare, fraud detection, financial services. So we've doubled down on that. We now also support things like TensorFlow, and, you know, as I mentioned, we're not going to come up with the best algorithms. Our job is really to ensure that the algorithms people are coming up with can be incorporated, and that we can run them against massive data sets super efficiently. So that's number one. Number two, on object storage: we continue to support more object storage platforms for Eon Mode. In the cloud we're expanding to Google GCP, Google's cloud, beyond just Amazon, and now we're also supporting HDFS with Eon. Of course, we continue to have a great relationship with our partner Pure Storage on premise, and we continue to invest in Eon Mode especially. I'm not going to go through all the different things here, but it's not just sort of, hey, you support this and then you move on. There are so many different things that we learn about API calls and how to save our customers money, and tricks on performance, and things. On the third area, we definitely continue to build on that flexibility of deployment, which is related to Eon Mode, as I described, but it's also about simplicity. It's also about some of the migration tools that we've announced to make it easy to go from one platform to another. We have a great roadmap on ease of use, on security, on performance and scale. I mean, for us, those are the things that we're working on every single release. We probably don't talk about them as much as we need to, but obviously they're critically important. And so we constantly look at every component in this product. You know, Version 10 is a huge release for any product, especially an analytic database platform, and so we're just constantly revisiting some of the code base and figuring out how we can do it in new and better ways. And that's a big part of 10 as well.
But I think you're talking about Optionality, maybe different consumption models. That am I getting that right and you share >>your difficult in that right? And actually, I'm glad you wrote something. I think a huge part of Cloud is also has nothing to do with the technology. I think it's how you and seeing the product. Some companies want to rent the product and they want to rent it for a certain period of time. And so we allow our customers to do that. We have incredibly flexible models of how you provision and purchase our product, and I think that helps a lot. You know, I am opening the door Ah, a little bit. But look, we have customers that ask us that we're in offer them or, you know, we can offer them platforms, brawl in. We've had customers come to us and say please take over systems, um, and offer something as a distribution as I said, though I think one thing that we've been really good at is focusing on on what is our core and where we really offer offer value. But I can tell you that, um, we introduced something called the Verdict Advisor Tool this year. One of the things that the Advisor Tool does is it collects information from our customer environments on premise or the cloud, and we run through our own machine learning. We analyze the customer's environment and we make some recommendations automatically. And a lot of our customers have said to us, You know, it's funny. We've tried managed service, tried SAS off, and you guys blow them away in terms of your ability to help us, like automatically managed the verdict, environment and the system. Why don't you guys just take this product and converted into a SAS offering, so I won't go much further than that? But you can imagine that there's a lot of innovation and a lot of thoughts going into how we can do that. But there's no reason that we have to wait and do that today and being able to offer our customers on premise customers that same sort of experience from a managed capability is something that we spend a lot of time thinking about as well. So again, just back to the automation that ease of use, the going above and beyond. Its really excited to have an analytic platform because we can do so much automation off ourselves. And just like we're doing with Perfect Advisor Tool, we're leveraging our own Kool Aid or Champagne Dawn. However you want to say Teoh, in fact, tune up and solve, um, some optimization for our customers automatically, and I think you're going to see that continue. And I think that could work really well in a bunch of different wallets. >>Welcome. Just on a personal note, I've always enjoyed our conversations. I've learned a lot from you over the years. I'm bummed that we can't hang out in Boston, but hopefully soon, uh, this will blow over. I loved last summer when we got together. We had the verdict throwback. We had Stone Breaker, Palmer, Lynch and Mahoney. We did a great series, and that was a lot of fun. So it's really it's a pleasure. And thanks so much. Stay safe out there and, uh, we'll talk to you soon. >>Yeah, you too did stay safe. I really appreciate it up. Unity and, you know, this is what it's all about. It's Ah, it's a lot of fun. I know we're going to see each other in person soon, and it's the people in the community that really make this happen. So looking forward to that, but I really appreciate it. >>Alright. And thank you, everybody for watching. This is the Cube coverage of the verdict. Big data conference gone, virtual going digital. I'm Dave Volante. 
We'll be right back right after this short break. >>Yeah.
ENTITIES
Entity | Category | Confidence |
---|---|---|
Colin Mahoney | PERSON | 0.99+ |
Dave Volante | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
Boston | LOCATION | 0.99+ |
Joe | PERSON | 0.99+ |
Colin Mahony | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
uber | ORGANIZATION | 0.99+ |
three | QUANTITY | 0.99+ |
ORGANIZATION | 0.99+ | |
python | TITLE | 0.99+ |
hundreds | QUANTITY | 0.99+ |
Ferrari | ORGANIZATION | 0.99+ |
10 | QUANTITY | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
one | QUANTITY | 0.99+ |
2.5 years | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
Kool Aid | ORGANIZATION | 0.99+ |
Vertical Colin | ORGANIZATION | 0.99+ |
10th | QUANTITY | 0.99+ |
Both | QUANTITY | 0.99+ |
Micro Focus | ORGANIZATION | 0.98+ |
each customer | QUANTITY | 0.98+ |
Moore | PERSON | 0.98+ |
America | LOCATION | 0.98+ |
this year | DATE | 0.98+ |
one platform | QUANTITY | 0.97+ |
today | DATE | 0.96+ |
One | QUANTITY | 0.96+ |
10 | TITLE | 0.96+ |
Vertica | ORGANIZATION | 0.96+ |
last summer | DATE | 0.95+ |
third areas | QUANTITY | 0.94+ |
one thing | QUANTITY | 0.93+ |
Vertical | ORGANIZATION | 0.92+ |
this year | DATE | 0.92+ |
single point | QUANTITY | 0.92+ |
Big Data Conference 2020 | EVENT | 0.92+ |
Arctic | ORGANIZATION | 0.91+ |
Hadoop | ORGANIZATION | 0.89+ |
three major clouds | QUANTITY | 0.88+ |
H DFs | ORGANIZATION | 0.86+ |
Cloud Data Lakes | TITLE | 0.86+ |
Stone Breaker | ORGANIZATION | 0.86+ |
one huge advantage | QUANTITY | 0.86+ |
Hadoop | TITLE | 0.85+ |
BDC | EVENT | 0.83+ |
day one | QUANTITY | 0.83+ |
Version 10 | TITLE | 0.83+ |
Cube | COMMERCIAL_ITEM | 0.82+ |
Google Cloud | TITLE | 0.82+ |
BDC 2020 | EVENT | 0.81+ |
thing | QUANTITY | 0.79+ |
Bernie | PERSON | 0.79+ |
first | QUANTITY | 0.79+ |
over 10 x | QUANTITY | 0.78+ |
Prem | ORGANIZATION | 0.78+ |
one vertical | QUANTITY | 0.77+ |
Virtual Vertica | ORGANIZATION | 0.77+ |
Verdict | ORGANIZATION | 0.75+ |
SAS | ORGANIZATION | 0.75+ |
Champagne Dawn | ORGANIZATION | 0.73+ |
every single release | QUANTITY | 0.72+ |
Perfect | TITLE | 0.71+ |
years | QUANTITY | 0.7+ |
last 10 years | DATE | 0.69+ |
Palmer | ORGANIZATION | 0.67+ |
Tensorflow | TITLE | 0.65+ |
single release | QUANTITY | 0.65+ |
a minute | QUANTITY | 0.64+ |
Advisor Tool | TITLE | 0.63+ |
customers | QUANTITY | 0.62+ |
Ben White, Domo | Virtual Vertica BDC 2020
>> Announcer: It's theCUBE covering the Virtual Vertica Big Data Conference 2020, brought to you by Vertica. >> Hi, everybody. Welcome to this digital coverage of the Vertica Big Data Conference. You're watching theCUBE and my name is Dave Volante. It's my pleasure to invite in Ben White, who's the Senior Database Engineer at Domo. Ben, great to see you, man. Thanks for coming on. >> Great to be here and here. >> You know, as I said, you know, earlier when we were off-camera, I really was hoping I could meet you face-to-face in Boston this year, but hey, I'll take it, and, you know, our community really wants to hear from experts like yourself. But let's start with Domo as the company. Share with us what Domo does and what your role is there. >> Well, if I can go straight to the official what Domo does is we provide, we process data at BI scale, we-we-we provide BI leverage at cloud scale in record time. And so what that means is, you know, we are a business-operating system where we provide a number of analytical abilities to companies of all sizes. But we do that at cloud scale and so I think that differentiates us quite a bit. >> So a lot of your work, if I understand it, and just in terms of understanding what Domo does, there's a lot of pressure in terms of being real-time. It's not, like, you sometimes don't know what's coming at you, so it's ad-hoc. I wonder if you could sort of talk about that, confirm that, maybe add a little color to it. >> Yeah, absolutely, absolutely. That's probably the biggest challenge it is to being, to operating Domo is that it is an ad hoc environment. And certainly what that means, is that you've got analysts and executives that are able to submit their own queries with out very... With very few limitations. So from an engineering standpoint, that challenge in that of course is that you don't have this predictable dashboard to plan for, when it comes to performance planning. So it definitely presents some challenges for us that we've done some pretty unique things, I think, to address those. >> So it sounds like your background fits well with that. I understand your people have called you a database whisperer and an envelope pusher. What does that mean to a DBA in this day and age? >> The whisperer part is probably a lost art, in the sense that it's not really sustainable, right? The idea that, you know, whatever it is I'm able to do with the database, it has to be repeatable. And so that's really where analytics comes in, right? That's where pushing the envelope comes in. And in a lot of ways that's where Vertica comes in with this open architecture. And so as a person who has a reputation for saying, "I understand this is what our limitations should be, but I think we can do more." Having a platform like Vertica, with such an open architecture, kind of lets you push those limits quite a bit. >> I mean I've always felt like, you know, Vertica, when I first saw the stone breaker architecture and talked to some of the early founders, I always felt like it was the Ferrari of databases, certainly at the time. And it sounds like you guys use it in that regard. But talk a little bit more about how you use Vertica, why, you know, why MPP, why Vertica? You know, why-why can't you do this with RDBMS? Educate us, a little bit, on, sort of, the basics. >> For us it was, part of what I mentioned when we started, when we talked about the very nature of the Domo platform, where there's an incredible amount of resiliency required. 
And so Vertica, the MPP platform, of course, allows us to build individual database clusters that can perform best for the workload that might be assigned to them. So the open, the expandable, the... The-the ability to grow Vertica, right, as your base grows, those are all important factors, when you're choosing early on, right? Without a real idea of how growth would be or what it will look like. If you were kind of, throwing up something to the dark, you look at the Vertica platform and you can see, well, as I grow, I can, kind of, build with this, right? I can do some unique things with the platform in terms of this open architecture that will allow me to not have to make all my decisions today, right? (mutters) >> So, you're using Vertica, I know, at least in part, you're working with AWS as well, can you describe sort of your environment? Do you give anything on-prem, is everything in cloud? What's your set up look like? >> Sure, we have a hybrid cloud environment where we have a significant presence in public files in our own private cloud. And so, yeah, having said that, we certainly have a really an extensive presence, I would say, in AWS. So, they're definitely the partner of our when it comes to providing the databases and the server power that we need to operate on. >> From a standpoint of engineering and architecting a database, what were some of the challenges that you faced when you had to create that hybrid architecture? What did you face and how did you overcome that? >> Well, you know, some of the... There were some things we faced in terms of, one, it made it easy that Vertica and AWS have their own... They play well together, we'll say that. And so, Vertica was designed to work on AWS. So that part of it took care of it's self. Now our own private cloud and being able to connect that to our public cloud has been a part of our own engineering abilities. And again, I don't want to make little, make light of it, it certainly not impossible. And so we... Some of the challenges that pertain to the database really were in the early days, that you mentioned, when we talked a little bit earlier about Vertica's most recent eon mode. And I'm sure you'll get to that. But when I think of early challenges, some of the early challenges were the architecture of enterprise mode. When I talk about all of these, this idea that we can have unique databases or database clusters of different sizes, or this elasticity, because really, if you know the enterprise architecture, that's not necessarily the enterprise architecture. So we had to do some unique things, I think, to overcome that, right, early. To get around the rigidness of enterprise. >> Yeah, I mean, I hear you. Right? Enterprise is complex and you like when things are hardened and fossilized but, in your ad hoc environment, that's not what you needed. So talk more about eon mode. What is eon mode for you and how do you apply it? What are some of the challenges and opportunities there, that you've found? >> So, the opportunities were certainly in this elastic architecture and the ability to separate in the storage, immediately meant that for some of the unique data paths that we wanted to take, right? We could do that fairly quickly. Certainly we could expand databases, right, quickly. More importantly, now you can reduce. Because previously, in the past, right, when I mentioned the enterprise architecture, the idea of growing a database in itself has it's pain. As far as the time it takes to (mumbles) the data, and that. 
Then think about taking that database back down and (telephone interference). All of a sudden, with eon, right, we had this elasticity, where you could, kind of, start to think about auto scaling, where you can go up and down and maybe you could save some money or maybe you could improve performance or maybe you could meet demand, At a time where customers need it most, in a real way, right? So it's definitely a game changer in that regard. >> I always love to talk to the customers because I get to, you know, I hear from the vendor, what they say, and then I like to, sort of, validate it. So, you know, Vertica talks a lot about separating compute and storage, and they're not the only one, from an architectural standpoint who do that. But Vertica stresses it. They're the only one that does that with a hybrid architecture. They can do it on-prem, they can do it in the cloud. From your experience, well first of all, is that true? You may or may not know, but is that advantageous to you, and if so, why? >> Well, first of all, it's certainly true. Earlier in some of the original beta testing for the on-prem eon modes that we... I was able to participate in it and be aware of it. So it certainly a realty, they, it's actually supported on Pure storage with FlashBlade and it's quite impressive. You know, for who, who will that be for, tough one. It's probably Vertica's question that they're probably still answering, but I think, obviously, some enterprise users that probably have some hybrid cloud, right? They have some architecture, they have some hardware, that they themselves, want to make use of. We certainly would probably fit into one of their, you know, their market segments. That they would say that we might be the ones to look at on-prem eon mode. Again, the beauty of it is, the elasticity, right? The idea that you could have this... So a lot of times... So I want to go back real quick to separating compute. >> Sure. Great. >> You know, we start by separating it. And I like to think of it, maybe more of, like, the up link. Because in a true way, it's not necessarily separated because ultimately, you're bringing the compute and the storage back together. But to be able to decouple it quickly, replace nodes, bring in nodes, that certainly fits, I think, what we were trying to do in building this kind of ecosystem that could respond to unknown of a customer query or of a customer demand. >> I see, thank you for that clarification because you're right, it's really not separating, it's decoupling. And that's important because you can scale them independently, but you still need compute and you still need storage to run your work load. But from a cost standpoint, you don't have to buy it in chunks. You can buy in granular segments for whatever your workload requires. Is that, is that the correct understanding? >> Yeah, and to, the ability to able to reuse compute. So in the scenario of AWS or even in the scenario of your on-prem solution, you've got this data that's safe and secure in (mumbles) computer storage, but the compute that you have, you can reuse that, right? You could have a scenario that you have some query that needs more analytic, more-more fire power, more memory, more what have you that you have. And so you can kind of move between, and that's important, right? That's maybe more important than can I grow them separately. Can I, can I borrow it. Can I borrow that compute you're using for my (cuts out) and give it back? 
And you can do that, when you're so easily able to decouple the compute and put it where you want, right? And likewise, if you have a down period where customers aren't using it, you'd like to be able to not use that, if you no longer require it, you're not going to get it back. 'Cause it-it opened the door to a lot of those things that allowed performance and process department to meet up. >> I wonder if I can ask you a question, you mentioned Pure a couple of times, are you using Pure FlashBlade on-prem, is that correct? >> That is the solution that is supported, that is supported by Vertica for the on-prem. (cuts out) So at this point, we have been discussing with them about some our own POCs for that. Before, again, we're back to the idea of how do we see ourselves using it? And so we certainly discuss the feasibility of bringing it in and giving it the (mumbles). But that's not something we're... Heavily on right now. >> And what is Domo for Domo? Tell us about that. >> Well it really started as this idea, even in the company, where we say, we should be using Domo in our everyday business. From the sales folk to the marketing folk, right. Everybody is going to use Domo, it's a business platform. For us in engineering team, it was kind of like, well if we use Domo, say for instance, to be better at the database engineers, now we've pointed Domo at itself, right? Vertica's running Domo in the background to some degree and then we turn around and say, "Hey Domo, how can we better at running you?" So it became this kind of cool thing we'd play with. We're now able to put some, some methods together where we can actually do that, right. Where we can monitor using our platform, that's really good at processing large amounts of data and spitting out useful analytics, right. We take those analytics down, make recommendation changes at the-- For now, you've got Domo for Domo happening and it allows us to sit at home and work. Now, even when we have to, even before we had to. >> Well, you know, look. Look at us here. Right? We couldn't meet in Boston physically, we're now meeting remote. You're on a hot spot because you've got some weather in your satellite internet in Atlanta and we're having a great conversation. So-so, we're here with Ben White, who's a senior database engineer at Domo. I want to ask you about some of the envelope pushing that you've done around autonomous. You hear that word thrown around a lot. Means a lot of things to a lot of different people. How do you look at autonomous? And how does it fit with eon and some of the other things you're doing? >> You know, I... Autonomous and the idea idea of autonomy is something that I don't even know if that I have already, ready to define. And so, even in my discussion, I often mention it as a road to it. Because exactly where it is, it's hard to pin down, because there's always this idea of how much trust do you give, right, to the system or how much, how much is truly autonomous? How much already is being intervened by us, the engineers. So I do hedge on using that. But on this road towards autonomy, when we look at, what we're, how we're using Domo. And even what that really means for Vertica, because in a lot of my examples and a lot of the things that we've engineered at Domo, were designed to maybe overcome something that I thought was a limitation thing. And so many times as we've done that, Vertica has kind of met us. 
Like right after we've kind of engineered our architecture stuff, that we thought that could help on our side, Vertica has a release that kind of addresses it. So, the autonomy idea and the idea that we could analyze metadata, make recommendations, and then execute those recommendations without innervation, is that road to autonomy. Once the database is properly able to do that, you could see in our ad hoc environment how that would be pretty useful, where with literally millions of queries every hour, trying to figure out what's the best, you know, profile. >> You know for- >> (overlapping) probably do a better job in that, than we could. >> For years I felt like IT folks sometimes were really, did not want that automation, they wanted the knobs to turn. But I wonder if you can comment. I feel as though the level of complexity now, with cloud, with on-prem, with, you know, hybrid, multicloud, the scale, the speed, the real time, it just gets, the pace is just too much for humans. And so, it's almost like the industry is going to have to capitulate to the machine. And then, really trust the machine. But I'm still sensing, from you, a little bit of hesitation there, but light at the end of the tunnel. I wonder if you can comment? >> Sure. I think the light at the end of the tunnel is even in the recent months and recent... We've really begin to incorporate more machine learning and artificial intelligence into the model, right. And back to what we're saying. So I do feel that we're getting closer to finding conditions that we don't know about. Because right now our system is kind of a rule, rules based system, where we've said, "Well these are the things we should be looking for, these are the things that we think are a problem." To mature to the point where the database is recognizing anomalies and taking on pattern (mutters). These are problems you didn't know happen. And that's kind of the next step, right. Identifying the things you didn't know. And that's the path we're on now. And it's probably more exciting even than, kind of, nailing down all the things you think you know. We figure out what we don't know yet. >> So I want to close with, I know you're a prominent member of the, a respected member of the Vertica Customer Advisory Board, and you know, without divulging anything confidential, what are the kinds of things that you want Vertica to do going forward? >> Oh, I think, some of the in dated base for autonomy. The ability to take some of the recommendations that we know can derive from the metadata that already exists in the platform and start to execute some of the recommendations. And another thing we've talked about, and I've been pretty open about talking to it, talking about it, is the, a new version of the database designer, I think, is something that I'm sure they're working on. Lightweight, something that can give us that database design without the overhead. Those are two things, I think, as they nail or basically the database designer, as they respect that, they'll really have all the components in play to do in based autonomy. And I think that's, to some degree, where they're heading. >> Nice. Well Ben, listen, I really appreciate you coming on. You're a thought leader, you're very open, open minded, Vertica is, you know, a really open community. I mean, they've always been quite transparent in terms of where they're going. It's just awesome to have guys like you on theCUBE to-to share with our community. 
So thank you so much and hopefully we can meet face-to-face shortly. >> Absolutely. Well you stay safe in Boston, one of my favorite towns and so no doubt, when the doors get back open, I'll be coming down. Or coming up as it were. >> Take care. All right, and thank you for watching everybody. Dave Volante with theCUBE, we're here covering the Virtual Vertica Big Data Conference. (electronic music)
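Ben's point about Eon mode, that compute can be added or removed against communal storage to meet demand or save money, comes down to a scaling policy. Here is a minimal sketch of such a policy, with invented node limits and thresholds and without any real cluster-management calls; in practice you would drive Vertica's own tooling or your cloud provider's APIs to actually add or remove nodes.

```python
import math

def target_node_count(queued_queries: int, running_queries: int,
                      min_nodes: int = 3, max_nodes: int = 12,
                      queries_per_node: int = 4) -> int:
    """Pick how many compute nodes a subcluster should have for the current load."""
    demand = queued_queries + running_queries
    wanted = math.ceil(demand / queries_per_node) if demand else min_nodes
    return max(min_nodes, min(max_nodes, wanted))

def reconcile(current_nodes: int, queued: int, running: int) -> str:
    """Return the scaling action a hypothetical operator or autoscaler would take."""
    target = target_node_count(queued, running)
    if target > current_nodes:
        return f"scale up: add {target - current_nodes} node(s)"
    if target < current_nodes:
        return f"scale down: remove {current_nodes - target} node(s)"
    return "no change"

if __name__ == "__main__":
    # Evening lull vs. morning rush, both against the same communal storage.
    print(reconcile(current_nodes=9, queued=0, running=2))    # scale down
    print(reconcile(current_nodes=3, queued=30, running=10))  # scale up
```

Because the data never moves, the only cost of getting the policy wrong is temporary over- or under-provisioning of compute, which is exactly the elasticity Ben describes.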
ENTITIES
Entity | Category | Confidence |
---|---|---|
AWS | ORGANIZATION | 0.99+ |
Dave Volante | PERSON | 0.99+ |
Ben White | PERSON | 0.99+ |
Boston | LOCATION | 0.99+ |
Vertica | ORGANIZATION | 0.99+ |
Atlanta | LOCATION | 0.99+ |
Ferrari | ORGANIZATION | 0.99+ |
Domo | ORGANIZATION | 0.99+ |
Vertica Customer Advisory Board | ORGANIZATION | 0.99+ |
Ben | PERSON | 0.99+ |
two things | QUANTITY | 0.98+ |
this year | DATE | 0.98+ |
Vertica | TITLE | 0.98+ |
theCUBE | ORGANIZATION | 0.97+ |
Vertica Big Data Conference | EVENT | 0.97+ |
Domo | TITLE | 0.97+ |
Domo | PERSON | 0.96+ |
Virtual Vertica Big Data Conference | EVENT | 0.96+ |
Virtual Vertica Big Data Conference 2020 | EVENT | 0.96+ |
first | QUANTITY | 0.95+ |
eon | TITLE | 0.92+ |
one | QUANTITY | 0.87+ |
today | DATE | 0.87+ |
millions of queries | QUANTITY | 0.84+ |
FlashBlade | TITLE | 0.82+ |
Virtual Vertica | EVENT | 0.75+ |
couple | QUANTITY | 0.7+ |
Pure FlashBlade | COMMERCIAL_ITEM | 0.58+ |
BDC 2020 | EVENT | 0.56+ |
MPP | TITLE | 0.55+ |
times | QUANTITY | 0.51+ |
RDBMS | TITLE | 0.48+ |
Joy King, Vertica | Virtual Vertica BDC 2020
>> Announcer: It's theCUBE covering the Virtual Vertica Big Data Conference 2020, brought to you by Vertica. >> Welcome back, everybody. My name is Dave Vellante, and you're watching theCUBE's coverage of the Vertica Virtual Big Data Conference. theCUBE has been at every BDC, and it's our pleasure in these difficult times to be covering the BDC as a virtual event. For this digital program we're really excited to have Joy King joining us. Joy is the vice president of product and go-to-market strategy at Vertica, and if that weren't enough, she also runs marketing and education curricula for the company. So, Joy, you're a multi-tool player. You've got the technical side and the marketing gene, so welcome to theCUBE. You're always a great guest. Love to have you on. >> Thank you so much, David. The pleasure, it really is. >> So I want to get in, you know, we'll have some time. We've been talking about the conference and the virtual event, but I really want to dig in to the product stuff. It's a big day for you guys. You announced 10.0. But before we get into the announcements, step back a little bit. You know, you guys are riding the waves. I've said to a number of our guests that Vertica has always been good at riding the waves, not only the initial MPP wave, but you embraced HDFS, you embraced data science and analytics, and the cloud. So what are the trends that you see, the big waves that you're riding? >> Well, you're absolutely right, Dave. I mean, what I think is most interesting and important is that Vertica is, at its core, a true engineering culture, founded by, well, a pretty famous guy, right, Dr. Stonebraker, who embedded that very technical Vertica engineering culture. It means that we don't pretend to know everything that's coming, but we are committed to embracing the technology trends, the innovations, things like that. We don't pretend to know it all. We just do it all. So right now, I think I see three big imminent trends that we are addressing, and as a matter of fact we have been for a while, but they are particularly relevant right now. The first is a combination of, I guess, a disappointment in what Hadoop was able to deliver. I always feel a little guilty, because she's a very reasonably capable elephant. She was designed to be HDFS, a highly distributed file store, but she can't be an entire zoo, so there's a lot of disappointment in the market, but a lot of data in HDFS. You combine that with the explosion of cloud object storage, and you're talking about even more data, but even more data silos. So data growth and data silos is trend one. Then what I would say trend two is the cloud reality. Cloud brings so many advantages. There are so many opportunities that public cloud computing delivers. But I think we've learned enough now to know that there's also some reality. The cloud providers themselves, Dave, don't talk about it well, because, I mean, is it more agile? Can you do things without having to manage your own data center? Of course you can. But the reality is it's a little more pricey than we expected. There are some security and privacy concerns. There are some workloads that can't go to the cloud. So hybrid and also multi-cloud deployments are the next trend, and they are mandatory. And then maybe the one that is the most exciting in terms of changing the world, and we could use a little change right now, is operationalizing machine learning.
There's so much potential in the technology, but it's somehow has been stuck for the most part in science projects and data science lab, and the time is now to operationalize it. Those are the three big trends that vertical is focusing on right now. >>That's great. I wonder if I could ask you a couple questions about that. I mean, I like you have a soft spot in my heart for the and the thing about the Hadoop that that was, I think, profound was it got people thinking about, you know, bringing compute to the data and leaving data in place, and it really got people thinking about data driven cultures. It didn't solve all the problems, but it collected a lot of data that we can now take your third trend and apply machine intelligence on top of that data. And then the cloud is really the ability to scale, and it gives you that agility and that it's not really that cloud experience. It's not not just the cloud itself, it's bringing the cloud experience to wherever the data lives. And I think that's what I'm hearing from you. Those are the three big super powers of innovation today. >>That's exactly right. So, you know, I have to say I think we all know that Data Analytics machine learning none of that delivers real value unless the volume of data is there to be able to truly predict and influence the future. So the last 7 to 10 years has been correctly about collecting the data, getting the data into a common location, and H DFS was well designed for that. But we live in a capitalist world, and some companies stepped in and tried to make HD Fs and the broader Hadoop ecosystem be the single solution to big data. It's not true. So now that the key is, how do we take advantage of all of that data? And now that's exactly what verdict is focusing on. So as you know, we began our journey with vertical back in the day in 2007 with our first release, and we saw the growth of the dupe. So we announced many years ago verdict a sequel on that. The idea to be able to deploy vertical on Hadoop nodes and query the data in Hadoop. We wanted to help. Now with Verdict A 10. We are also introducing vertical in eon mode, and we can talk more about that. But Verdict and Ian Mode for HDs, This is a way to apply it and see sequel database management platform to H DFS infrastructure and data in each DFS file storage. And that is a great way to leverage the investment that so many companies have made in HD Fs. And I think it's fair to the elephant to treat >>her well. Okay, let's get into the hard news and auto. Um, she's got, but you got a mature stack, but one of the highlights of append auto. And then we can drill into some of the technologies >>Absolutely so in well in 2018 vertical announced vertical in Deon mode is the separation of compute from storage. Now this is a great example of vertical embracing innovation. Vertical was designed for on premises, data centers and bare metal servers, tightly coupled storage de l three eighties from Hewlett Packard Enterprises, Dell, etcetera. But we saw that cloud computing was changing fundamentally data center architectures, and it made sense to separate compute from storage. So you add compute when you need compute. You add storage when you need storage. That's exactly what the cloud's introduced, but it was only available on the club. So first thing we did was architect vertical and EON mode, which is not a new product. Eight. This is really important. It's a deployment option. 
And in 2018 our customers had the opportunity to deploy their vertical licenses in EON mode on AWS in September of 2019. We then broke an important record. We brought cloud architecture down to earth and we announced vertical in eon mode so vertical with communal or shared storage, leveraging pure storage flash blade that gave us all the advantages of separating compute from storage. All of the workload, isolation, the scale up scale down the ability to manage clusters. And we did that with on Premise Data Center. And now, with vertical 10 we are announcing verdict in eon mode on HD fs and vertically on mode on Google Cloud. So what we've got here, in summary, is vertical Andy on mode, multi cloud and multiple on premise data that storage, and that gives us the opportunity to help our customers both with the hybrid and multi cloud strategies they have and unifying their data silos. But America 10 goes farther. >>Well, let me stop you there, because I just wanna I want to mention So we talked to Joe Gonzalez and past Mutual, who essentially, he was brought in. And one of this task was the lead into eon mode. Why? Because I'm asking. You still had three separate data silos and they wanted to bring those together. They're investing heavily in technology. Joe is an expert, though that really put data at their core and beyond Mode was a key part of that because they're using S three and s o. So that was Ah, very important step for those guys carry on. What else do we need to know about? >>So one of the reasons, for example, that Mass Mutual is so excited about John Mode is because of the operational advantages. You think about exactly what Joe told you about multiple clusters serving must multiple use cases and maybe multiple divisions. And look, let's be clear. Marketing doesn't always get along with finance and finance doesn't necessarily get along with up, and I t is often caught the middle. Erica and Dion mode allows workload, isolation, meaning allocating the compute resource is that different use cases need without allowing them to interfere with other use cases and allowing everybody to access the data. So it's a great way to bring the corporate world together but still protect them from each other. And that's one of the things that Mass Mutual is going to benefit from, as well, so many of >>our other customers I also want to mention. So when I saw you, ah, last last year at the Pure Storage Accelerate conference just today we are the only company that separates you from storage that that runs on Prem and in the cloud. And I was like I had to think about it. I've researched. I still can't find anybody anybody else who doesn't know. I want to mention you beat actually a number of the cloud players with that capability. So good job and I think is a differentiator, assuming that you're giving me that cloud experience and the licensing and the pricing capability. So I want to talk about that a little >>bit. Well, you're absolutely right. So let's be clear. There is no question that the public cloud public clouds introduced the separation of compute storage and these advantages that they do not have the ability or the interest to replicate that on premise for vertical. We were born to be software only. We make no money on underlying infrastructure. We don't charge as a package for the hardware underneath, so we are totally motivated to be independent of that and also to continuously optimize the software to be as efficient as possible. 
And we do the exact same thing to your question about life. Cloud providers charge for note indignance. That's how they charge for their underlying infrastructure. Well, in some cases, if you're being, if you're talking about a use case where you have a whole lot of data, but you don't necessarily have a lot of compute for that workload, it may make sense to pay her note. Then it's unlimited data. But what if you have a huge compute need on a relatively small data set that's not so good? Vertical offers per node and four terabyte for our customers, depending on their use case, we also offer perpetual licenses for customers who want capital. But we also offer subscription for companies that they Nope, I have to have opt in. And while this can certainly cause some complexity for our field organization, we know that it's all about choice, that everybody in today's world wants it personalized just for me. And that's exactly what we're doing with our pricing in life. >>So just to clarify, you're saying I can pay by the drink if I want to. You're not going to force me necessarily into a term or Aiken choose to have, you know, more predictable pricing. Is that, Is that correct? >>Well, so it's partially correct. The first verdict, a subscription licensing is a fixed amount for the period of the subscription. We do that so many of our customers cannot, and I'm one of them, by the way, cannot tell finance what the budgets forecast is going to be for the quarter after I spent you say what it's gonna be before, So our subscription facing is a fixed amount for a period of time. However, we do respect the fact that some companies do want usage based pricing. So on AWS, you can use verdict up by the hour and you pay by the hour. We are about to launch the very same thing on Google Cloud. So for us, it's about what do you need? And we make it happen natively directly with us or through AWS and Google Cloud. >>So I want to send so the the fixed isn't some floor. And then if you want a surge above that, you can allow usage pricing. If you're on the cloud, correct. >>Well, you actually license your cluster vertical by the hour on AWS and you run your cluster there. Or you can buy a license from vertical or a fixed capacity or a fixed number of nodes and deploy it on the cloud. And then, if you want to add more nodes or add more capacity, you can. It's not usage based for the license that you bring to the cloud. But if you purchase through the cloud provider, it is usage. >>Yeah, okay. And you guys are in the marketplace. Is that right? So, again, if I want up X, I can do that. I can choose to do that. >>That's awesome. Next usage through the AWS marketplace or yeah, directly from vertical >>because every small business who then goes to a salesforce management system knows this. Okay, great. I can pay by the month. Well, yeah, Well, not really. Here's our three year term in it, right? And it's very frustrating. >>Well, and even in the public cloud you can pay for by the hour by the minute or whatever, but it becomes pretty obvious that you're better off if you have reserved instance types or committed amounts in that by vertical offers subscription. That says, Hey, you want to have 100 terabytes for the next year? Here's what it will cost you. We do interval billing. You want to do monthly orderly bi annual will do that. But we won't charge you for usage that you didn't even know you were using until after you get the bill. And frankly, that's something my finance team does not like. 
>>Yeah, I think you know, I know this is kind of a wonky discussion, but so many people gloss over the licensing and the pricing, and I think my take away here is Optionality. You know, pricing your way of That's great. Thank you for that clarification. Okay, so you got Google Cloud? I want to talk about storage. Optionality. If I found him up, I got history. I got I'm presuming Google now of you you're pure >>is an s three compatible storage yet So your story >>Google object store >>like Google object store Amazon s three object store HD fs pure storage flash blade, which is an object store on prim. And we are continuing on this theft because ultimately we know that our customers need the option of having next generation data center architecture, which is sort of shared or communal storage. So all the data is in one place. Workloads can be managed independently on that data, and that's exactly what we're doing. But what we already have in two public clouds and to on premise deployment options today. And as you said, I did challenge you back when we saw each other at the conference. Today, vertical is the only analytic data warehouse platform that offers that option on premise and in multiple public clouds. >>Okay, let's talk about the ah, go back through the innovation cocktail. I'll call it So it's It's the data applying machine intelligence to that data. And we've talked about scaling at Cloud and some of the other advantages of Let's Talk About the Machine Intelligence, the machine learning piece of it. What's your story there? Give us any updates on your embracing of tooling and and the like. >>Well, quite a few years ago, we began building some in database native in database machine learning algorithms into vertical, and the reason we did that was we knew that the architecture of MPP Columbia execution would dramatically improve performance. We also knew that a lot of people speak sequel, but at the time, not so many people spoke R or even Python. And so what if we could give act us to machine learning in the database via sequel and deliver that kind of performance? So that's the journey we started out. And then we realized that actually, machine learning is a lot more as everybody knows and just algorithms. So we then built in the full end to end machine learning functions from data preparation to model training, model scoring and evaluation all the way through to fold the point and all of this again sequel accessible. You speak sequel. You speak to the data and the other advantage of this approach was we realized that accuracy was compromised if you down sample. If you moved a portion of the data from a database to a specialty machine learning platform, you you were challenged by accuracy and also what the industry is calling replica ability. And that means if a model makes a decision like, let's say, credit scoring and that decision isn't anyway challenged, well, you have to be able to replicate it to prove that you made the decision correctly. And there was a bit of, ah, you know, blow up in the media not too long ago about a credit scoring decision that appeared to be gender bias. But unfortunately, because the model could not be replicated, there was no way to this Prove that, and that was not a good thing. So all of this is built in a vertical, and with vertical 10. We've taken the next step, just like with with Hadoop. We know that innovation happens within vertical, but also outside of vertical. We saw that data scientists really love their preferred language. 
Like python, they love their tools and platforms like tensor flow with vertical 10. We now integrate even more with python, which we have for a while, but we also integrate with tensorflow integration and PM ML. What does that mean? It means that if you build and train a model external to vertical, using the machine learning platform that you like, you can import that model into a vertical and run it on the full end to end process. But run it on all the data. No more accuracy challenges MPP Kilometer execution. So it's blazing fast. And if somebody wants to know why a model made a decision, you can replicate that model, and you can explain why those are very powerful. And it's also another cultural unification. Dave. It unifies the business analyst community who speak sequel with the data scientist community who love their tools like Tensorflow and Python. >>Well, I think joy. That's important because so much of machine intelligence and ai there's a black box problem. You can't replicate the model. Then you do run into a potential gender bias. In the example that you're talking about there in their many you know, let's say an individual is very wealthy. He goes for a mortgage and his wife goes for some credit she gets rejected. He gets accepted this to say it's the same household, but the bias in the model that may be gender bias that could be race bias. And so being able to replicate that in and open up and make the the machine intelligence transparent is very, very important, >>It really is. And that replica ability as well as accuracy. It's critical because if you're down sampling and you're running models on different sets of data, things can get confusing. And yet you don't really have a choice. Because if you're talking about petabytes of data and you need to export that data to a machine learning platform and then try to put it back and get the next at the next day, you're looking at way too much time doing it in the database or training the model and then importing it into the database for production. That's what vertical allows, and our customers are. So it right they reopens. Of course, you know, they are the ones that are sort of the Trailblazers they've always been, and ah, this is the next step. In blazing the ML >>thrill joint customers want analytics. They want functional analytics full function. Analytics. What are they pushing you for now? What are you delivering? What's your thought on that? >>Well, I would say the number one thing that our customers are demanding right now is deployment. Flexibility. What? What the what the CEO or the CFO mandated six months ago? Now shout Whatever that thou shalt is is different. And they would, I tell them is it is impossible. No, what you're going to be commanded to do or what options you might have in the future. The key is not having to choose, and they are very, very committed to that. We have a large telco customer who is multi cloud as their commit. Why multi cloud? Well, because they see innovation available in different public clouds. They want to take advantage of all of them. They also, admittedly, the that there's the risk of lock it right. Like any vendor, they don't want that either, so they want multi cloud. We have other customers who say we have some workloads that make sense for the cloud and some that we absolutely cannot in the cloud. But we want a unified analytics strategy, so they are adamant in focusing on deployment flexibility. 
That's what I'd say is 1st 2nd I would say that the interest in operationalize in machine learning but not necessarily forcing the analytics team to hammer the data science team about which tools or the best tools. That's the probably number two. And then I'd say Number three. And it's because when you look at companies like Uber or the Trade Desk or A T and T or Cerner performance at scale, when they say milliseconds, they think that flow. When they say petabytes, they're like, Yeah, that was yesterday. So performance at scale good enough for vertical is never good enough. And it's why we're constantly building at the core the next generation execution engine, database designer, optimization engine, all that stuff >>I wanna also ask you. When I first started following vertical, we covered the cube covering the BBC. One of things I noticed was in talking to customers and people in the community is that you have a community edition, uh, free addition, and it's not neutered ais that have you maintain that that ethos, you know, through the transitions into into micro focus. And can you talk about that a little bit >>absolutely vertical community edition is vertical. It's all of the verdict of functionality geospatial time series, pattern matching, machine learning, all of the verdict, vertical neon mode, vertical and enterprise mode. All vertical is the community edition. The only limitation is one terabyte of data and three notes, and it's free now. If you want commercial support, where you can file a support ticket and and things like that, you do have to buy the life. But it's free, and we people say, Well, free for how long? Like our field? I've asked that and I say forever and what he said, What do you mean forever? Because we want people to use vertical for use cases that are small. They want to learn that they want to try, and we see no reason to limit that. And what we look for is when they're ready to grow when they need the next set of data that goes beyond a terabyte or they need more compute than three notes, then we're here for them, and it also brings up an important thing that I should remind you or tell you about Davis. You haven't heard it, and that's about the Vertical Academy Academy that vertical dot com well, what is that? That is, well, self paced on demand as well as vertical essential certification. Training and certification means you have seven days with your hands on a vertical cluster hosted in the cloud to go through all the certification. And guess what? All of that is free. Why why would you give it for free? Because for us empowering the market, giving the market the expert East, the learning they need to take advantage of vertical, just like with Community Edition is fundamental to our mission because we see the advantage that vertical can bring. And we want to make it possible for every company all around the world that take advantage >>of it. I love that ethos of vertical. I mean, obviously great product. But it's not just the product. It's the business practices and really progressive progressive pricing and embracing of all these trends and not running away from the waves but really leaning in joy. Thanks so much. Great interview really appreciate it. And, ah, I wished we could have been faced face in Boston, but I think it's prudent thing to do, >>I promise you, Dave we will, because the verdict of BTC and 2021 is already booked. So I will see you there. >>Haas enjoyed King. Thanks so much for coming on the Cube. And thank you for watching. 
Remember, theCUBE is running this program in conjunction with the Virtual Vertica BDC. Go to vertica.com/bdc2020 for all the coverage, and keep it right there. This is Dave Vellante with theCUBE. We'll be right back.
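Joy's point about running machine learning end to end inside the database, rather than down-sampling and exporting the data to a separate platform, can be sketched with a short script. This is a hedged illustration, not an excerpt from Vertica's documentation: the connection details and the house_prices table are invented, and the LINEAR_REG / PREDICT_LINEAR_REG function names are written from memory of Vertica's in-database ML library, so verify them against the docs for your version.

```python
import vertica_python

# Placeholder connection details.
conn_info = {
    "host": "vertica.example.com",
    "port": 5433,
    "user": "dbadmin",
    "password": "********",
    "database": "analytics",
}

# Train in the database against a hypothetical table house_prices(sqft, bedrooms, price).
# (Re-running requires dropping the model first.)
TRAIN = """
SELECT LINEAR_REG('price_model', 'house_prices', 'price', 'sqft, bedrooms')
"""

# Score inside the database: no data leaves Vertica, so there is no down-sampling
# and the scoring run is replicable against the full data set.
SCORE = """
SELECT sqft,
       bedrooms,
       PREDICT_LINEAR_REG(sqft, bedrooms USING PARAMETERS model_name='price_model')
           AS predicted_price
FROM house_prices
LIMIT 5
"""

conn = vertica_python.connect(**conn_info)
try:
    cur = conn.cursor()
    cur.execute(TRAIN)
    cur.execute(SCORE)
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```

The same pattern extends to the import path she describes, where a PMML or TensorFlow model trained outside the database is brought in and scored against all of the data with SQL.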
ENTITIES
Entity | Category | Confidence |
---|---|---|
David | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
September of 2019 | DATE | 0.99+ |
Joe Gonzalez | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
2007 | DATE | 0.99+ |
Dell | ORGANIZATION | 0.99+ |
Joy King | PERSON | 0.99+ |
Joe | PERSON | 0.99+ |
Joy | PERSON | 0.99+ |
Uber | ORGANIZATION | 0.99+ |
2018 | DATE | 0.99+ |
Boston | LOCATION | 0.99+ |
Vertical Academy Academy | ORGANIZATION | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
seven days | QUANTITY | 0.99+ |
one terabyte | QUANTITY | 0.99+ |
python | TITLE | 0.99+ |
three notes | QUANTITY | 0.99+ |
Today | DATE | 0.99+ |
Hewlett Packard Enterprises | ORGANIZATION | 0.99+ |
ORGANIZATION | 0.99+ | |
BBC | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
100 terabytes | QUANTITY | 0.99+ |
Ian Mode | PERSON | 0.99+ |
six months ago | DATE | 0.99+ |
Python | TITLE | 0.99+ |
first release | QUANTITY | 0.99+ |
1st 2nd | QUANTITY | 0.99+ |
three year | QUANTITY | 0.99+ |
Mass Mutual | ORGANIZATION | 0.99+ |
Eight | QUANTITY | 0.99+ |
next year | DATE | 0.99+ |
Stone Breaker | PERSON | 0.99+ |
first | QUANTITY | 0.99+ |
one | QUANTITY | 0.98+ |
America 10 | TITLE | 0.98+ |
King | PERSON | 0.98+ |
today | DATE | 0.98+ |
four terabyte | QUANTITY | 0.97+ |
John Mode | PERSON | 0.97+ |
Haas | PERSON | 0.97+ |
yesterday | DATE | 0.97+ |
first verdict | QUANTITY | 0.96+ |
one place | QUANTITY | 0.96+ |
s three | COMMERCIAL_ITEM | 0.96+ |
single | QUANTITY | 0.95+ |
first thing | QUANTITY | 0.95+ |
One | QUANTITY | 0.95+ |
both | QUANTITY | 0.95+ |
Tensorflow | TITLE | 0.95+ |
Hadoop | TITLE | 0.95+ |
third trend | QUANTITY | 0.94+ |
MPP Columbia | ORGANIZATION | 0.94+ |
Hadoop | PERSON | 0.94+ |
last last year | DATE | 0.92+ |
three big trends | QUANTITY | 0.92+ |
vertical 10 | TITLE | 0.92+ |
two public clouds | QUANTITY | 0.92+ |
Pure Storage Accelerate conference | EVENT | 0.91+ |
Andy | PERSON | 0.9+ |
few years ago | DATE | 0.9+ |
next day | DATE | 0.9+ |
Mutual | ORGANIZATION | 0.9+ |
Mode | PERSON | 0.89+ |
telco | ORGANIZATION | 0.89+ |
three big | QUANTITY | 0.88+ |
eon | TITLE | 0.88+ |
Verdict | PERSON | 0.88+ |
three separate data | QUANTITY | 0.88+ |
Cube | COMMERCIAL_ITEM | 0.87+ |
petabytes | QUANTITY | 0.87+ |
Google Cloud | TITLE | 0.86+ |
Larry Lancaster, Zebrium | Virtual Vertica BDC 2020
>> Announcer: It's theCUBE! Covering the Virtual Vertica Big Data Conference 2020 brought to you by Vertica. >> Hi, everybody. Welcome back. You're watching theCUBE's coverage of the Vertica Virtual Big Data Conference. It was, of course, going to be in Boston at the Encore Hotel. Win big with big data with the new casino but obviously Coronavirus has changed all that. Our hearts go out and we are empathy to those people who are struggling. We are going to continue our wall-to-wall coverage of this conference and we're here with Larry Lancaster who's the founder and CTO of Zebrium. Larry, welcome to theCUBE. Thanks for coming on. >> Hi, thanks for having me. >> You're welcome. So first question, why did you start Zebrium? >> You know, I've been dealing with machine data a long time. So for those of you who don't know what that is, if you can imagine servers or whatever goes on in a data center or in a SAS shop. There's data coming out of those servers, out of those applications and basically, you can build a lot of cool stuff on that. So there's a lot of metrics that come out and there's a lot of log files that come. And so, I've built this... Basically spent my career building that sort of thing. So tools on top of that or products on top of that. The problem is that since at least log files are completely unstructured, it's always doing the same thing over and over again, which is going in and understanding the data and extracting the data and all that stuff. It's very time consuming. If you've done it like five times you don't want to do it again. So really, my idea was at this point with machine learning where it's at there's got to be a better way. So Zebrium was founded on the notion that we can just do all that automatically. We can take a pile of machine data, we can turn it into a database, and we can build stuff on top of that. And so the company is really all about bringing that value to the market. >> That's cool. I want to get in to that, just better understand who you're disrupting and understand that opportunity better. But before I do, tell us a little bit about your background. You got kind of an interesting background. Lot of tech jobs. Give us some color there. >> Yeah, so I started in the Valley I guess 20 years ago and when my son was born I left grad school. I was in grad school over at Berkeley, Biophysics. And I realized I needed to go get a job so I ended up starting in software and I've been there ever since. I mean, I spent a lot of time at, I guess I cut my teeth at Nedap, which was a storage company. And then I co-founded a business called Glassbeam, which was kind of an ETL database company. And then after that I ended up at Nimble Storage. Another company, EMC, ended up buying the Glassbeam so I went over there and then after Nimble though, which where I build the InfoSight platform. That's where I kind of, after that I was able to step back and take a year and a half and just go into my basement, actually, this is my kind of workspace here, and come up with the technology and actually build it so that I could go raise money and get a team together to build Zebrium. So that's really my career in a nutshell. >> And you've got Hello Kitty over your right shoulder, which is kind of cool >> That's right. >> And then up to the left you got your monitor, right? >> Well, I had it. It's over here, yeah. >> But it was great! Pull it out, pull it out, let me see it. So, okay, so you got that. So what do you do? You just sit there and code all night or what? 
>> Yeah, that's right. So Hello Kitty's over here. I have a daughter and she setup my workspace here on this side with Hello Kitty and so on. And over on this side, I've got my recliner where I basically lay it all the way back and then I pivot this thing down over my face and put my keyboard on my lap and I can just sit there for like 20 hours. It's great. Completely comfortable. >> That's cool. All right, better put that monitor back or our guys will yell at me. But so, obviously, we're talking to somebody with serious coding chops and I'll also add that the Nimble InfoSight, I think it was one of the best pick ups that HP, HPE, has had in a while. And the thing that interested me about that, Larry, is the ability that the company was able to take that InfoSight and poured it very quickly across its product lines. So that says to me it was a modern, architecture, I'm sure API, microservices, and all those cool buzz words, but the proof is in their ability to bring that IP to other parts of the portfolio. So, well done. >> Yeah, well thanks. Appreciate that. I mean, they've got a fantastic team there. And the other thing that helps is when you have the notion that you don't just build on top of the data, you extract the data, you structure it, you put that in a database, we used Vertica there for that, and then you build on top of that. Taking the time to build that layer is what lets you build a scalable platform. >> Yeah, so, why Vertica? I mean, Vertica's been around for awhile. You remember you had the you had the old RDBMS, Oracles, Db2s, SQL Server, and then the database was kind of a boring market. And then, all of a sudden, you had all of these MPP companies came out, a spade of them. They all got acquired, including Vertica. And they've all sort of disappeared and morphed into different brands and Micro Focus has preserved the Vertica brand. But it seems like Vertica has been able to survive the transitions. Why Vertica? What was it about that platform that was unique and interested you? >> Well, I mean, so they're the first fund to build, what I would call a real column store that's kind of market capable, right? So there was the C-Store project at Berkeley, which Stonebreaker was involved in. And then that became sort of the seed from which Vertica was spawned. So you had this idea of, let's lay things out in a columnar way. And when I say columnar, I don't just mean that the data for every column is in a different set of files. What I mean by that is it takes full advantage of things like run length and coding, and L file and coding, and block--impression, and so you end up with these massive orders of magnitude savings in terms of the data that's being pulled off of storage as well as as it's moving through the pipeline internally in Vertica's query processing. So why am I saying all this? Because it's fundamentally, it was a fundamentally disruptive technology. I think column stores are ubiquitous now in analytics. And I think you could name maybe a couple of projects which are mostly open source who do something like Vertica does but name me another one that's actually capable of serving an enterprise as a relational database. I still think Vertica is unique in being that one. >> Well, it's interesting because you're a startup. And so a lot of startups would say, okay, we're going with a born-in-the-cloud database. Now Vertica touts that, well look, we've embraced cloud. You know, we have, we run in the cloud, we run on PRAM, all different optionality. 
>> Well, it's interesting because you're a startup. And so a lot of startups would say, okay, we're going with a born-in-the-cloud database. Now Vertica touts that, well look, we've embraced cloud. You know, we have, we run in the cloud, we run on-prem, all different optionality. And you hear a lot of vendors say that, but a lot of times they're just taking their stack and stuffing it into the cloud. But, so why didn't you go with a cloud-native database, and is Vertica able to, I mean, obviously, that's why you chose it, but I'm interested from a technologist standpoint as to why you, again, made that choice given all these other choices around there. >> Right, I mean, again, I'm not, so... As I explained, a column store, which I think is the appropriate definition, I'm not aware of another cloud-native-- >> Hm, okay. >> I'm aware of other cloud-native transactional databases, I'm not aware of one that has the analytics performance, and I've tried some of them. So it was not like I didn't look. What I was actually impressed with, and I think what let me move forward using Vertica in our stack, is the fact that Eon really is built from the ground up to be cloud-native. And so we've been using Eon almost ever since we started the work that we're doing. So I've been really happy with the performance and with the reliability of Eon. >> It's interesting. I've been saying for years that Vertica's a diamond in the rough and its previous owner didn't know what to do with it because it got distracted, and now Micro Focus seems to really see the value and is obviously putting some investments in there. >> Yeah >> Tell me more about your business. Who are you disrupting? Are you kind of disrupting the do-it-yourself? Or is there sort of a big whale out there that you're going to go after? Add some color to that. >> Yeah, so our broader market is monitoring software, that's kind of the high-level category. So you have a lot of people in that market right now. Some of them are entrenched, large players, like Datadog would be a great example. Some of them are smaller upstarts. It's a pretty saturated market. But what's happened over the last, I'd say two years, is that there's been sort of a push towards what's called observability in terms of at least how some of the products are architected, like Honeycomb, and how some of them are messaged. Most of them are messaged that way these days. And what that really means is there's been sort of an understanding that's developed that MTTR is really what people need to focus on to keep their customers happy. If you're a SaaS company, MTTR is going to be your bread and butter. And it's still measured in hours and days. And the biggest reason for that is because of what's called unknown unknowns. Because of complexity. Nowadays, applications are ten times as complex as they used to be. And what you end up with is a situation where if something is new, if it's a known issue with a known symptom and a known root cause, then you can set up an automation for it. But the ones that really cost a lot of time in terms of service disruption are unknown unknowns. And now you've got to go dig into this massive mass of data. So observability is about making tools to help you do that, but it's still going to take you hours. And so our contention is, you need to automate the eyeball. The bottleneck is now the eyeball. And so you have to get away from this notion that a person's going to be able to do it infinitely more efficiently, and recognize that you need automated help. When you get an alert, it shouldn't be that, "Hey, something weird's happening. Now go dig in." It should be, "Here's a root cause and a symptom." And that should be proposed to you by a system that actually does the observing. That actually does the watching.
And that's what Zebrium does. >> Yeah, that's awesome. I mean, you're right. The last thing you want is just another alert that says, "Go figure something out because there's a problem." So how does it work, Larry? In terms of what you built there, can you take us inside the covers? >> Yeah, sure. So right now there's two kinds of data that we're ingesting. There's metrics and there's log files. Metrics, there's actually sort of a framework that's really popular in DevOps circles especially, but it's becoming popular everywhere, which is called Prometheus. And it's a way of exporting metrics so that scrapers can collect them. And so if you go look at a typical stack, you'll find that most of the open source components and many of the closed source components are going to have exporters that export all their stats to Prometheus. So by supporting that stack we can bring in all of those metrics. And then there's also the log files. And so you've got host log files, in a containerized environment you've got container logs, and you've got application-specific logs, perhaps living on a host mount. And you want to pull all those back, and you want to be able to associate this log that I've collected here with the same container on the same host that this metric is associated with. But now what? So once you've got that, you've got a pile of unstructured logs. So what we do is we take a look at those logs and we say, let's structure those into tables, right? So where I used to have a log message, if I look in my log file and I see it says something like, X happened five times, right? Well, that event type's going to occur again and it'll say, X happened six times or X happened three times. So if I see that as a human being, I can say, "Oh clearly, that's the same thing." And what's interesting here is the times that X happened, and what this number read. I may want to know the numbers over time, as a time series, the values of that column. And so you can imagine it as a table. So now I have a table for that event type and every time it happens, I get a row. And then I have a column with that number in it. And so now I can do any kind of analytics I want almost instantly across my... If I have all my event types structured that way, everything changes. You can do real anomaly detection and incident detection on top of that data. So that's really how we go about doing it. How we go about being able to do autonomous monitoring in a way that's effective. >> How do you handle doing that for, like, a bespoke app? Do you have to, does somebody have to build a connector to those apps? How do you handle that? >> Yeah, that's a really good question. So you're right. So if I go and install a typical log manager, there'll be connectors for different apps, and usually what that means is pulling in the stuff on the left, if you were to be looking at that log line, and it will be things like a timestamp, or a severity, or a function name, or various other things. And so the connector will know how to pull those apart, and then the stuff to the right will be considered the message and that'll get indexed for search. And so our approach is we actually go in with machine learning and we structure that whole thing. So there's a table. And it's going to have a column called severity, and timestamp, and function name. And then it's going to have columns that correspond to the parameters that are in that event. And it'll have a name associated with the constant parts of that event.
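A toy Python sketch of the structuring idea Larry describes here follows. It is not Zebrium's actual algorithm; the log format, the regex, and the event naming are invented purely for illustration.

```python
# Toy sketch: turn repeated, unstructured log lines into an "event type" table
# with typed columns. Numbers are replaced by a placeholder to find the constant
# "skeleton" of the event; the skeleton names the table, the numbers become rows.

import re
from collections import defaultdict
from datetime import datetime

LOG_LINES = [
    "2020-03-30T00:01:07Z db-1 checkpoint completed in 412 ms",
    "2020-03-30T00:02:09Z db-1 checkpoint completed in 388 ms",
    "2020-03-30T00:03:15Z db-1 checkpoint completed in 2931 ms",
]

def split_event(line):
    """Return (event skeleton, (timestamp, host, numeric parameters))."""
    timestamp, host, message = line.split(" ", 2)
    skeleton = re.sub(r"\d+", "<num>", message)            # constant part of the event
    params = [int(n) for n in re.findall(r"\d+", message)]  # variable parameters
    return skeleton, (datetime.fromisoformat(timestamp.rstrip("Z")), host, params)

tables = defaultdict(list)                                  # one "table" per event type
for line in LOG_LINES:
    skeleton, row = split_event(line)
    tables[skeleton].append(row)

for event_type, rows in tables.items():
    print(event_type)
    for ts, host, params in rows:
        print(" ", ts, host, params)

# Each occurrence is now a row and the parameter is a time series, so the
# 2931 ms checkpoint stands out as an obvious candidate for anomaly detection.
```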
And so you end up with a situation where you've structured all of it automatically, so we don't need collectors. It'll work just as well on your home-grown app that has no collectors or parsers defined or anything. It'll work immediately, just as well as it would work on anything else. And that's important, because you can't be asking people for connectors to their own applications. It just becomes, now they've got to stop what they're doing and go write code for you, for your platform, and they have to maintain it. It's just untenable. So you can be up and running with our service in three minutes. It'll just be monitoring those for you. >> That's awesome! I mean, that is really a breakthrough innovation. So, nice. Love to see that hittin' the market. Who do you sell to? Both types of companies and what role within the company? >> Well, definitely there's two main sort of pushes that we've seen, or I should say pulls. One is from DevOps folks, SRE folks. So these are people who are tasked with monitoring an environment, basically. And then you've got people who are in engineering and they have a staging environment. And what they actually find valuable is... Because when we find an incident in a staging environment, yeah, half the time it's because they're tearing everything up and it's not release ready, whatever's in stage. That's fine, they know that. But the other half the time it's new bugs, it's issues, and they're finding issues. So it's kind of diverged. You have engineering users, and they don't have titles like QA, they're Dev engineers or Dev managers, that are really interested. And then you've got DevOps and SRE people there (mumbles). >> And how do I consume your product? Is it SaaS? I sign up and you say within three minutes I'm up and running. I'm paying by the drink. >> Well, (laughs) right. So there's a couple ways. So, right. So the easiest way is if you use Kubernetes. So Kubernetes is what's called a container orchestrator. So these days, you know, Docker and containers and all that, so now container orchestrators have become, I wouldn't say ubiquitous, but they're very popular now. So it's kind of on that inflection curve. I'm not exactly sure of the penetration, but I'm going to say probably 30-40% of the shops that we're interested in are using container orchestrators. So if you're using Kubernetes, basically you can install our Kubernetes chart, which basically means copying and pasting a URL and so on into your little admin panel there. And then it'll just start collecting all the logs and metrics, and then you just log in on the website. And the way you do that is just go to our website and it'll show you how to sign up for the service, and you'll get your little API key and a link to the chart and you're off and running. You don't have to do anything else. You can add rules, you can add stuff, but you don't have to. You shouldn't have to, right? You should never have to do any more work. >> That's great. So it's a SaaS capability and I just pay for... How do you price it? >> Oh, right. So it's priced on volume, data volume. I don't want to go too much into it because I'm not the pricing guy. But what I'll say is that, as far as I know, it's as cheap or cheaper than any other log manager or metrics product. It's in that same neighborhood as the very low priced ones. Because right now, we're not trying to optimize for take. We're trying to make a healthy margin and get the value of autonomous monitoring out there. Right now, that's our priority.
>> And it's running in the cloud, is that right? AWS West-- >> Yeah, that's right. Oh, I should've also pointed out that you can have a free account; if it's less than some number of gigabytes a day we're not going to charge. Yeah, so we run in AWS. We have a multi-tenant instance in AWS. And we have a Vertica Eon cluster behind that. And it's been working out really well. >> And on your freemium, have you used the Vertica Community Edition? Because they don't charge you for that, right? So is that how you do it or... >> No, no. We're, no, no. So, I don't want to go into that because I'm not the bizdev guy. But what I'll say is that if you're doing something that winds up being OEM-ish, you can work out the particulars with Vertica. It's not like you're going to just go pay retail and they won't let you distinguish between test, and prod, and paid, and all that. They'll work with you. Just call 'em up. >> Yeah, and that's why I brought it up because Vertica, they have a community edition, which is not neutered. It runs Eon, it's just there's limits on clusters and storage. >> There's limits. >> But it's still fully functional though. >> So to your point, we want it multi-tenant. So it's big just because it's multi-tenant. We have hundreds of users on that (audio cuts out). >> And then, what's your partnership with Vertica like? Can we close on that and just describe that a little bit? >> What's it like? I mean, it's pleasant. >> Yeah, I mean (mumbles). >> You know what, so the important thing... Here's what's important. What's important is that I don't have to worry about that layer of our stack. When it comes to being able to get the performance I need, being able to get the economy of scale that I need, being able to get the absolute scale that I need, I've not been disappointed ever with Vertica. And frankly, being able to have ACID guarantees and everything else, like a normal mature database that can join lots of tables and still be fast, that's also necessary at scale. And so I feel like it was definitely the right choice to start with. >> Yeah, it's interesting. I remember in the early days of big data a lot of people said, "Who's going to need these ACID properties and all this complexity of databases." And of course, ACID properties and SQL became the killer features and functions of these databases. >> Who didn't see that one coming, right? >> Yeah, right. And then, so you guys have done a big seed round. You've raised a little over $6 million and you've got the product-market fit down. You're ready to rock, right? >> Yeah, that's right. So we're doing a launch probably, well, when this airs it'll probably be the day before this airs. Basically, yeah. We've got people... Like literally in the last, I'd say, six to eight weeks, it's just been this sort of peak of interest. All of a sudden, everyone kind of gets what we're doing, realizes they need it, and we've got a solution that seems to meet expectations. So it's like... It's been an amazing... Let me just say this, it's been an amazing start to the year. I mean, at the same time, it's been really difficult for us, but more difficult for some other people that haven't been able to go to work over the last couple of weeks and so on. But it's been a good start to the year, at least for our business. So... >> Well, Larry, congratulations on getting the company off the ground and thank you so much for coming on theCUBE and being part of the Virtual Vertica Big Data Conference. >> Thank you very much.
>> All right, and thank you everybody for watching. This is Dave Vellante for theCUBE. Keep it right there. We're covering wall-to-wall Virtual Vertica BDC. You're watching theCUBE. (upbeat music)
Ron Cormier, The Trade Desk | Virtual Vertica BDC 2020
>> Announcer: It's theCUBE covering the Virtual Vertica Big Data Conference 2020 brought to you by Vertica. >> Hello, everybody, welcome to this special digital presentation of theCUBE. We're tracking the Vertica Virtual Big Data Conference. This is, I think, theCUBE's fifth year doing the BDC. We've been to every Big Data Conference that they've held, and we're really excited to be helping with the digital component here in these interesting times. Ron Cormier is here, principal database engineer at the Trade Desk. Ron, great to see you. Thanks for coming on. >> Hi, David, my pleasure, good to see you as well. >> So we were talking a little bit about your background. You're basically a Vertica and database guru, but tell us about your role at Trade Desk, and then I want to get into a little bit about what Trade Desk does. >> Sure, so I'm a principal database engineer at the Trade Desk. The Trade Desk was one of my customers when I was working at HP, as a member of the Vertica team, and I joined the Trade Desk in early 2016. And since then, I've been working on building out their Vertica capabilities and expanding the data warehouse footprint in an ever-growing database technology, data volume environment. >> And the Trade Desk is an ad tech firm and you are specializing in real time ad serving and pricing. And I guess, real time, you know, people talk about real time a lot; we define real time as before you lose the customer. Maybe you can talk a little bit about, you know, the Trade Desk and the business and maybe how you define real time. >> Totally, so to give everybody kind of a frame of reference: anytime you pull up your phone or your laptop and you go to a website or you use some app and you see an ad, what's happening behind the scenes is an auction taking place. And people are bidding on the privilege to show you an ad. And across the open Internet, this happens seven to 13 million times per second. And so the ads, the whole auction dynamic and the display of the ad needs to happen really fast. So that's about as real time as it gets outside of high frequency trading, as far as I'm aware. So the Trade Desk participates in those auctions, we bid on behalf of our customers, which are ad agencies, and the agencies represent brands, so the agencies are the Mad Men companies of the world, and they have brands under their guidance, and so they give us budget to spend, to place the ads and to display them. So we bid on hundreds of thousands of auctions per second. Once we make those bids, anytime we do make a bid some data flows into our data platform, which is powered by Vertica. And so we're getting hundreds of thousands of events per second. We have other events that flow into Vertica as well. And we clean them up, we aggregate them, and then we run reports on the data. And we run about 40,000 reports per day on behalf of our customers. The reports aren't as real time as I was talking about earlier, they're more batch oriented. Our customers like to see big chunks of time, like a whole day or a whole week or a whole month on a single report. So we wait for that time period to complete and then we run the reports on the results. >> So you have one of the largest commercial infrastructures in the big data sphere. Paint a picture for us. I understand you've got a couple of, like, 320-node clusters; we're talking about petabytes of data. But describe what your environment looks like. >> Sure, so like I said, we've been very good customers for a while.
And we started out with a bunch of enterprise clusters. So the Enterprise Mode is the traditional Vertica deployment where the compute and the storage are tightly coupled, all RAID arrays on the servers. And we had four of those and we were doing okay, but our volumes are ever increasing, we wanted to store more data, and we wanted to run more reports in a shorter period of time, so we had to keep pushing. And so we had these four clusters and then we started talking with Vertica about Eon mode, and that's Vertica's separation of compute and storage, where the compute and the storage can be scaled independently; we can add storage without adding compute, or vice versa, or we can add both. So that was something that we were very interested in for a couple reasons. One, our enterprise clusters, we were running out of disk, and adding disk is expensive. In Enterprise Mode, it's kind of a pain, you've got to add compute at the same time, so you kind of end up in an unbalanced place. So in Eon mode that problem gets a lot better. We can add disk, infinite disk, because it's backed by S3. And we can add compute really easily to scale the number of things that we run in parallel, the concurrency; just add a sub cluster. So they're in US East and US West of Amazon, so reasonably diverse. And the real benefit is that we can stop nodes when we don't need them. Our workload is fairly lumpy, I call it. Like, we do the ingest and the aggregation all day, but after the day completes, the final hour of ingest and aggregation needs to be completed. And then once that's done, then the number of reports that we need to run spikes up, it goes really high. And we run those reports, we spin up a bunch of extra compute on the fly, run those reports and then spin them down. And we don't have to pay for that for the rest of the day. So Eon has been a nice boon for us for both those reasons. >> I'd love to explore Eon a little bit more. I mean, it's relatively new, I think in 2018 Vertica announced Eon mode, so it's only been out there a couple years. So I'm curious, for the folks that haven't moved to Eon mode, and presumably they want to, for the same reasons that you mentioned, why buy the storage in chunks if you don't have to, what were some of the challenges that you faced in going to Eon mode? What kind of things did you have to prepare for? Were there any out-of-scope expectations? Can you share that experience with us? >> Sure, so we were an early adopter. We participated in the beta program. I mean, I think it's fair to say we actually drove the requirements in a lot of ways because we approached Vertica early on. So the challenges were what you'd expect any early adopter to be going through, sort of getting things working as expected. I mean, there's a number of cases which I could touch upon. Like, we found an inefficiency in the way that it accesses the data on S3; it was accessing the data too frequently, which ended up just being expensive. So our S3 bill went up pretty significantly for a couple of months. So that was a challenge, but we worked through that. Another area where we recently made huge strides with Vertica was the ability to stop and start nodes, and have them start very quickly, and when they start, to not interfere with any running queries. So when we want to spin up a bunch of compute, there was a point in time when it would break certain queries that were already running.
So that was a challenge. But again, the Vertica team has been quite responsive to solving these issues and now that's behind us. In terms of those who need to get started, or are looking to get started, there's a number of things to think about. Off the top of my head, there are sort of new configuration items that you'll want to think about, like the instance type. So certainly Amazon has a variety of instances, and it's important to consider one of Vertica's architectural advantages in this area: Vertica has this caching layer on the instances themselves. And what that does is, if we can keep the data in cache, what we've found is that the performance is basically the same as the performance of Enterprise Mode. So having a good-size cache when you need it is important. So we went with the i3 instance types, which have a lot of local NVMe storage, so we can cache data and get good performance. That's one thing to think about. The number of nodes, the instance type, certainly the number of shards is a sort of technical item that needs to be considered. It's how the data gets distributed. It's sort of a layer on top of the segmentation that some Vertica engineers will be familiar with. And probably, I mean, one of the big things that one needs to consider is how to get data into the database. So if you have an existing database, there's no sort of nice tool yet to suck all the data into an Eon database, and so I think they're working on that. But we got there. We exported all our data out of the enterprise cluster, dumped it out to S3, and then we had the Eon cluster suck that data in.
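A hypothetical sketch of that export-and-reload step, using the vertica-python client, is below. The hosts, credentials, bucket, and table names are placeholders, and the exact EXPORT TO PARQUET and COPY options vary by Vertica version and S3 authentication setup, so treat this as an outline rather than a recipe.

```python
# Hypothetical outline of the Enterprise-to-Eon data move Ron describes:
# dump a table to S3 from the old cluster, then load it into the Eon cluster.
# All names and credentials are placeholders; verify syntax against your
# Vertica version's documentation before using.

import vertica_python

ENTERPRISE = {"host": "enterprise-node1.example.com", "port": 5433,
              "user": "dbadmin", "password": "...", "database": "dw"}
EON = {"host": "eon-node1.example.com", "port": 5433,
       "user": "dbadmin", "password": "...", "database": "dw"}

TABLE = "reporting.bid_events"
STAGE = "s3://my-migration-bucket/bid_events"   # placeholder staging location

# 1. Dump the table out of the Enterprise cluster into S3 as Parquet.
with vertica_python.connect(**ENTERPRISE) as conn:
    conn.cursor().execute(
        f"EXPORT TO PARQUET(directory = '{STAGE}') AS SELECT * FROM {TABLE};")

# 2. Load it into the Eon cluster from the same S3 location.
with vertica_python.connect(**EON) as conn:
    conn.cursor().execute(f"COPY {TABLE} FROM '{STAGE}/*' PARQUET;")
```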
>> So, awesome advice. Thank you for sharing that with the community. But at the end of the day, it sounds like you had some learning to do, some tweaking to do, and obviously had to figure out how to get the data in. At the end of the day, was it worth it? What was the business impact? >> Yeah, it definitely was worth it for us. I mean, so right now, we have four times the data in our Eon cluster that we have in our enterprise clusters. We still run some enterprise clusters. We started with four at the peak. Now we're down to two. So we have the two Eon clusters. So it's been, I think our business would say it's been a huge win. Like, we're doing things that we really never could have done before; like, accessing that data on enterprise would have been really difficult. It would have required non-trivial engineering to do things like daisy chaining clusters together, and then figuring out how to aggregate data across clusters, which would, again, be non-trivial. So we have all the data we want, we can continue to grow data, we're running reports on seasonality. So our customers can compare their campaigns last year versus this year, which is something we just haven't been able to do in the past. We've expanded that. So we grew the data vertically, we've expanded the data horizontally as well. So we were adding columns to our aggregates. We are enriching the data much more than we have in the past. So while we still have enterprise kicking around, I'd say our Eon clusters are doing the majority of the heavy lifting. >> And the cloud was part of the enablement here, particularly with scale, is that right? And are you running certain... >> Definitely. >> And are you running on-prem as well, or are you in a hybrid mode? Or is it all AWS? >> Great question, so yeah. When I've been speaking about enterprise, I've been referring to on-prem. So we have physical machines in data centers. So yeah, we are running a hybrid now, and I mean, it's really hard to get, like, an apples-to-apples direct comparison of enterprise on-prem versus Eon in the cloud. One thing that I touched upon in my presentation is, if I try to get apples to apples and think about how I would run the entire workload on enterprise or on Eon, if I had to run the entire thing on one or the other, I tried to think about how many CPU cores we would need to do that. And basically, it would be about the same number of cores, I think, for enterprise on-prem versus Eon in the cloud. However, the Eon nodes only need to be running about six hours out of the day. So the other 18 hours I can shut them down and not be paying for them, mostly. >> Interesting, okay, and so, I've got to ask you, I mean, notwithstanding the fact that you've got a lot invested in Vertica, and a lot of experience there, there are a lot of, you know, emerging cloud databases. Did you look? I mean, you know a lot about databases, not just Vertica; you're a database guru in many areas, you know, traditional RDBMS, as well as MPP and new cloud databases. What is it about Vertica that works for you in this specific sweet spot that you've chosen? What's really the difference there? >> Yeah, so I think the key difference is the maturity. I am familiar with a number of other database platforms in the cloud and otherwise, column stores specifically, that don't have the maturity that we're used to and that we need at our scale. So being able to specify alternate projections, so different sort orders on my data, is huge. And there's other platforms where we don't have that capability. And so Vertica is, of course, the original column store and they've had time to build up a lead in terms of their maturity and features, and I think that other column stores, cloud or otherwise, are playing a little bit of catch-up in that regard. Of course, Vertica is playing catch-up on the cloud side. But if I had to pick whether I wanted to write a column store from scratch, or a distributed file system, like a cloud file system, from scratch, I'd probably think it would be easier to write the cloud file system. The column store is where the real smarts are. >> Interesting, let's talk a little bit about some of the challenges you have in reporting. You have a very dynamic nature of reporting; like I said, your clients want a time series, they don't just want a snapshot of a slice. But at the same time, your reporting is probably pretty lumpy, a very dynamic, you know, demand curve. So first of all, is that accurate? Can you describe that sort of dynamism and how you're handling it? >> Yep, that's exactly right. It is lumpy. And that's the exact word that I use. So like, at the end of the UTC day, when UTC midnight rolls around, that's when we do the final ingest, the final aggregate, and then the queue for the number of reports that need to run spikes. So the majority of those 40,000 reports that we run per day are run in the four to six hours after that. It spikes up. And so that's when we need to have all the compute come online. And that's what helps us answer all those queries as fast as possible. And that's a big reason why Eon is an advantage for us, because the rest of the day we kind of don't necessarily need all that compute and we can shut it down and not pay for it.
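One hedged sketch of that spin-up-for-the-report-window pattern uses AWS's boto3 API to start and stop the extra compute nodes around the post-midnight spike. The instance IDs, region, and schedule are placeholders, and a real setup would also confirm the nodes have rejoined the Vertica cluster before queuing reports.

```python
# Hedged sketch of the "lumpy workload" pattern: run the report subcluster's
# EC2 instances for roughly the six-hour reporting window, stop them afterwards.
# Instance IDs and region are placeholders, not The Trade Desk's actual setup.

import boto3

REPORT_SUBCLUSTER_INSTANCES = ["i-0abc1234", "i-0def5678", "i-0ghi9012"]  # placeholders
ec2 = boto3.client("ec2", region_name="us-east-1")

def start_report_compute():
    """Start the extra nodes and wait until EC2 reports them running."""
    ec2.start_instances(InstanceIds=REPORT_SUBCLUSTER_INSTANCES)
    ec2.get_waiter("instance_running").wait(InstanceIds=REPORT_SUBCLUSTER_INSTANCES)

def stop_report_compute():
    """Stop the extra nodes so they stop accruing compute charges."""
    ec2.stop_instances(InstanceIds=REPORT_SUBCLUSTER_INSTANCES)

# A scheduler (cron, Airflow, etc.) would call start_report_compute() right
# after the UTC-midnight ingest finishes, drain the report queue, then call
# stop_report_compute() for the remaining ~18 hours of the day.
```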
>> So Ron, I wonder if you could share with us, just sort of to wrap here, where you want to take this. You're obviously very close to Vertica. Are you driving them hard on Eon mode? You mentioned before that the ability to load data into Eon mode would have been nice for you; I guess you're kind of over that hump. But what are the kinds of things, if Colin Mahony is here in the room, what are you telling him that you want the team, the engineering team at Vertica, to work on that would make your life better? >> I think the things that need the most attention sort of near term are just smoothing out some of the edges in terms of making it a little bit more seamless in terms of the cloud aspects to it. So our goal is to be able to start instances and have them join the cluster in less than five minutes. We're not quite there yet. If you look at some of the other cloud database platforms, they're beating that handily, so I know the team is working on that. Some of the other things are around control. Like I mentioned, while we like control in the column store, we also want control on the cloud side of things, in terms of being able to dedicate some clusters to specific workloads. We can pin workloads against a specific sub cluster and take advantage of the cache that's over there. We can say, okay, this resource pool. I mean, the sub cluster is a relatively new concept for Vertica. So being able to have control of many things at the sub cluster level: resource pools, configuration parameters, and so on. >> Yeah, so I mean, I personally have always been impressed with Vertica and their ability to sort of ride the waves, adopt new trends. I mean, they do have a robust stack. It's been around, you know, 10-plus years. They certainly embraced Hadoop, they're embracing machine learning, we've been talking about the cloud. So I actually have a lot of confidence in them, especially when you compare it to other sort of mid-last-decade MPP column stores that came out; you know, Vertica is one of the few remaining, certainly as an independent brand. So I think that speaks to the team there and the engineering culture. But give us your final word. Just final thoughts on your role, the company, Vertica, wherever you want to take it. >> Yeah, no, I mean, we're really appreciative and we value the partners that we have, and so I think it's been a win-win. Like, our volumes are... like, I know that we have some data that got pulled into their test suite. So I think it's been a win-win for both sides, and it'll be a win for other Vertica customers and prospects, knowing that they're working with some of the highest volume, velocity, variety data that (mumbles) >> Well, Ron, thanks for coming on. I wish we could have met face to face at the Encore in Boston. I think next year we'll be able to do that. But I appreciate that technology allows us to have these remote conversations. Stay safe, all the best to you and your family. And thanks again. >> My pleasure, David, good speaking with you. >> And thank you for watching, everybody. This is theCUBE's coverage of the Vertica Virtual Big Data Conference. I'm Dave Vellante. We'll be right back after this short break. (soft music)
Joe Gonzalez, MassMutual | Virtual Vertica BDC 2020
(bright music) >> Announcer: It's theCUBE. Covering the Virtual Vertica Big Data Conference 2020, brought to you by Vertica. >> Hello everybody, welcome back to theCUBE's coverage of the Vertica Big Data Conference, the Virtual BDC. My name is Dave Vellante, and you're watching theCUBE. And we're here with Joe Gonzalez, who is a Vertica DBA at MassMutual Financial. Joe, thanks so much for coming on theCUBE. I'm sorry that we can't be face to face in Boston, but at least we're being responsible. So thank you for coming on. >> (laughs) Thank you for having me. It's nice to be here. >> Yeah, so let's set it up. We'll talk a little bit about MassMutual. Everybody knows it's a big financial firm, but what's your role there and kind of your mission? >> So my role is Vertica DBA. I was hired in January of last year to come on and manage their Vertica cluster. They'd been on Vertica for probably about a year and a half before that; they started out on an on-prem cluster and then moved to AWS Enterprise Mode in the cloud, and brought me on just as they were considering transitioning over to Vertica's EON mode. And they didn't really have anybody dedicated to Vertica, nobody who really knew and understood the product. And I'd been working with Vertica for probably about six, seven years at that point. I was looking for something new and landed a really good opportunity here with a great company. >> Yeah, you have a lot of experience in Vertica. You had a role in market research, so you're a data guy, right? I mean that's really what you've been doing your entire career. >> I am. I've worked with Pitney Bowes in the postage industry, and I worked in healthcare auditing, after seven years in market research. And then I've been with MassMutual for a little over a year now, so yeah, quite a lot. >> So tell us a little bit about kind of what your objectives are at MassMutual, what you're kind of doing with the platform, what applications you're supporting; paint a picture for us if you would. >> Certainly, so my role is, MassMutual just decided to make Vertica its enterprise data warehouse. So they've really bought into Vertica. And we're moving all of our data there; probably a good 80, 90% of MassMutual's data is going to be on the Vertica platform, in EON mode. And we have a wide usage of that data across the corporation. Right now we're about 50 terabytes and growing quickly. And a wide variety of users. So there's a lot of ETLs coming in overnight, loading a lot of data, transforming a lot of data. And a lot of reporting tools are using it. So currently, Tableau, MicroStrategy. We have Alteryx using it, and we also have APIs running against it throughout the day, 24/7, with people coming in, especially now these days with, you know, some financial uncertainty going on. A lot of people are coming in and checking their 401k's, checking their insurance and status and whatnot. So we have to handle a lot of concurrent traffic on top of the normal big queries. So it's quite a diverse cluster. And I'm glad they're really investing in using Vertica as their overall solution for this. >> Yeah, I mean, these days your 401k is like this, right? (laughing) Afraid to look. So I wonder, Joe, if you could share with our audience.
I mean, for those who might not be as familiar with the history of Vertica, and specifically about MPP: historically you had, you know, traditional RDBMS, whether it's Db2 or Oracle, and then you had a spate of companies that came out with this notion of MPP. Vertica is, I think, probably one of the few, if not the only, brands that survived. But what did that bring to the industry, and why is that important for people to understand, just in terms of whatever it is: scale, performance, cost? Can you explain that? >> To me, it actually brought scale at good cost. And that's why I've been a big proponent of Vertica ever since I started using it. There's a number, like you said, of different platforms where you can load big data and store and house big data. But the purpose of having that big data is not just for it to sit there, but to be used, and used in a variety of ways. And that's from, you know, something small, like the first installation I was on, which was about 10 terabytes. And, you know, I've worked with data warehouses up to 100 terabytes, and, you know, there's Vertica installations with, you know, hundreds of petabytes on them. You want to be able to use that data, so you need a platform that's going to be able to access that data and get it to the clients, get it to the customers as quickly as possible, and not pay an arm and a leg for the privilege to do so. And Vertica allows companies to do that, not only get their data to clients and, you know, in-company users quickly, but save money while doing so. >> So, but so, why couldn't I just use a traditional RDBMS? Why not just throw it all into Oracle? >> One, cost. Oracle is very expensive, while Vertica's a lot more affordable than that. But the column-store structure of Vertica allows for a lot more optimized queries. Some of the queries that you can run in Vertica in 2, 3, 4 seconds will take minutes and sometimes hours in an RDBMS, like Oracle, like SQL Server. They have the capability to store that amount of data, no question, but the usability really lacks when you start querying tables that are 180 billion rows, or tables in Vertica that are over 1,000 columns. Those will take hours to run on a traditional RDBMS, and then running them in Vertica, I get my queries back in a sec. >> You know what's interesting to me, Joe, and I wonder if you could comment: it seems that Vertica has done a good job of embracing, you know, riding the waves, whether it was HDFS and big data in the early part of the big data era, or machine learning, machine intelligence, whether it's, you know, TensorFlow and other data science tools. And the cloud is the other one, right? A lot of times cloud is super disruptive, particularly to companies that started on-prem; it seems like Vertica somehow has been able to adopt and embrace some of these trends. Why, from your standpoint, first of all, from your standpoint as a customer, is that true? And why do you think that is? Is it architectural? Is it the mindset, the engineering? I wonder if you could comment on that. >> It's absolutely true. I started out, again, on an on-prem Vertica data warehouse, and we kind of, you know, rolled along with them. You know, more and more people have been using data; they want to make it accessible to people on the web now.
And you know, having the option to provide that data from an on-prem solution or from AWS is key, and now Vertica is offering even a hybrid solution, if you want to keep some of your data behind a firewall, on-prem, and put some in the cloud as well. So Vertica has absolutely evolved along with the industry in ways that no other company really has that I've seen. And I think the reason for it, and the reason I've stayed with Vertica, and specifically have remained a Vertica DBA for the last seven years, is because of the way Vertica stays in touch with its people. I've been working with the same people for the seven, eight years I've been using Vertica; they're family. I'm part of their family, and you know, I'm good friends with some of these people. And they really are in tune not only with the customer but with what they're doing. They really sit down with you and have those conversations about, you know, what are your needs? How can we make Vertica better? And they listen to their clients. You know, just having access to the data engineers who develop Vertica, being able to arrange a phone call or whatnot, I've never had that with any other company. Vertica makes that available to their customers when they need it. So the personal touch is huge for them. >> That's good. It's always good to get the confirmation from the practitioners, not just hear it from the vendor. I want to ask you about the EON transition. You mentioned that MassMutual brought you in to help with that. What were some of the challenges that you faced? And how did you get over them? And why EON? You know, what was the goal, the outcome, and some of the challenges maybe that you had to overcome? >> Right. So MassMutual had an interesting setup when I first came in. They had three different Vertica clusters to accommodate three different portions of their business. There was one for the data scientists, who use the data quite extensively in very large, very intense queries for their work with predictive analytics and whatnot. There was a separate one for the APIs, which needed, you know, sub-second query response times; in the enterprise solution, they weren't always able to get the performance they needed, because the fast queries were being overrun by the larger queries that needed more resources. And then they had a third for starting to develop this enterprise data platform and started, you know, looking into their future. The first challenge was, first of all, bringing all those three together, back into a single cluster, and allowing our users to have both the heavy queries and the API queries running at the same time, on the same platform, without having to completely separate them out onto different clusters. EON really helps with that because it allows us to store that data in the S3 communal storage, have the main cluster set up to run the heavy queries, and then you can set up sub clusters that still point to that S3 data but separate out the compute, so that the APIs really have their own resources to run and not be interfered with by the other processes.
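A minimal sketch of that workload-isolation idea follows: point the latency-sensitive API traffic at one Eon subcluster and the heavy reporting queries at another, while both read the same data in communal storage. The host names, credentials, and tables are placeholders, not MassMutual's actual setup, and the example assumes the vertica-python client.

```python
# Sketch of routing two workloads to different Eon subclusters that share the
# same communal storage. All connection details and table names are invented.

import vertica_python

COMMON = {"port": 5433, "user": "app_svc", "password": "...", "database": "edw"}

API_SUBCLUSTER = "api-subcluster-lb.example.com"        # sub-second lookups
REPORTING_SUBCLUSTER = "main-cluster-lb.example.com"    # heavy aggregates

def run_query(host, sql, params=None):
    """Run one query against the chosen subcluster and return all rows."""
    with vertica_python.connect(host=host, **COMMON) as conn:
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()

# API path: small, latency-sensitive lookup on the API subcluster.
balance = run_query(API_SUBCLUSTER,
                    "SELECT balance FROM accounts WHERE account_id = :acct",
                    {"acct": "A123"})

# Reporting path: big scan that should not interfere with API traffic.
rollup = run_query(REPORTING_SUBCLUSTER,
                   "SELECT product, SUM(balance) FROM accounts GROUP BY product")
```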
>> Okay, so I'm hearing a couple of things. One is you're sort of busting down data silos. So you're able to have a much more coherent view of your data, which I would imagine is critical, certainly. Companies like MassMutual have been around for 100 years, and so you've got all kinds of data dispersed. So to the extent that you can break down those silos, that's important, but also being able to, I guess, have granular increments of compute and storage is what I'm hearing. What does that do for you? Does it make things more efficient? Were there other business benefits? Maybe you could elucidate. >> Well, one, cost is again a huge benefit. The cost of running three different clusters, even in AWS, in the enterprise solution was a little costly; you know, you had to have your dedicated servers here and there. So you're paying for, like, you know, 12, 15 different servers, for example. Whereas bringing them all back into EON, I can run everything on a six-node production cluster. And you know, when things are busy, I can spin up the three-node sub cluster for the APIs, only pay for it when I need it, and then bring them back into the main cluster when things have slowed down a bit, and they can get that performance that they need. So that saves a ton on resource costs. You know, you're not paying for the storage, you're paying for one S3 bucket, you're only paying for the nodes, the EC2 instances that are up and running when you need them, and that is huge. And again, like you said, it gives us the ability to silo our data without having to completely separate our data into different storage areas. Which is a big benefit; it gives us the ability to query everything from one single cluster without having to synchronize it to, you know, three different ones. So this one's going to have theirs and this one's going to have theirs, but everyone's still looking at the same data. And we replicate that in QA and Dev so that people can work outside of production and do some testing as well. >> So EON, obviously a very important innovation. And of course, Vertica touts its separation of compute and storage, and you know, they're not the only one that does that, but they are really, I think, the only one that does it for on-prem and virtually across clouds. So my question is, and I think you're doing a breakout session at the Virtual BDC; we were going to be in Boston, now we're doing it online. If I'm in the audience, I'm imagining I'm a junior DBA at an organization that maybe doesn't have a Joe. I haven't been an expert for seven years. How hard is it for me, what do I need to do to get up to speed on EON? It sounds great, I want it. I'm going to save my company money, but I'm nervous 'cause I've only been a Vertica DBA for, you know, a year, and I'm sort of, you know, not as experienced as you. What are the things that I should be thinking about? Do I need to hire somebody? Do I need to bring in a consultant? Can I learn it myself? What would you advise? >> It's definitely easy enough that if you have at least a little bit of work experience, you can learn it yourself, okay? 'Cause the concepts are still there. There are some, you know, little bits of nuance where you do need to be aware of certain changes between the Enterprise and EON editions. But I would also say consult with your Vertica account manager, consult with, you know, let them bring in the right people from Vertica to help you get up to speed, and if you need to, there are also resources available as far as consultants go that will help you get up to speed very quickly. And we did work together with Vertica and with one of their partners, Clarity, in helping us to understand EON better, set it up the right way; you know, how do we pick the number of shards for our data warehouse?
You know, they helped us evaluate all that and pick the right number of shards, the right number of nodes to get set up and going. And, you know, they helped us figure out the best ways to get our data over from the Enterprise Edition into EON very quickly and very efficiently. So do the same yourself. >> I wanted to ask you about organizational, you know, issues, because, you know, guys like you, practitioners, always tell me, "Look, the technology comes and goes, that's kind of the easy part, we're good at that. It's the people, it's the processes, the skill sets." What does your, you know, team structure look like? And do you have any sort of ideal team makeup or, you know, ideal advice? Is it two-pizza teams? What kind of skills? What kind of interaction and communication with senior leadership? I wonder if you could just give us some color on that. >> One of the things that makes me extremely proud to be working for MassMutual right now is that they do what a lot of companies have not been doing, and that is investing in IT. They have put a lot of thought, a lot of money, and a lot of support into setting up their enterprise data platform and putting Vertica at the center. And not only did they put the money into getting the software that they needed, like Vertica, you know, MicroStrategy, and all the other tools that we're using to use that data, they put the money into the people. Our managers are extremely supportive of us. We hired about 40 to 45 different people within a four-month time frame: data engineers, data analysts, data modelers, a nice mix of people who can help shape your data and bring the data in and help the users use the data properly, and allow me as the database administrator to make sure that they're doing what they're doing most efficiently and to focus on my job. So you have to have that diversity among the different data skills in order to make your team successful. >> That's awesome. Kind of a side question, and it's really not Vertica's wheelhouse, but I'm curious, you know, in the early days of the big data movement, a lot of the data scientists would complain, and they still do, that "80% of my time is spent wrangling data." The tools for the data engineer, the data scientists, the database, you know, experts, they're all different. And is that changing? And to what degree is that changing? Kind of what inning are we in, just in terms of a more facile environment for all those roles? >> Again, I think it depends, company to company, you know, what resources they make available to the data scientists. And the data scientists, we have a lot of them at MassMutual. And they're very much into doing a lot of machine learning, model training, predictive analytics. And they are, you know, used to doing it outside of Vertica too, you know, pulling that data out into Python and Scala, Spark, and tools like that. And they're also now just getting into using Vertica's in-database analytics and machine learning, which is a skill that, you know, definitely not everybody out there has. So being able to have somebody who understands Vertica, like myself, and being able to train other people to use Vertica the way that is most efficient for them is key. But also just having people who understand not only the tools that you're using, but how to model data, how to architect your tables, your schemas, the interaction between your tables and schemas and whatnot; you need to have that diversity in order to make this work. And our data scientists have benefited immensely from the structure that MassMutual put in place with our data management and delivery team.
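For readers curious what the in-database machine learning Joe mentions looks like in practice, here is a hedged sketch of training and scoring a model entirely inside Vertica from Python. The table and column names are invented, and the LINEAR_REG / PREDICT_LINEAR_REG calls reflect recent Vertica releases as I recall them, so verify the exact signatures against the Vertica ML documentation for your version.

```python
# Hedged sketch of Vertica in-database ML: the model is trained and scored in
# SQL, so no training data leaves the database. All names are placeholders.

import vertica_python

conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "ds_user", "password": "...", "database": "edw"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Train a linear regression model inside the database.
    cur.execute("""
        SELECT LINEAR_REG('premium_model', 'ds.policy_training',
                          'annual_premium', 'age, coverage_amount')
    """)

    # Score a holdout table directly in SQL using the stored model.
    cur.execute("""
        SELECT policy_id,
               PREDICT_LINEAR_REG(age, coverage_amount
                                  USING PARAMETERS model_name = 'premium_model')
        FROM ds.policy_holdout
    """)
    print(cur.fetchmany(5))
```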
>> That's great. I think I saw, somewhere in your background, that you've trained about 100 people in Vertica. Did I get that right? >> Yes, since I started here, I've gone to our Boston location, our Springfield location, and our New York City location and trained, probably at this point, about 120 to 140 of our Vertica users. And I'm trying to do, you know, a couple of follow-up sessions per year. >> So adoption, obviously, is a big goal of yours. Getting people to adopt the platform, but then more importantly, I guess, deliver business value and outcomes. >> Absolutely. >> Yeah, I wanted to ask you about encryption. You know, in a perfect world, everything would be encrypted, but there are trade-offs. Are you using encryption? What are you doing in that regard? >> We are actually just getting into that now, due to the New York and CCPA regulations that are now in place. We do have a lot of Personally Identifiable Information in our data store that does require encryption. So we are going through a months-long process, that started in December, I think actually a bit earlier than that, to start identifying all the columns, not only in our Vertica database but in, you know, the other databases that we do use; you know, we have a Postgres database, SQL Server, Teradata for the time being, until that moves into Vertica. And identifying where that data sits, what downstream applications pull that data from the data sources and store it locally as well, and starting to encrypt that data. And because of the tight relationship between Voltage and Vertica, we settled on Voltage as the major platform to start doing that encryption. So we're going to be implementing that in Vertica probably within the next month or two, and roll it out to all the teams that have data that requires encryption. We're going to start rolling it out to the downstream application owners to make sure that they are encrypting the data as they get it pulled over. And we're also using another product for several other applications that don't mesh as well with both. >> Voltage being Micro Focus's encryption solution, correct? >> Right, yes. >> Yes, of course. Micro Focus, for the audience, is the company that owns Vertica, and Vertica is a separate brand. So I want to ask you, kind of to close, what success looks like. You've been at this for a number of years, coming into MassMutual, which was great to hear. I've had some past experience with MassMutual; it's an awesome company. I've been to the Springfield facility and to Boston as well, and I have great respect for them, and they've really always been a leader. So it's great to hear that they're investing in technology as a differentiator. What does success look like for you? Let's say you're at MassMutual for a few years, you're looking back; what does success look like? Go. >> A good question. It's changing every day, just, you know, with more and more, you know, applications coming onboard, more and more data being pulled in, more uses being found for the data that we have. I think success for me is making sure that Vertica, first of all, is always up and is always running at its most optimal to keep our users happy.
I think when I started, you know, we had a lot of processes that were running, you know, six, seven hours, some of them were taking, you know, almost a day long, because they were so complicated. We've got those running in under an hour now, some of them running in a matter of minutes. I want to keep that optimization going for all of our processes. Like I said, there's a lot of users using this data. And it's been hard over the first year of me being here to get to all of them. And thankfully, you know, I'm getting a bit of help now, I have a couple of system DBAs that I'm training up to help out with these optimizations, you know, fixing queries, fixing projections to make sure that queries do run as quickly as possible. So getting that to its optimal stage is one. Two, getting our data encrypted and protected so that even if, for whatever reason, somebody somehow breaks into our data, they're not going to be able to get anything at all, because our data is 100% protected. And I think more companies need to be focusing on that as well. And third, I want to see our data science teams using more and more of Vertica's in-database predictive analytics, in-database machine learning products, and really helping make their jobs more efficient by doing so. >> Joe, you're an awesome guest. I mean, we always, like I said, love having the practitioners on and getting the straight skinny from the pros. You're welcome back anytime, and as I say, I wish we could have met in Boston, maybe next year at the BDC. But it's great to have you online, and thanks for coming on theCUBE. >> And thank you for having me, and hopefully we'll meet next year. >> Yeah, I hope so. And thank you everybody for watching. Remember theCUBE is running concurrently with the Vertica Virtual BDC, it's vertica.com/bdc2020, if you want to check out all the keynotes and all the breakout sessions. I'm Dave Volante for theCUBE. We'll be going on with more interviews, so keep it right there. Thanks for watching. (bright music)
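For readers who want a concrete picture of the projection tuning Joe describes, here is a minimal sketch in Vertica SQL. The table and column names are hypothetical, not MassMutual's actual schema; the pattern is simply to inspect the plan, then add a projection whose sort order and segmentation match the query.

```sql
-- Hypothetical table and columns, for illustration only.
EXPLAIN
SELECT policy_id, SUM(premium)
FROM claims.policy_txn
WHERE txn_date >= '2020-01-01'
GROUP BY policy_id;

-- A projection sorted and segmented to suit that query pattern.
CREATE PROJECTION claims.policy_txn_by_policy
AS SELECT policy_id, txn_date, premium
   FROM claims.policy_txn
   ORDER BY policy_id, txn_date
   SEGMENTED BY HASH(policy_id) ALL NODES KSAFE 1;

-- Populate the new projection from data already in the table.
SELECT REFRESH('claims.policy_txn');
```

Database Designer, covered in the final session of this document, can generate this kind of projection automatically from a sample workload.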
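As a rough illustration of the column-level encryption workflow described in the interview above, the sketch below first hunts for likely PII columns in the catalog and then protects one of them with a Voltage SecureData UDx. The function name and parameters follow the Voltage-for-Vertica integration but should be treated as assumptions and checked against the installed version; the table, columns, and format are hypothetical.

```sql
-- Simple heuristic scan of the catalog for likely PII columns (illustrative only).
SELECT table_schema, table_name, column_name
FROM v_catalog.columns
WHERE column_name ILIKE '%ssn%'
   OR column_name ILIKE '%birth%'
   OR column_name ILIKE '%address%';

-- Protect a sensitive column with the Voltage SecureData UDx
-- (assumed function name and signature; verify in your environment).
SELECT member_id,
       VoltageSecureProtect(ssn USING PARAMETERS format='ssn') AS ssn_protected
FROM members.person
LIMIT 10;
```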
Keynote Analysis | Virtual Vertica BDC 2020
(upbeat music) >> Narrator: It's theCUBE, covering the Virtual Vertica Big Data Conference 2020. Brought to you by Vertica. >> Dave Vellante: Hello everyone, and welcome to theCUBE's exclusive coverage of the Vertica Virtual Big Data Conference. You're watching theCUBE, the leader in digital event tech coverage. And we're broadcasting remotely from our studios in Palo Alto and Boston. And, we're pleased to be covering wall-to-wall this digital event. Now, as you know, originally BDC was scheduled this week at the new Encore Hotel and Casino in Boston. Their theme was "Win big with big data". Oh sorry, "Win big with data". That's right, got it. And, I know the community was really looking forward to that, you know, meet up. But look, we're making the best of it, given these uncertain times. We wish you and your families good health and safety. And this is the way that we're going to broadcast for the next several months. Now, we want to unpack Colin Mahony's keynote, but, before we do that, I want to give a little context on the market. First, theCUBE has covered every BDC since its inception, since the BDC's inception that is. It's a very intimate event, with a heavy emphasis on user content. Now, historically, the data engineers and DBAs in the Vertica community, they comprised the majority of the content at this event. And, that's going to be the same for this virtual, or digital, production. Now, theCUBE is going to be broadcasting for two days. What we're doing, is we're going to be concurrent with the Virtual BDC. We got practitioners that are coming on the show, DBAs, data engineers, database gurus, we got security experts coming on, and really a great lineup. And, of course, we'll also be hearing from Vertica execs, Colin Mahony himself right off the keynote, folks from product marketing, partners, and a number of experts, including some from Micro Focus, which is the, of course, owner of Vertica. But I want to take a moment to share a little bit about the history of Vertica. The company, as you know, was founded by Michael Stonebraker. And, Vertica started, really they started out as a SQL platform for analytics. It was the first, or at least one of the first, to really nail the MPP column store trend. Not only did Vertica have an early mover advantage in MPP, but the efficiency and scale of its software, relative to traditional DBMS, and also other MPP players, is underscored by the fact that Vertica, and the Vertica brand, really thrives to this day. But, I have to tell you, it wasn't without some pain. And, I'll talk a little bit about that, and really talk about how we got here today. So first, you know, you think about traditional transaction databases, like Oracle or IBM's DB2, or even enterprise data warehouse platforms like Teradata. They were simply not purpose-built for big data. Vertica was. Along with a whole bunch of other players, like Netezza, which was bought by IBM, Aster Data, which is now Teradata, Actian, ParAccel, which was the basis for Redshift, Amazon's Redshift, Greenplum was bought, in the early days, by EMC. And, these companies were really designed to run as massively parallel systems that smoked traditional RDBMS and EDW for particular analytic applications. You know, back in the big data days, I often joked that, like an NFL draft, there was a run on MPP players, like when you see a run on pulling guards. You know, once one goes, they all start to fall. And that's what you saw with the MPP columnar stores, IBM, EMC, and then HP getting into the game.
So, it was like 2011, and Leo Apotheker, he was the new CEO of HP. Frankly, he had no clue, in my opinion, what to do with Vertica, and totally missed one of the biggest trends of the last decade, the data trend, the big data trend. HP picked up Vertica for a song, it wasn't disclosed, but my guess is that it was around 200 million. So, rather than build a bunch of smart tokens around Vertica, which I always call the diamond in the rough, Apotheker basically permanently altered HP for years. He kind of ruined HP, in my view, with a 12 billion dollar purchase of Autonomy, which turned out to be one of the biggest disasters in recent M&A history. HP was forced to spin-merge, and ended up selling most of its software to Micro Focus. (laughs) Luckily, during its time at HP, CEO Meg Whitman was largely distracted with what to do with the mess that she inherited from Apotheker. So, Vertica was left alone. Now, the upshot is Colin Mahony, who was then the GM of Vertica, and still is. By the way, he's really the CEO, and he just doesn't have the title, I actually think they should give that to him. But anyway, he's been at the helm the whole time. And Colin, as you'll see in our interview, is a rockstar, he's got technical and business chops, people love him in the community. Vertica's culture is really engineering driven and they're all about data. Despite the fact that Vertica is a 15-year-old company, they've really kept pace, and not been polluted by legacy baggage. Vertica, early on, embraced Hadoop and the whole open-source movement. And that helped give it tailwinds. It leaned heavily into cloud, as we're going to talk about further this week. And they got a good story around machine intelligence and AI. So, whereas many traditional database players are really getting hurt, and some are getting killed, by cloud database providers, Vertica's actually doing a pretty good job of servicing its install base, and is in a reasonable position to compete for new workloads. On its last earnings call, the Micro Focus CFO, Stephen Murdoch, he said they're investing 70 to 80 million dollars in two key growth areas, security and Vertica. Now, Micro Focus is running its SUSE play on these two parts of its business. What I mean by that, is they're investing and allowing them to be semi-autonomous, spending on R&D and go to market. And, they have no hardware agenda, unlike when Vertica was part of HP, or HPE, I guess HP, before the spin out. Now, let me come back to the big trend in the market today. And there's something going on around analytic databases in the cloud. You've got companies like Snowflake and AWS with Redshift, as we've reported numerous times, and they're doing quite well, they're gaining share, especially of new workloads that are emerging, particularly in the cloud native space. They combine scalable compute, storage, and machine learning, and, importantly, they're allowing customers to scale compute and storage independent of each other. Why is that important? Because you don't have to buy storage every time you buy compute, or vice versa, in chunks. So, if you can scale them independently, you've got granularity. Vertica is keeping pace. In talking to customers, Vertica is leaning heavily into the cloud, supporting all the major cloud platforms, as we heard from Colin earlier today, adding Google.
And, while my research shows that Vertica has some work to do in cloud and cloud native to simplify the experience, its more robust and mature stack, which supports many different environments, you know, deep SQL, ACID properties, and DNA, allows Vertica to compete with these cloud-native database suppliers. Now, Vertica might lose out in some of those native workloads. But, I have to say, in my experience in talking with customers, if you're looking for a great MPP column store that scales and runs in the cloud, or on-prem, Vertica is in a very strong position. Vertica claims to be the only MPP columnar store to allow customers to scale compute and storage independently, both in the cloud and in hybrid environments on-prem, et cetera, cross clouds, as well. So, while Vertica may be at a disadvantage in a pure cloud native bake-off, its more robust and mature stack, combined with its multi-cloud strategy, gives Vertica a compelling set of advantages. So, we heard a lot of this from Colin Mahony, who announced Vertica 10.0 in his keynote. He really emphasized Vertica's multi-cloud affinity, its Eon Mode, which really allows that separation, or scaling of compute, independent of storage, both in the cloud and on-prem. Vertica 10, according to Mahony, is making big bets on in-database machine learning, he talked about that, and AI, along with some advanced regression techniques. He talked about PMML models, Python integration, which was actually something that they talked about doing with Uber and some other customers. Now, Mahony also stressed the trend toward object stores. And, Vertica now supports, let's see, S3 with Eon, Eon in Google Cloud in addition to AWS, and then Pure and HDFS as well; they all support Eon Mode. Mahony also stressed, as I mentioned earlier, a big commitment to on-prem and the whole cloud optionality thing. So 10.0, according to Colin Mahony, is all about really doubling down on these industry waves. As they say, enabling native PMML models, running them in Vertica, and really doing all the work that's required around ML and AI, they also announced support for TensorFlow. So, object store optionality is important, is what he talked about in Eon Mode, with the news of support for Google Cloud, as well as HDFS. And finally, a big focus on deployment flexibility. Migration tools, which are a critical focus really on improving ease of use, and you hear this from a lot of customers. So, these are the critical aspects of Vertica 10.0, and an announcement that we're going to be unpacking all week, with some of the experts that I talked about. So, I'm going to close with this. My long-time co-host, John Furrier, and I have talked for some time about this new cocktail of innovation. No longer is Moore's law the, really, mainspring of innovation. It's now about taking all these data troves, bringing machine learning and AI into that data to extract insights, and then operationalizing those insights at scale, leveraging cloud. And, one of the things I always look for from cloud is, if you've got a cloud play, you can attract innovation in the form of startups. It's part of the success equation, certainly for AWS, and I think it's one of the challenges for a lot of the legacy on-prem players. Vertica, I think, has done a pretty good job in this regard. And, you know, we're going to look this week for evidence of that innovation. One of the interviews that I'm personally excited about this week, is a new-ish company, I would consider them a startup, called Zebrium.
What they're doing, is they're applying AI to do autonomous log monitoring for IT ops. And, I'm interviewing Larry Lancaster, who's their CEO, this week, and I'm going to press him on why he chose to run on Vertica and not a cloud database. This guy is a hardcore tech guru and I want to hear his opinion. Okay, so keep it right there, stay with us. We're all over the Vertica Virtual Big Data Conference, covering in-depth interviews and following all the news. So, theCUBE is going to be interviewing these folks, two days, wall-to-wall coverage, so keep it right there. We're going to be right back with our next guest, right after this short break. This is Dave Vellante and you're watching theCUBE. (upbeat music)
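To make the Eon Mode discussion above more concrete, here is a hedged sketch of how an Eon Mode database is typically created with communal storage on S3. Host names, bucket, shard count, and depot sizing are placeholders; the flag names reflect recent Vertica releases and should be verified against your version's documentation.

```bash
# Sketch only: create an Eon Mode database whose communal storage lives in S3.
# auth_params.conf is assumed to hold the S3 credentials (awsauth, awsregion, etc.).
/opt/vertica/bin/admintools -t create_db \
  -d analytics \
  -s 10.0.0.1,10.0.0.2,10.0.0.3 \
  -x auth_params.conf \
  --communal-storage-location=s3://example-bucket/analytics \
  --depot-path=/vertica/depot \
  --depot-size=2T \
  --shard-count=12
```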
Dan Woicke, Cerner Corporation | Virtual Vertica BDC 2020
(gentle electronic music) >> Hello, everybody, welcome back to the Virtual Vertica Big Data Conference. My name is Dave Vellante and you're watching theCUBE, the leader in digital coverage. This is the Virtual BDC, as I said, theCUBE has covered every Big Data Conference from the inception, and we're pleased to be a part of this, even though it's challenging times. I'm here with Dan Woicke, the senior director of CernerWorks Engineering. Dan, good to see ya, how are things where you are in the middle of the country? >> Good morning, challenging times, as usual. We're trying to adapt to having the kids at home, out of school, trying to figure out how they're supposed to get on their laptop and do virtual learning. We all have to adapt to it and figure out how to get by. >> Well, it sure would've been my pleasure to meet you face to face in Boston at the Encore Casino, hopefully next year we'll be able to make that happen. But let's talk about Cerner and CernerWorks Engineering, what is that all about? >> So, CernerWorks Engineering, we used to be part of what's called IP, or Intellectual Property, which is basically the organization at Cerner that does all of our software development. But what we did was we made a decision about five years ago to organize my team with CernerWorks which is the hosting side of Cerner. So, about 80% of our clients choose to have their domains hosted within one of the two Kansas City data centers. We have one in Lee's Summit, in south Kansas City, and then we have one on our main campus that's a brand new one in downtown, north Kansas City. About 80, so we have about 27,000 environments that we manage in the Kansas City data centers. So, what my team does is we develop software in order to make it easier for us to monitor, manage, and keep those clients healthy within our data centers. >> Got it. I mean, I think of Cerner as a real advanced health tech company. It's the combination of healthcare and technology, the collision of those two. But maybe describe a little bit more about Cerner's business. >> So we have, like I said, 27,000 facilities across the world. Growing each day, thank goodness. And, our goal is to ensure that we reduce errors and we digitize the entire medical records for all of our clients. And we do that by having a consulting practice, we do that by having engineering, and then we do that with my team, which manages those particular clients. And that's how we got introduced to the Vertica side as well, when we introduced them about seven years ago. We were actually able to take a tremendous leap forward in how we manage our clients. And I'd be more than happy to talk deeper about how we do that. >> Yeah, and as we get into it, I want to understand, healthcare is all about outcomes, about patient outcomes and you work back from there. IT, for years, has obviously been a contributor but removed, and somewhat indirect from those outcomes. But, in this day and age, especially in an organization like yours, it really starts with the outcomes. I wonder if you could ratify that and talk about what that means for Cerner. >> Sorry, are you talking about medical outcomes? >> Yeah, outcomes of your business. >> So, there's two different sides to Cerner, right? There's the medical side, the clinical side, which is obviously our main practice, and then there's the side that I manage, which is more of the operational side. Both are very important, but they go hand in hand together. 
On the operational side, the goal is to ensure that our clinicians are on the system, and they don't know they're on the system, right? Things are progressing, doctors don't want to be on the system, trust me. My job is to ensure they're having the most seamless experience possible while they're on the EMR and have it just be one of their side jobs as opposed to taking their attention away from the patients. That make sense? >> Yeah it does, I mean, EMR and meaningful use, around the Affordable Care Act, really dramatically changed the unit. I mean, people had to demonstrate in order to get paid, and so that became sort of an unfunded mandate for folks and you really had to respond to that, didn't you? >> We did, we did that about three to four years ago. And we had to help our clients get through what's called meaningful use, there was different stages of meaningful use. And what we did, is we have the website called the Lights On Network which is free to all of our clients. Once you get onto the website the Lights On Network, you can actually show how you're measured and whether or not you're actually completing the different necessary tasks in order to get those payments for meaningful use. And it also allows you to see what your performance is on your domain, how the clinicians are doing on the system, how many hours they're spending on the system, how many orders they're executing. All of that is completely free and visible to our clients on the Lights On Network. And that's actually backed by some of the Vertica software that we've invested in. >> Yeah, so before we get into that, it sounds like your mission, really, is just great user experiences for the people that are on the network. Full stop. >> We do. So, one of the things that we invented about 10 years ago is called RTMS Timers. They're called Response Time Measurement System. And it started off as a way of us proving that clients are actually using the system, and now it's turned into more of a user outcomes. What we do is we collect 2.5 billion timers per day across all of our clients across the world. And every single one of those records goes to the Vertica platform. And then we've also developed a system on that which allows us in real time to go and see whether or not they're deviating from their normal. So we do baselines every hour of the week and then if they're deviating from those baselines, we can immediately call a service center and have them engage the client before they call in. >> So, Dan, I wonder if you could paint a picture. By the way, that's awesome. I wonder if you could paint a picture of your analytics environment. What does it look like? Maybe give us a sense of the scale. >> Okay. So, I've been describing how we operate, our remote hosted clients in the two Kansas City data centers, but all the software that we write, we also help our client hosted agents as well. Not only do we take care of what's going on at the Kansas City data center, but we do write software to ensure that all of clients are treated the same and we provide the same level of care and performance management across all those clients. So what we do is we have 90,000 agents that we have split across all these clients across the world. And every single hour, we're committing a billion rows to Vertica of operational data. So I talked a little bit about the RTMS timers, but we do things just like everyone else does for CPU, memory, Java Heap Stack. 
We can tell you how many concurrent users are on the system, I can tell you if there's an application that goes down unexpected, like a crash. I can tell you the response time from the network as most of us use Citrix at Cerner. And so what we do is we measure the amount of time it takes from the client side to PCs, it's sitting in the virtual data centers, sorry, in the hospitals, and then round trip to the Citrix servers that are sitting in the Kansas City data center. That's called the RTT, our round trip transactions. And what we've done is, over the last couple of years, what we've done is we've switched from just summarizing CPU and memory and all that high-level stuff, in order to go down to a user level. So, what are you doing, Dr. Smith, today? How many hours are you using the EMR? Have you experienced any slowness? Have you experienced any hourglass holding within your application? Have you experienced, unfortunately, maybe a crash? Have you experienced any slowness compared to your normal use case? And that's the step we've taken over the last few years, to go from summarization of high-level CPU memory, over to outcome metrics, which are what is really happening with a particular user. >> So, really granular views of how the system is being used and deep analytics on that. I wonder, go ahead, please. >> And, we weren't able to do that by summarizing things in traditional databases. You have to actually have the individual rows and you can't summarize information, you have to have individual metrics that point to exactly what's going on with a particular clinician. >> So, okay, the MPP architecture, the columnar store, the scalability of Vertica, that's what's key. That was my next question, let me take us back to the days of traditional RDBMS and then you brought in Vertica. Maybe you could give us a sense as to why, what that did for you, the before and after. >> Right. So, I'd been painting a picture going forward here about how traditionally, eight years ago, all we could do was summarize information. If CPU was going to go and jump up 8%, I could alarm the data center and say, hey, listen, CPU looks like it's higher, maybe an application's hanging more than it has been in the past. Things are a little slower, but I wouldn't be able to tell you who's affected. And that's where the whole thing has changed, when we brought Vertica in six years ago is that, we're able to take those 90,000 agents and commit a billion rows per hour operational data, and I can tell you exactly what's going on with each of our clinicians. Because you know, it's important for an entire domain to be healthy. But what about the 10 doctors that are experiencing frustration right now? If you're going to summarize that information and roll it up, you'll never know what those 10 doctors are experiencing and then guess what happens? They call the data center and complain, right? The squeaky wheels? We don't want that, we want to be able to show exactly who's experiencing a bad performance right now and be able to reach out to them before they call the help desk. >> So you're able to be proactive there, so you've gone from, Houston, we have a problem, we really can't tell you what it is, go figure it out, to, we see that there's an issue with these docs, or these users, and go figure that out and focus narrowly on where the problem is as opposed to trying to whack-a-mole. >> Exactly. And the other big thing that we've been able to do is corelation. So, we operate two gigantic data centers. 
And there's things that are shared, switches, network, shared storage, those things are shared. So if there is an issue that goes on with one of those pieces of equipment, it could affect multiple clients. Now that we have every row in Vertica, we have a new program in place called performance abnormality flags. And what we're able to do is provide a website in real time that goes through the entire stack from Citrix to network to database to back-end tier, all the way to the end-user desktop. And so if something is going on that's related, because we have a network switch going out in the data center or something's backing up slowly, you can actually see which clients are on that switch. And, what we did five years ago before this, is we would deploy out five different teams to troubleshoot, right? Because five clients would call in, and they would all have the same problem. So here you are, having disparate teams trying to investigate why the same problem is happening. And now that we have all of the data within Vertica, we're able to show that in a real-time fashion, through a very transparent dashboard. >> And so operational metrics throughout the stack, right? A game changer. >> It's very compact, right? I just labeled five different things, the stack from your end-user device all the way through the back-end to your database and all the way back. All that has to work properly, right? Including the network. >> How big is this, what are we talking about? However you measure it, terabytes, clusters. What can you share there? >> Sorry, you mean, the amount of data that we process within our data centers? >> Give us a fun fact. >> Absolutely, petabytes, yeah, for sure. And in Vertica right now we have two petabytes of data, and I purge it out every year, one year's worth of data within two different clusters. So, the two different data centers I've been describing, what we've done is we've set Vertica up to be in both data centers, to be highly redundant, and then one of those is configured to do real-time analysis and correlation research, and then the other one is to provide service towards what I described earlier as our Lights On Network, so it's a very dedicated hardened cluster in one of our data centers to allow the Lights On Network to provide the transparency directly to our clients. So we want that one to be pristine, fast, and nobody touching it. As opposed to the other one, where people are doing real-time, ad hoc queries, which sometimes aren't the best thing in the world. No matter what kind of database or how fast it is, people do bad things in databases and we just don't want that to affect what we show our clients in a transparent fashion. >> Yeah, I mean, for our audience, Vertica has always been aimed at these big, hairy, analytic problems, it's not for a tiny little data mart in a department, it's really the big scale problems. I wonder if I could ask you, so you guys, obviously, healthcare, with HIPAA and privacy, are you doing anything in the cloud, or is it all on-prem today? >> So, in the operational space that I manage, it's all on-premises, and that is changing. As I was describing earlier, we have an initiative to go to AWS and provide levels of service to countries like Sweden which does not want any operational data to leave that country's walls, whether it be operational data or whether it be PHI. And so, we have to be able to adapt into Vertica Eon Mode in order to provide the same services within Sweden.
So obviously, Cerner's not going to go up and build a data center in every single country that requires us, so we're going to leverage our partnership with AWS to make this happen. >> Okay, so, I was going to ask you, so you're not running Eon Mode today, it's something that you're obviously interested in. AWS will allow you to keep the data locally in that region. In talking to a lot of practitioners, they're intrigued by this notion of being able to scale independently, storage from compute. They've said they wish they could do that; it's a much more efficient way: I don't have to buy in chunks, if I'm out of storage, I don't have to buy compute, and vice versa. So, maybe you could share with us what you're thinking, I know it's early days, but what's the logic behind the business case there? >> I think you're 100% correct in your assessment of taking compute away from storage. And, we do exactly what you say, we buy a server. And it has so much compute on it, and so much storage. And obviously, it's not scaled properly, right? Either storage runs out first or compute runs out first, but you're still paying big bucks for the entire server itself. So that's exactly why we're doing the POC right now for Eon Mode. And I sit on Vertica's TAB, the advisory board, and they've been doing a really good job of taking our requirements and listening to us, as to what we need. And that was probably number one or two on everybody's lists, was to separate storage from compute. And that's exactly what we're trying to do right now. >> Yeah, it's interesting, I've talked to some other customers that are on the customer advisory board. And Vertica is one of these companies that are pretty transparent about what goes on there. And I think that for the early adopters of Eon Mode there were some challenges with getting data into the new system, I know Vertica has been working on that very hard, but you guys push Vertica pretty hard and from what I can tell, they listen. Your thoughts. >> They do listen, they do a great job. And even though the Big Data Conference is canceled, they're committed to having us go virtually to the CAB meeting on Monday, so I'm looking forward to that. They do listen to our requirements and they've been very, very responsive. >> Nice. So, I wonder if you could give us some final thoughts as to where you want to take this thing. If you look down the road a year or two, what does success look like, Dan? >> That's a good question. Success means that we're a little bit more nimble as far as the different regions across the world that we can provide our services to. I want to do more correlation. I want to gather more information about what users are actually experiencing. I want to be able to have our phone never ring in our data center, I know that's a grand thought there. But I want to be able to look forward to measuring the data internally and reaching out to our clients when they have issues and then doing the proper correlation so that I can understand how things are intertwining if multiple clients are having an issue. That's the goal going forward. >> Well, in these trying times, during this crisis, it's critical that your operations are running smoothly. The last thing that organizations need right now, especially in healthcare, is disruption. So thank you for all the hard work that you and your teams are doing. I wish you and your family all the best. Stay safe, stay healthy, and thanks so much for coming on theCUBE. >> I really appreciate it, thanks for the opportunity.
>> You're very welcome, and thank you, everybody, for watching, keep it right there, we'll be back with our next guest. This is Dave Vellante for theCUBE. Covering Virtual Vertica Big Data Conference. We'll be right back. (upbeat electronic music)
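The hourly-baseline monitoring Dan describes can be approximated with plain SQL. The sketch below assumes a hypothetical rtms_timers table (client_id, metric, ts, response_ms), not Cerner's actual schema, and flags clients whose average response time in the current hour sits well above their hour-of-week baseline.

```sql
-- Hypothetical schema; the pattern is a baseline-versus-current comparison.
WITH baseline AS (
    SELECT client_id,
           metric,
           DAYOFWEEK(ts) AS dow,
           HOUR(ts)      AS hr,
           AVG(response_ms)    AS base_avg_ms,
           STDDEV(response_ms) AS base_sd_ms
    FROM ops.rtms_timers
    WHERE ts >= NOW() - INTERVAL '28 days'
      AND ts <  DATE_TRUNC('hour', NOW())
    GROUP BY client_id, metric, DAYOFWEEK(ts), HOUR(ts)
)
SELECT t.client_id,
       t.metric,
       AVG(t.response_ms) AS current_avg_ms,
       b.base_avg_ms,
       b.base_sd_ms
FROM ops.rtms_timers t
JOIN baseline b
  ON  b.client_id = t.client_id
  AND b.metric    = t.metric
  AND b.dow       = DAYOFWEEK(t.ts)
  AND b.hr        = HOUR(t.ts)
WHERE t.ts >= DATE_TRUNC('hour', NOW())
GROUP BY t.client_id, t.metric, b.base_avg_ms, b.base_sd_ms
HAVING AVG(t.response_ms) > b.base_avg_ms + 3 * b.base_sd_ms;
```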
Gabriel Chapman, Pure Storage | Virtual Vertica BDC 2020
>> Yeah, it's theCUBE covering the Virtual Vertica Big Data Conference 2020. Brought to you by Vertica. >> Hi, everybody, and welcome to this CUBE special presentation of the Vertica Virtual Big Data Conference. theCUBE is running in parallel with day one and day two of the Vertica Big Data event. By the way, theCUBE has been at every single Big Data event, and it's our pleasure to be here in the virtual slash digital event as well. Gabriel Chapman is here. He's the Director of Flash Blade Product Solutions Marketing at Pure Storage. Great to see you. Thanks for coming on. >> Great to see you too. How's it going? >> It's going very well. I mean, I wish we were meeting in Boston at the Encore Hotel, but, uh, you know, hopefully we'll be able to meet at Accelerate at some point in the future, or one of the sub shows that you guys are doing, the regional shows, because we've been covering that show as well. But I really want to get into it. At the last Accelerate, September 2019, Pure and Vertica announced a partnership. I remember Joy ran up to me and said, hey, you got to check this out, the separation of compute and storage by Eon Mode, now available on Flash Blade. So, uh, and I believe still the only company that can support that separation and independent scaling both on-prem and in the cloud. So I want to ask, what were the trends in analytical databases and cloud that led to this partnership? You know, >> realistically, I think what we're seeing is that there's been kind of a larger shift when it comes to modern analytics platforms towards moving away from the traditional, you know, Hadoop-type architecture, where we were leveraging a lot of direct-attached storage, primarily because of the limitations of how that solution was architected. When we start to look at the larger trends towards, you know, how organizations want to do this type of work on premises, they're looking at solutions that allow them to scale the compute and storage pieces independently, and therefore, you know, the Flash Blade platform ended up being a great solution to support Vertica in their transition to Eon Mode, leveraging it essentially as an S3 object store. >> Okay, so let's circle back on that. You guys, in your announcement of the Flash Blade, you make the claim that Flash Blade is the industry's most advanced file and object storage platform ever. That's a bold statement. So defend that, what? >> I would like to go beyond that and just say, you know, we've really kind of looked at this from a standpoint of, you know, as we've developed Flash Blade as a platform, and keep in mind it's been a product that's been around for over three years now and has been very successful for Pure Storage. The reality is that fast file and fast object as a combined storage platform is a direction that many organizations are looking to go, and we believe that we're a leader in that fast object, fast file storage place. And realistically, we start to see more organizations start to look at building solutions that leverage cloud storage characteristics, but doing so on-prem for a multitude of different reasons. We've built a platform that really addresses a lot of those needs around simplicity, around, you know, making things easier; that, you know, fast matters for us, simple is smart. Um, we can provide, you know, cloud integrations across the spectrum. And, you know, there's a subscription model that fits into that as well.
That falls into our umbrella of what we consider the modern data experience, and it's something that we've built into the entire Pure portfolio. >> Okay, so I want to get into the architecture of Flash Blade a little bit and then understand the fit for, uh, analytic databases generally, but specifically for Vertica. So it is a blade, so you've got compute and network included. It's a key-value-store-based system, so you're talking about scale out, unlike, uh, Pure's sort of, you know, initial products, which were scale up. Um, and it is a fabric-based system. I want to understand what that all means, so take us through the architecture, you know, some of the quote unquote firsts that you guys talk about. So let's start with sort of the blade >> aspect. Yeah, the blade aspect of what we call the Flash Blade. Because if you look at the actual platform, you have, ah, primarily a chassis with built-in networking components, right? So there's, ah, a fabric interconnect inside the platform that connects to each one of the individual blades. Individual blades have their own compute that drives, basically, Pure Storage flash components inside. It's not like we're just taking SSDs and plugging them into a system like you would with a traditional commodity off-the-shelf hardware design. This is very much an engineered solution that is built towards the characteristics that we believe are important with fast file and fast object: scalability, massive parallelization when it comes to performance, and the ability to really kind of grow and scale from essentially seven blades right now to 150. That's the kind of scale that customers are looking for, especially as we start to address these larger analytics pools. They are multi-petabyte data sets, you know, that single addressable object space and, you know, file performance that is beyond what most of your traditional scale-up storage platforms are able to deliver. >> Yes, I interviewed Coz last September at Accelerate, and, you know, Pure has been attacked by some of the competitors as not having scale out. I asked him his thoughts on that, and he said, well, first of all, our Flash Blade is scale out. He said, look, anything that adds complexity, you know, we avoid. But for the workloads that are associated with Flash Blade, scale out is the right sort of approach. Maybe you could talk about why that is. Well,
realistically, I think, you know, that approach is better when we're starting to work with large, unstructured data sets. I mean, Flash Blade is unique. It's architected to allow customers to achieve superior resource utilization for compute and storage, while at the same time, you know, significantly reducing the complexity that has arisen around this kind of bespoke or siloed nature of big data and analytics solutions. I mean, we really kind of look at this from the standpoint of, you have built and delivered and created applications in the public cloud space that address, you know, object storage and unstructured data. And for some organizations, the importance is bringing that on-prem. I mean, we do see a bit of repatriation coming on with a lot of organizations as these data egress charges continue to expand and grow, um, and then organizations that want even higher performance than what we're able to get in the public cloud space. They are bringing that data back on-prem. They are looking at it from the standpoint of, we still want to be able to scale the way we scale in the cloud.
We still want to operate the same way we operate in the cloud, but we want to do it within the control of our own, our own borders. And so that's, you know, that's one of the bigger pieces to that. And we start to look at, how do we address cloud characteristics and dynamics and consumption metrics or models, as well as the benefits and efficiencies of scale that they're able to afford, but allowing customers to do that inside their own data center. >> So you were talking about the trends earlier. You have these cloud native databases that allowed the scaling of compute and storage independently. Vertica comes in with Eon. A lot of times we talk about these partnerships as Barney deals of, you know, I love you, you love me, here's a press release and then we go on, or they're just straight, you know, go to market. Are there other aspects of this partnership that are non-Barney-deal-like? In other words, any specific engineering, um, you know, other go-to-market programs? Could you talk about that a little bit? Yeah, >> it's, it's more than just what we consider a channel meet-in-the-middle or, you know, that Barney type of deal. Realistically, you know, we've done some firsts with Vertica that I think really count if you look at the architecture and what we've brought to market together. Ah, we have solutions teams in the back end who are, you know, subject matter experts in this space. If you talk to Joy and the people from Vertica, they're very high on and very excited about the partnership, because it opens up a new set of opportunities for their customers to leverage Eon Mode and get into some of the nuanced aspects of how they leverage the depot within each individual compute node, and adjustments within their reach, additional performance gains for customers on-prem, and at the same time, for them, there's still the ability to go into that cloud model if they wish to. And so I think a lot of it is around, how do we partner as two companies? How do we do joint selling motions? How do we show up and do white papers and all of the traditional marketing aspects that we bring to the market? And then, you know, joint selling opportunities exist where they are. And so that's, realistically, I think, like any other organization that's going to market with a partner or an MSP that they have, ah, a strong partnership with. You'll continue to see us, you know, talking about those mutually beneficial relationships and the solutions that we're bringing to the market. >> Okay, you know, of course, you used to be a Gartner analyst, and you go to the vendor side now, but as a former Gartner analyst, you're obviously objective. You see it all, and, you know, well, there's a lot of ways to skin the cat; there are strengths, weaknesses, opportunities, threats, etcetera for every vendor. So you have Vertica, who's got a very mature stack, and talking to a number of the customers out there who are using Eon Mode, you know, there's certain workloads where these cloud native databases make sense. It's not just the economics of scaling compute and storage independently, I want to talk more about that, there's a flexibility aspect as well. But Vertica really has to play its trump card, which is, look, we've got a big on-premises estate, and we're going to bring that Eon capability both on-prem and we're embracing the cloud now.
They obviously have had to play catch up in the cloud, but at the same time, they've got a much more mature stack than a lot of these other cloud native databases that might have just started a couple of years ago. So, you know, there's trade-offs that customers have to make. How do you sort through that? Where do you see the interest in this? And what's the sweet spot for this partnership? You know, we've >> been really excited to build the partnership with Vertica, and, you know, we're really proud to provide pretty much the only on-prem storage platform that's validated with Eon Mode to deliver a modern data experience for our customers together. You know, it's, ah, it's that partnership that allows us to go into customers in that on-prem space, where I think that there's still, not to say that not everybody wants to go there, but I think there's aspects and solutions that work very well there. But for the vast majority, I still think that there's, you know, your data center is not going away. And you do want to have control over some, or many, of the assets inside of the operational confines. So therefore, we start to look at, how do we do the best of what cloud offers, but on-prem. And that's realistically where we start to see the stronger push from those customers. They still want to manage their data locally, as well as maybe even work around some of the restrictions that they might have around cost and complexity, hiring, you know, the different types of skill sets that are required to bring applications purely cloud native. It's still that larger part of that digital transformation that many organizations are going forward with. And realistically, I think they're taking a look at the pros and cons, and we've been doing cloud long enough where people recognize that, you know, it's not perfect for everything and that there's certain things that we still want to keep inside our own data center. So I mean, realistically, as we move forward, that's, ah, the better option when it comes to a modern architecture that can, you know, deliver and address a diverse set of performance requirements and allow the organization to continue to grow the model to the data, you know, based on the data that they're actually trying to leverage. And that's really what Flash Blade was built for. It was built to be a platform that could address small files or large files, or high throughput, low latency, scale to petabytes in a single namespace, in a single rack, as we like to put it. I mean, we see customers that have put 150 flash blades into production as a single namespace. It's significant for organizations that are making that drive towards the modern data experience with modern analytics platforms. Pure and Vertica have delivered an experience that can address that for a wide range of customers that are implementing, uh, you know, particularly Eon technology. >> I'm interested in exploring the use case a little bit further. You just sort of gave some parameters and some examples and some of the flexibility that you have, um, and take us through kind of what the customer discussions are like. Obviously you've got a big customer base, you and Vertica, that's on-prem. That's the unique advantage of this. But there are others. It's not just the economics of the granular scaling of compute and storage independently. There are other aspects, so take us through that, sort of a primary use case or use cases.
Yeah, you >> know, I mean, I could give you a couple customer examples. We have a large SaaS analytics company which uses Vertica on Flash Blade to authenticate the quality of digital media in real time. You know, for them it makes a big difference, as they're doing their streaming and whatnot, that they can fine-tune the granular control of that. So that's one aspect that we address. We have a multinational car company which uses Vertica on Flash Blade to make thousands of decisions per second for autonomous vehicle decision-making trees. You know, that's what these new modern analytics platforms were really built for. Um, there's another healthcare organization that uses Vertica on Flash Blade to enable healthcare providers to make decisions in real time that impact lives, especially when we start to look at, you know, the current state of affairs with COVID and the coronavirus. You know, those types of technologies are really going to help us kind of bend that curve downward. So, you know, there's all these different areas where we can address the goals and the achievements that we're trying to move forward with, with real-time analytics decision-making tools. And, you know, realistically, as we have these conversations with customers, they're looking to get beyond the ability of just, you know, a data scientist or a data architect looking to just kind of drive information. >> And we were talking about Hadoop earlier. We're kind of going well beyond that now. And I guess what I'm saying is that in the first phase of cloud, it was all about infrastructure. It was about, you know, uh, spin it up, you know, compute and storage and a little bit of networking in there. It seems like the next new workload that's clearly emerging is, you've got, and it started with the cloud native databases, but then bringing in, you know, AI and machine learning tooling on top of that, ah, and then being able to really drive these new types of insights. And it's really about taking data, this bog of data that we've collected over the last 10 years, a lot of that driven by Hadoop, bringing machine intelligence into the equation, scaling it with either public cloud or bringing that cloud experience on-prem, and scaling, you know, across organizations and across your partner network. That really is a new emerging workload. Do you see that? And maybe talk a little bit about what you're seeing with customers. >> Yeah, I mean, it really is. We see several trends. You know, one of those is the ability to take this approach and move it out of the lab and into production. Um, you know, especially when it comes to data science projects, machine learning projects that traditionally start out as kind of small proofs of concept, easy to spin up in the cloud. But when a customer wants to scale and move towards deriving real, you know, significant value from that, they do want to be able to control more of the characteristics of it. And we know machine learning, you know, needs to learn from massive amounts of data to provide accuracy. There's just too much data to be retrieving from the cloud for every training job. At the same time, predictive analytics without accuracy is not going to deliver the business advantage that everyone is seeking. You know, we see this, ah, the visualization of data analytics as traditionally deployed as being on a continuum with, you know, the things that we've been doing in the past with data warehousing and data lakes, and AI on the other end. But this way, we're starting to manifest it in organizations that are looking towards getting more utility and better elasticity out of the data that they are working with. So they're not looking to just build up silos of bespoke AI environments. They're looking to leverage, ah, you know, a platform that can allow them to, you know, do AI for one thing, machine learning for another, and leverage multiple protocols to access that data, because the tools are so much different. Um, you know, it is a growing diversity of use cases that you can put on a single platform, which I think organizations are looking for as they try to scale these environments. >> I think it's going to be a big growth area in the coming years. Gabe, I wish we were in Boston together. You would have painted your little corner of Boston orange, I know that you guys have. But really appreciate you coming on theCUBE. Wall-to-wall coverage, two days, of the Vertica Virtual Big Data Conference. Keep it right there. We'll be right back, right after this short break. Yeah.
Vertica Database Designer - Today and Tomorrow
>> Jeff: Hello everybody and thank you for joining us today for the Virtual VERTICA BDC 2020. Today's breakout session has been titled, "VERTICA Database Designer Today and Tomorrow." I'm Jeff Healey, Product VERTICA Marketing, I'll be your host for this breakout session. Joining me today is Yuanzhe Bei, Senior Technical Manager from VERTICA Engineering. But before we begin, (clearing throat) I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation. We'll answer as many questions, as we're able to during that time, any questions we don't address, we'll do our best to answer them offline. Alternatively, visit VERTICA forums at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forums, to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button at the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We will send you a notification as soon as it's ready. Now let's get started. Over to you Yuanzhe. >> Yuanzhe: Thanks Jeff. Hi everyone, my name is Yuanzhe Bei, I'm a Senior Technical Manager at VERTICA Server RND Group. I run the query optimizer, catalog and the disaggregated engine team. Very glad to be here today, to talk about, the "VERTICA Database Designer Today and Tomorrow". This presentation will be organized as the following; I will first refresh some knowledge about, VERTICA fundamentals such as Tables and Projections, which will bring to the question, "What is Database Designer?" and "Why we need this tool?". Then I will take you through a deep dive, into a Database Designer or we call DBD, and see how DBD's internals works, after that I'll show you some exciting DBD improvements, we have planned for 10.0 release and lastly, I will share with you, some DBD future roadmap we planned next. As most of you should already know, VERTICA is built on a columnar architecture. That means, data is stored column wise. Here we can see a very simple example, of table with four columns, and the many of you may also know, table in VERTICA is a virtual concept. It's just a logical representation of data, which means user can write SQL query, to reference the table names and column, just like other relational database management system, but the actual physical storage of data, is called Projection. A Projection can reference a subset, or all of the columns all to its anchor table, and must be sorted by at least one column. Each table need at least one C for projection which reference all the columns to the table. If you load data to a table with no projection, and automated, auto production will be created, which will be arbitrarily assorted by, the first couple of columns in the table. As you can imagine, even though such other production, can be used to answer any query, the performance is not optimized in most cases. A common practice in VERTICA, is to create multiple projections, contain difference step of column, and sorted in different ways on the same table. When query is sent to the server, the optimizer will pick the projection, that can answer the query in the most efficient way. 
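(For illustration, here is a minimal sketch in SQL of what several projections on one anchor table can look like. The table, columns, and projection definitions are invented for this sketch, and encoding clauses are omitted; the third projection is set up to match the kind of query discussed next.)

CREATE TABLE sales (a INT, b INT, c INT, d INT);

-- Superprojection covering all columns, sorted one way.
CREATE PROJECTION sales_p1 AS
  SELECT a, b, c, d FROM sales
  ORDER BY a, b
  SEGMENTED BY HASH(a) ALL NODES;

-- A narrower projection sorted by c for a different family of queries.
CREATE PROJECTION sales_p2 AS
  SELECT c, a FROM sales
  ORDER BY c
  SEGMENTED BY HASH(c) ALL NODES;

-- A third projection sorted by b, d.
CREATE PROJECTION sales_p3 AS
  SELECT b, d, c FROM sales
  ORDER BY b, d
  SEGMENTED BY HASH(b) ALL NODES;

SELECT REFRESH('sales');  -- populate the new projections from existing data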
For example, here you can say, let's say you have a query, that select columns B, D, C and sorted by B and D, the third projection will be ideal, because the data is already sorted, so you can save the sorting costs while executing the query. Basically when you choose the design of the projection, you need to consider four things. First and foremost, of course the sort order. The data already sorted in the right way, can benefit quite a lot of the query actually, like Ordered by, Group By, Analytics, Merge, Join, Predicates and so on. The select column group is also important, because the projection must contain, all the columns referenced by your workflow query. Even missing one column in the projection, this projection cannot be used for a particular query. In addition, VERTICA is the distributed database, and allow projection to be segmented, based on the hash of a set of columns, which is beneficial if the segmentation merged, the join keys or group keys. And finally encoding of each per columns is also part of the design, because the data is sorted in different way, may completely change the optimal encoding for each column. This example only show the benefit of the first two, but you can imagine the rest too are also important. But even for that, it doesn't sound that hard, right? Well I hope you change your mind already when you see this, at least I do. These machine generated queries, really beats me. It will probably take an experienced DBA hours, to figure out which projection can be benefit these queries, not even mentioning there could be hundreds of such queries, in the regular work logs in the real world. So what can we do? That's why we need DBD. DBD is a tool integrated in the VERTICA server, that it can help DBA to perform an access, on their work log query, tabled schema and data, and then automatically figure out, the most optimized projection design for their workload. In addition, DBD also a sophisticated tool, that can take customize by a user, by sending a lot of parameters objectives and so on. And lastly, DBD has access to the optimizer, so DB knows what kind of attribute, the projection need to have, in order to have the optimizer to benefit from them. DBD has been there for years, and I'm sure there are plenty of materials available online, to show you how DBD can be used in different scenarios, whether to achieve the query optimize, or load optimize, whether it's the comprehensive design, or the incremental design, whether it's a dumping deployment script, and manual deployment later, or let the DBD do the order deployment for you, and the many other options. I'm not planning to talk about this today, instead, I will take the opportunity today, to open this black box DBD, and show you what exactly hide inside. DBD is a complex tool and I have tried my best to summarize the DBD design process into seven steps; Extract, Permute, Prune, Build, Score, Identify and Encode. What do they mean? Don't worry, I will show you step by step. The first step is Extract. Extract Interesting Columns. In this step, DBD pass the design queries, and figure out the operations that can be benefited, by the potential projection design, and extract the corresponding columns, as interesting columns. So Predicates, Group By, Order By, Joint Condition, and analytics are all interesting Column to the DBD. As you can see this three simple sample queries, DBD can extract the interest in column sets on the right. Some of these column sets are unordered. 
For example, the green one for Group By a1 and b1, the DBD extracts the interesting column set, and put them in the own orders set, because either data sorted by a1 first or b1 first, can benefit from this Group By operation. Some of the other sets are ordered, and the best example is here, order by clause a2 and b2, and obviously you cannot sort it by b2 and then a2. These interesting columns set will be used as if, to extend to actual projection sort order candidates. The next step is Permute, once DBD extract all the C's, it will enumerate sort order using C, and how does DBD do that? I'm starting with a very simple example. So here you can see DBD can enumerate two sort orders, by extending d1 with the unordered set a1, b1, and the derived at two sort order candidates, d1, a1, b1, and d1, b1, a1. This sort order can benefit queries with predicate on d1, and also benefit queries by Group By a1, b1, when a1, sorry when d1 is constant. So with the same idea, DBD will try to extend other States with each other, and populate more sort order permutations. You can imagine that how many of them, there could be many of them, these candidates, based on how many queries you have in the design and that can be handled of the sort order candidates. That comes to the third step, which is Pruning. This step is to limit the candidates sort order, so that the design won't be running forever. DBD uses very simple capping mechanism. It sorts all the, sort all the candidates, are ranked by length, and only a certain number of the sort order, with longest length, will be moved forward to the next step. And now we have all the sort orders candidate, that we want to try, but whether this sort order candidate, will be actually be benefit from the optimizer, DBD need to ask the optiizer. So this step before that happens, this step has to build those projection candidate, in the catalog. So this step will build, will generates the projection DBL's, surround the sort order, and create this projection in the catalog. These projections won't be loaded with real data, because that takes a lot of time, instead, DBD will copy over the statistic, on existing projections, to this projection candidates, so that the optimizer can use them. The next step is Score. Scoring with optimizer. Now projection candidates are built in the catalog. DBD can send a work log queries to optimizer, to generate a query plan. And then optimizer will return the query plan, DBD will go through the query plan, and investigate whether, there are certain benefits being achieved. The benefits list have been growing over time, when optimizer add more optimizations. Let's say in this case because the projection candidates, can be sorted by the b1 and a1, it is eligible for Group By Pipe benefit. Each benefit has a preset score. The overall benefit score of all design queries, will be aggregated and then recorded, for each projection candidate. We are almost there. Now we have all the total benefit score, for the projection candidates, we derived on the work log queries. Now the job is easy. You can just pick the sort order with the highest score as the winner. Here we have the winner d1, b1 and a1. Sometimes you need to find more winners, because the chosen winner may only benefit a subset, of the work log query you provided to the DBD. So in order to have the rest of the queries, to be also benefit, you need more projections. 
So in this case, DBD will go to the next iteration, and let's say in this case find to another winner, d1, c1, to benefit the work log queries, that cannot be benefit by d1, b1 and a1. The number of iterations and thus the winner outcome, DBD really depends on the design objective that uses that. It can be load optimized, which means that only one, super projection winner will be selected, or query optimized, where DBD try to create as many projections, to cover most of the work log queries, or somewhat balance an objective in the middle. The last step is to decide encoding, for each projection columns, for the projection winners. Because the data are sorted differently, the encoding benefits, can be very different from the existing projection. So choose the right projection encoding design, will save the disk footprint a significant factor. So it's worth the effort, to find out the best thing encoding. DBD picks the encoding, based on the actual sampling the data, and measure the storage footprint. For example, in this case, the projection winner has three columns, and say each column has a few encoding options. DBD will write the sample data in the way this projection is sorted, and then you can see with different encoding, the disk footprint is different. DBD will then compare the disk footprint of each, of different options for each column, and pick the best encoding options, based on the one that has the smallest storage footprint. Nothing magical here, but it just works pretty well. And basic that how DBD internal works, of course, I think we've heard it quite a lot. For example, I didn't mention how the DBD handles segmentation, but the idea is similar to analyze the sort order. But I hope this section gave you some basic idea, about DBD for today. So now let's talk about tomorrow. And here comes the exciting part. In version 10.0, we significantly improve the DBD in many ways. In this talk I will highlight four issues in old DBD and describe how the 10.0 version new DBD, will address those issues. The first issue is that a DBD API is too complex. In most situations, what user really want is very simple. My queries were slow yesterday, with the new or different projection can help speed it up? However, to answer a simple question like this using DBD, user will be very likely to have the documentation open on the side, because they have to go through it's whole complex flow, from creating a projection, run the design, get outputs and then create a design in the end. And that's not there yet, for each step, there are several functions user need to call in order. So adding these up, user need to write the quite long script with dozens of functions, it's just too complicated, and most of you may find it annoying. They either manually tune the projection to themselves, or simply live with the performance and come back, when it gets really slow again, and of course in most situations, they never come back to use the DBD. In 10.0 VERTICA support the new simplified API, to run DBD easily. There will be just one function designer_single_run and one argument, the interval that you think, your query was slow. In this case, user complained about it yesterday. So what does this user to need to do, is just specify one day, as argument and run it. The user don't need to provide anything else, because the DBD will look up his query or history, within that time window and automatically populate design, run design and export the projection design, and the clean up, no user intervention needed. 
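(A rough sketch of the difference in SQL. The multi-step function names below reflect my understanding of the existing DBD API with their parameters simplified, and the design name and file paths are invented; the single-call form shows only the one interval argument described here, so treat the exact syntax as illustrative.)

-- The older flow: several DESIGNER_* calls that must be run in order.
SELECT DESIGNER_CREATE_DESIGN('my_design');
SELECT DESIGNER_ADD_DESIGN_TABLES('my_design', 'public.sales');
SELECT DESIGNER_ADD_DESIGN_QUERIES('my_design', '/home/dbadmin/slow_queries.sql', TRUE);
SELECT DESIGNER_SET_DESIGN_TYPE('my_design', 'comprehensive');
SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('my_design',
       '/home/dbadmin/my_design_projections.sql',
       '/home/dbadmin/my_design_deployment.sql');
SELECT DESIGNER_DROP_DESIGN('my_design');

-- The 10.0 simplified form: one call that designs from the last day of query history.
SELECT DESIGNER_SINGLE_RUN('1 day');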
No need to have the documentation on the side and carefully write a script, and a debug, just one function call. That's it. Very simple. So that must be pretty impressive, right? So now here comes to another issue. To fully utilize this single round function, users are encouraged to run DBD on the production cluster. However, in fact, VERTICA used to not recommend, to run a design on a production cluster. One of the reasons issue, is that DBD picks massive locks, both table locks and catalog locks, which will badly interfere the running workload, on a production cluster. As of 10.0, we eliminated all the table and ten catalog locks from DBD. Yes, we eliminate 100% of them, simple improvement, clear win. The third issue, which user may not be aware of, is that DBD writes intermediate result. into real VERTICA tables, the real DBD have to do that is, DBD is the background task. So the intermediate results, some user needs to monitor it, the progress of the DBD in concurrent session. For complex design, the intermediate result can be quite massive, and as a result, many lost files will be created, and written to the disk, and we should both stress, the catalog, and that the disk can slow down the design. For ER mode, it's even worse because, the table are shared on communal storage. So writing to the regular table, means that it has to upload the data, to the communal storage, which is even more expensive and disruptive. In 10.0, we significantly restructure the intermediate results buffer, and make this shared in memory data structure. Monitoring queries will go directly look up, in memory data structure, and go through the system table, and return the results. No Intermediate Results files will be written anymore. Another expensive lubidge of local disk for DBD is encoding design, as I mentioned earlier in the deep dive, to determine which encoding works the best for the new projection design, there's no magic way, but the DBD need to actually write down, the sample data to the disk, using the different encoding options, and to find out which ones have the smallest footprint, or pick it as the best choice. These written sample data will be useless after this, and it will be wiped out right away, and you can imagine this is a huge waste of the system resource. In 10.0 we improve this process. So instead of writing, the different encoded data on the disk, and then read the file size, DBD aggregate the data block size on-the-fly. The data block will not be written to the disk, so the overall encoding and design is more efficient and non-disruptive. Of course, this is just about the start. The reason why we put a significant amount of the resource on the improving the DBD in 10.0, is because the VERTICA DBD, as essential component of the out of box performance design campaign. To simply illustrate the timeline, we are now on the second step, where we significantly reduced, the running overhead of the DBD, so that user will no longer fear, to run DBD on their production cluster. Please be noted that as of 10.0, we haven't really started changing, how DBD design algorithm works, so that what we have discussed in the deep dive today, still holds. For the next phase of DBD, we will briefly make the design process smarter, and this will include better enumeration mechanism, so that the pruning is more intelligence rather than brutal, then that will result in better design quality, and also faster design. The longer term is to make DBD to achieve the automation. 
What entail automation and what I really mean is that, instead of having user to decide when to use DBD, until their query is slow, VERTICA have to know, detect this event, and have have DBD run automatically for users, and suggest the better projections design, if the existing projection is not good enough. Of course, there will be a lot of work that need to be done, before we can actually fully achieve the automation. But we are working on that. At the end of day, what the user really wants, is the fast database, right? And thank you for listening to my presentation. so I hope you find it useful. Now let's get ready for the Q&A.
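(If you want to verify what a deployed design is doing for a particular query, the plan shows which projection the optimizer chose. A small sketch, reusing the invented table from earlier; system table column names may differ slightly by version.)

-- The query plan names the projection that was picked.
EXPLAIN SELECT b, d, c FROM sales ORDER BY b, d;

-- The projections system table lists what the design deployed.
SELECT projection_name, anchor_table_name, is_up_to_date
FROM projections
WHERE anchor_table_name = 'sales';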
Sizing and Configuring Vertica in Eon Mode for Different Use Cases
>> Jeff: Hello everybody, and thank you for joining us today, in the virtual Vertica BDC 2020. Today's Breakout session is entitled, "Sizing and Configuring Vertica in Eon Mode for Different Use Cases". I'm Jeff Healey, and I lead Vertica Marketing. I'll be your host for this Breakout session. Joining me are Sumeet Keswani, and Shirang Kamat, Vertica Product Technology Engineers, and key leads on the Vertica customer success needs. But before we begin, I encourage you to submit questions or comments during the virtual session, you don't have to wait, just type your question or comment in the question box below the slides, and click submit. There will be a Q&A session at the end of the presentation, we will answer as many questions as we're able to during that time, any questions we don't address, we'll do our best to answer them off-line. Alternatively, visit Vertica Forums, at forum.vertica.com, post your question there after the session. Our Engineering Team is planning to join the forums to keep the conversation going. Also as reminder, that you can maximize your screen by clicking the double arrow button in the lower-right corner of the slides, and yes, this virtual session is being recorded, and will be available to view on-demand this week. We'll send you a notification as soon as it's ready. Now let's get started! Over to you, Shirang. >> Shirang: Thanks Jeff. So, for today's presentation, we have picked Eon Mode concepts, we are going to go over sizing guidelines for Eon Mode, some of the use cases that you can benefit from using Eon Mode. And at last, we are going to talk about, some tips and tricks that can help you configure and manage your cluster. Okay. So, as you know, Vertica has two modes of operation, Eon Mode and Enterprise Mode. So the question that you may have is, which mode should I implement? So let's look at what's there in the Enterprise Mode. Enterprise Mode, you have a cluster, with general purpose compute nodes, that have locally at their storage. Because of this tight integration of compute and storage, you get fast and reliable performance all the time. Now, amount of data that you can store in Enterprise Mode cluster, depends on the total disk capacity of the cluster. Again, Enterprise Mode is more suitable for on premise and cloud deployments. Now, let's look at Eon Mode. To take advantage of cloud economics, Vertica implemented Eon Mode, which is getting very popular among our customers. In Eon Mode, we have compute and storage, that are separated by introducing S3 Bucket, or, S3 compliant storage. Now because of this separation of compute and storage, you can take advantages like mapping all dynamic scale-out and scale-in. Isolation of your workload, as well as you can load data in your cluster, without having to worry about the total disk capacity of your local nodes. Obviously, you know, it's obvious from what they accept, Eon Mode is suitable for cloud deployment. Some of our customers who take advantage of the features of Eon Mode, are also deploying it on premise, by introducing S3 compliant slash web storage. Okay? So, let's look at some of the terminologies used in Eon Mode. The four things that I want to talk about are, communal storage. It's a shared storage, or S3 compliant shared storage, a bucket that is accessible from all the nodes in your cluster. Shard, is a segment of data, stored on the communal storage. Subscription, is the binding with nodes and shards. And last, depot. 
Depot is a local copy or, a local cache, that can help query in group performance. So, shard is a segment of data stored in communal storage. When you create a Eon Mode cluster, you have to specify the shard count. Shard count decide the maximum number of nodes that will participate in your query. So, Vertica also will introduce a shard, called replica shard, that will hold the data for replicated projections. Subscriptions, as I said before, is a binding between nodes and shards. Each node subscribes to one or more shards, and a shard has at least two nodes that subscribe to it for case 50. Subscribing nodes are responsible for writing and reading from shard data. Also subscriber node holds up-to-date metadata for a catalog of files that are present in the shard. So, when you connect to Vertica node, Vertica will automatically assign you set of nodes and subscriptions that will process your query. There are two important system tables. There are node subscriptions, and session subscriptions, that can help you understand this a little bit more. So let's look at what's on the local disk of your Eon Mode cluster. So, on local disk, you have depot. Depot is a local file system cache, that can hold subset of the data, or copy of the data, in communal storage. Other things that are there, are temp storage, temp storage is used for storing data belonging to temporary tables, and, the data that spills through this, when you are processing queries. And last, is catalog. Catalog is a persistent copy of Vertica, catalog that is written to this. The writes happen at every commit. You only need the persistent copy at node startup. There is also a copy of Vertica catalog, stored in communal storage, called durability. The local copy is synced to the copy in communal storage via service, at the interval of five minutes. So, let's look at depot. Now, as I said before, depot is your file system cache. It's help to reduce network traffic, and slow performance of your queries. So, we make assumption, that when we load data in Vertica, that's the data that you may most frequently query. So, every data that is loaded in Vertica is first entering the depot, and then as a part of same transaction, also synced to communal storage for durability. So, when you query, when you run a query against Vertica, your queries are also going to find the files in the depot first, to be used, and if the files are not found, the queries will access files from communal storage. Now, the behavior of... you know, the new files, should first enter the depot or skip depot can be changed by configuration parameters that can help you skip depot when writing. When the files are not found in depot, we make assumption that you may need those files for future runs of your query. Which means we will fetch them asynchronously into the depot, so that you have those files for future runs. If that's not the behavior that you intend, you can change configuration around return, to tell Vertica to not fetch them when you run your query, and this configuration parameter can be set at database level, session level, query level, and we are also introducing a user level parameter, where you can change this behavior. Because the depot is going to be limited in size, compared to amount of data that you may store in your Eon cluster, at some point in time, your depot will be full, or hit the capacity. To make space for new data that is coming in, Vertica will evict some of the files that are least frequently used. 
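(A quick, hedged sketch of how you might look at these pieces from SQL. The two subscription tables are the ones named above; the depot-related parameter name is an assumption on my part rather than something stated in this talk, so check the documentation for your release.)

-- Which nodes are subscribed to which shards.
SELECT * FROM node_subscriptions;

-- Which shard subscriptions the current session was assigned for its queries.
SELECT * FROM session_subscriptions;

-- Session-level control over whether queries fetch missing files into the depot
-- (parameter name assumed, not taken from this talk).
ALTER SESSION SET DepotOperationsForQuery = 'FETCHES';

(These views describe the shard-to-node bindings; the depot itself is what keeps hot data local to the compute.)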
Hence, the depot is going to be your query performance enhancer. You want to shape the extent of your depot, and so what you want to do is decide what shall be in your depot. Vertica provides policies, called pinning policies, that can help you pin a specific table, or a partition of a table, into the depot, at the subcluster level or at the database level. And Sumeet will talk about this a bit more in his later slides. Now look at some of the system tables that can help you understand the size of the depot, what's in your depot, what files were evicted, and what files were recently fetched into the depot. One of the important system tables that I have listed here is DC_FILE_READS. DC_FILE_READS can be used to figure out whether your transaction or query fetched its data from the depot or from communal storage. One of the important features of Eon Mode is the subcluster. Vertica lets you divide your cluster into smaller execution groups. Each of the execution groups has a set of nodes that together subscribe to all the shards, and can process your query independently. So when you connect to one node in the subcluster, that node, along with the other nodes in the subcluster, will process only your query. And because of that, we can achieve isolation, as well as, you know, fast scale-out and scale-in, without impacting what's happening on the rest of the cluster. The good thing about subclusters is that all the subclusters have access to the communal storage, and because of this, if you load data in one subcluster, it's accessible to the queries that are running in other subclusters. When we introduced subclusters, we knew that our customers would really love these features, and some of the things that we were considering were: we knew that our customers would dynamically scale out and in, they would add and remove lots of subclusters on demand, and we had to provide the ability to add and remove subclusters in a fast and reliable way. We knew that during off-peak hours, our customers would shut down many of their subclusters, which means more than half of the nodes could be down, and we had to make adjustments to our quorum policy, which requires at least half of the nodes to be up for the database to stay up. We were also aware that customers would add hundreds of nodes in the cluster, which means we had to make adjustments to the catalog and commit policy. To take care of all these three requirements, we introduced two types of subclusters: primary subclusters and secondary subclusters. The primary subcluster is the one that you get by default when you create your first Eon cluster. The nodes in the primary subcluster are always up, that means they stay up and participate in the quorum. The nodes in the primary subcluster are responsible for processing commits, and also maintain a persistent copy of the catalog on disk. This is the subcluster that you would use to process all your ETL jobs, because the Tuple Mover also runs on the nodes in the primary subcluster. If, at this point, you want another subcluster where you would like to run queries, and also scale that cluster up and down depending on the demand or the workload, you would create a new subcluster, and this subcluster will be of type secondary in nature. Now, secondary subclusters have nodes that don't participate in quorum, so if these nodes are down, there is no impact on the Vertica database.
These nodes are also not responsible for processing commit, though they maintain up-to-date copies of the catalog in memory. They don't store catalog on disk. And these are subclusters that you can add and remove very quickly, without impacting what is running on the other subclusters. We have customers running hundreds of nodes, subclusters with hundreds of nodes, and subclusters of size like 64 node, and they can bring this subcluster up and down, or add and remove, within few minutes. So before I go into the sizing of Eon Mode, I just want to say one more thing here. We are working very closely with some of our customers who are running Eon Mode and getting better feedback from that on a regular basis. And based on the feedback, we are making lots of improvements and fixes in every hot-fix that we put out. So if you are running Eon Mode, and want to be part of this group, I suggest that, you keep your cluster current with latest hot-fixes and work with us to give us feedback, and get the improvements that you need to be successful. So let's look at what there-- What we need, to size Eon clusters. Sizing Eon clusters is very different from sizing Enterprise Mode cluster. When you are running Enterprise Mode cluster or when you're sizing Vertica cluster running Enterprise Mode, you need to take into account the amount of data that you want to store, and the configuration of your node. Depending on which you decide, how many nodes you will need, and then start the cluster. In Eon Mode, to size a cluster, you need few things like, what should be your shard count. Now, shard count decides the maximum number of nodes that will participate in your query. And we'll talk about this little bit more in the next slide. You will decide on number of nodes that you will need within a subcluster, the instance type you will pick for running statistic subcluster, and how many subclusters you will need, and how many of them should be running all the time, and how many should be running in a dynamic mode. When it comes to shard count, you have to pick shard count up front, and you can't change it once your database is up and running. So, we... So, you need to pick shard count depending the number of nodes, are the same number of nodes that you will need to process a query. Now one thing that we want to remember here, is this is not amount of data that you have in database, but this is amount of data your queries will process. So, you may have data for six years, but if your queries process last month of data, on most of the occasions, or if your dashboards are processing up to six weeks, or ten minutes, based on whatever your needs are, you will decide or pick the number of shards, shard count and nodes, based on how much data your queries process. Looking at most of our customers, we think that 12 is a good number that should work for most of our customers. And, that means, the maximum number of nodes in a subcluster that will process queries is going to be 12. If you feel that, you need more than 12 nodes to process your query, you can pick other numbers like 24 or 48. If you pick a higher number, like 48, and you go with three nodes in your subcluster, that means node subscribes to 16 primary and 16 secondary shard subscription, which totals to 32 subscriptions per node. That will leave your catalog in a broken state. 
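(Since the shard count is fixed at database creation and determines how subscriptions land on nodes, it is worth checking both. A hedged sketch; the shards system table and its columns are written as I understand them, so verify against your version.)

-- How many segment shards, plus the replica shard, the database has.
SELECT shard_type, COUNT(*) FROM shards GROUP BY shard_type;

-- How evenly shard subscriptions are spread across the nodes.
SELECT node_name, COUNT(*) AS subscriptions
FROM node_subscriptions
GROUP BY node_name
ORDER BY node_name;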
So, pick shard count appropriately, don't pick prime numbers, we suggest 12 should work for most of our customers, if you think you process more than, you know, the regular, the regular number that, or you think that your customers, you think your queries process terabytes of data, then pick a number like 24. Don't pick a prime number. Okay? We are also coming up with features in Vertica like current scaling, that will help you run more-- run queries on more than, more nodes than the number of shards that you pick. And that feature will be coming out soon. So if you have picked a smaller shard count, it's not the end of the story. Now, the next thing is, you need to pick how many nodes you need within your subclusters, to process your query. Ideal number would be node number equal to shard count, or, if you want to pick a number that is less, pick node count which is such that each of the nodes has a balanced distribution of subscriptions. When... So over here, you can have, option where you can have 12 nodes and 12 shards, or you can have two subclusters with 6 nodes and 12 shards. Depending on your workload, you can pick either of the two options. The first option, where you have 12 nodes and 12 shards, is more suitable for, more suitable for batch applications, whereas two subclusters with, with six nodes each, is more suitable for desktop type applications. Picking subclusters is, it depends on your workload, you can add remove nodes relative to isolation, or Elastic Throughput Scaling. Your subclusters can have nodes of different sizes, and you need to make sure that the nodes within the subcluster have to be homogenous. So this is my last slide before I hand over to Sumeet. And this I think is very important slide that I want you to pay attention to. When you pick instance, you are going to pick instance based on workload and query budget. I want to make it clear here that we want you to pay attention to the local disk, because you have depot on your local disk, which is going to be your query performance enhancer for all kinds of deployment, in cloud, as well as on premise. So you'd expect of what you read, or what you heard, depots still play a very important role in every Eon deployment, and they act like performance enhancers. Most of our customers choose Vertica because they love the performance we offer, and we don't want you to compromise on the performance. So pick nodes with some amount of local disk, at least two terabytes is what we suggest. i3 instances in Amazon have, you know, come up with a good local disk that is very helpful, and some of our customers are benefiting from. With that said, I want to pass it over to Sumeet. >> Sumeet: So, hi everyone, my name is Sumeet Keswani, and I'm a Product Technology Engineer at Vertica. I will be discussing the various use cases that customers deploy in Eon Mode. After that, I will go into some technical details of SQL, and then I'll blend that into the best practices, in Eon Mode. And finally, we'll go through some tips and tricks. So let's get started with the use cases. So a very basic use case that users will encounter, when they start Eon Mode the first time, is they will have two subclusters. The first subcluster will be the primary subcluster, used for ETL, like Shirang mentioned. And this subcluster will be mostly on, or always on. And there will be another subcluster used for, purely for queries. And this subcluster is the secondary subcluster and it will be on sometimes. Depending on the use case. 
Maybe from nine to five, or Monday to Friday, depending on what application is running on it, or what users are doing on it. So this is the most basic use case, something users get started with to get their feet wet. Now as the use of the deployment of Eon Mode with subcluster increases, the users will graduate into the second use case. And this is the next level of deployment. In this situation, they still have the primary subcluster which is used for ETL, typically a larger subcluster where there is more heavier ETL running, pretty much non-stop. Then they have the usual query subcluster which will use for queries, but they may add another one, another secondary subcluster for ad-hoc workloads. The motivation for this subcluster is to isolate the unpredictable workload from the predictable workload, so as not to impact certain isolates. So you may have ad-hoc queries, or users that are running larger queries or bad workloads that occur once in a while, from running on a secondary subcluster, on a different secondary subcluster, so as to not impact the more predictable workload running on the first subcluster. Now there is no reason why these two subclusters need to have the same instances, they can have different number of nodes, different instance types, different depot configurations. And everything can be different. Another benefit is, they can be metered differently, they can be costed differently, so that the appropriate user or tenant can be billed the cost of compute. Now as the use increases even further, this is what we see as the final state of a very advanced Eon Mode deployment here. As you see, there is the primary subcluster of course, used for ETL, very heavy ETL, and that's always on. There are numerous secondary subclusters, some for predictable applications that have a very fine-tuned workload that needs a definite performance. There are other subclusters that have different usages, some for ad-hoc queries, others for demanding tenants, there could be still more subclusters for different departments, like Finance, that need it maybe at the end of the quarter. So very, very different applications, and this is the full and final promise of Eon, where there is workload isolation, there is different metering, and each app runs in its own compute space. Okay, so let's talk about a very interesting feature in Eon Mode, which we call Hibernate and Revive. So what is Hibernate? Hibernating a Vertica database is the act of dissociating all the computers on the database, and shutting it down. At this point, you shut down all compute. You still pay for storage, because your data is in the S3 bucket, but all the compute has been shut down, and you do not pay for compute anymore. If you have reserved instances, or any other instances you can use them for different applications, and your Vertica database is shut down. So this is very similar to stop database, in Eon Mode, you're stopping all compute. The benefit of course being that you pay nothing anymore for compute. So what is Revive, then? The Revive is the opposite of Hibernate, where you now associate compute with your S3 bucket or your storage, and start up the database. There is one limitation here that you should be aware of, is that the size of the database that you have during Hibernate, you must revive it the same size. So if you have a 12-node primary subcluster when hibernating, you need to provision 12 nodes in order to revive. 
So one best practice comes down to this, is that you must shrink your database to the smallest size possible before you hibernate, so that you can revive it in the same size, and you don't have to spin up a ton of compute in order to revive. So basically, what this means is, when you have decided to hibernate, we ask you to remove all your secondary subclusters and shrink your primary subcluster down to the bare minimum before you hibernate it. And the benefit being, is when you do revive, you will have, you will be able to do so with the mimimum number of nodes. And of course, before you hibernate, you must cleanly shut down the database, so that all the data can be synced to S3. Finally, let's talk about backups and replication. Backups and replications are still supported in Eon Mode, we sometimes get the question, "We're in S3, and S3 has nine nines of reliability, we need a backup." Yes, we highly recommend backups, you can back-up by using the VBR script, you can back-up your database to another bucket, you can also copy the bucket and revive to a different, revive a different instance of your database. This is very useful because many times people want staging or development databases, and they need some of the data from production, and this is a nice way to get that. And it also makes sure that if you accidentally delete something you will be able to get back your data. Okay, so let's go into best practices now. I will start, let's talk about the depot first, which is the biggest performance enhancer that we see for queries. So, I want to state very clearly that reading from S3, or a remote object store like S3 is very slow, because data has to go over the network, and it's very expensive. You will pay for access cost. This is where S3 is not very cheap, is that every time you access the data, there is an ATI and access cost levied. Now the depot is a performance enhancing feature that will improve the performance of queries by keeping a local cache of the data that is most frequently used. It will also reduce the cost of accessing the data because you no longer have to go to the remote object store to get the data, since it's available on a local and permanent volume. Hence depot shaping is a very important aspect of performance tuning in an Eon database. What we ask you to do is, if you are going to use a specific table or partition frequency, you can choose to pin it, in the depot, so that if your depot is under pressure or is highly utilized, these objects that are most frequently used are kept in the depot. So therefore, depot, depot shaping is the act of setting eviction policies, instead you prevent the eviction of files that you believe you need to keep, so for example, you may keep the most recent year's data or the most recent, recent partition in the depot, and thereby all queries running on those partitions will be faster. At this time, we allow you to pin any table or partition in the depot, but it is not subcluster-based. Future versions of Vertica will allow you fine-tuning the depot based on each subcluster. So, let's now go and understand a little bit of internals of how a SQL query works in Eon Mode. And, once I explain this, we will blend into best practice and it will become much more clearer why we recommend certain things. So, since S3 is our layer of durability, where data is persistent in an Eon database. When you run an insert query, like, insert into table value one, or something similar. Data is synchronously written into S3. 
So, it will control returns back to the client, the copy of the data is first stored in the local depot, and then uploaded to S3. And only then do we hand the control back to the client. This ensures that if something bad were to happen, the data will be persistent. The second, the second types of SQL transactions are what we call DTLs, which are catalog operations. So for example, you create a table, or you added a column. These operations are actually working with metadata. Now, as you may know, S3 does not offer mutable storage, the storage in S3 is immutable. You can never append to a file in S3. And, the way transaction logs work is, they are append operation. So when you modify the metadata, you are actually appending to a transaction log. So this poses an interesting challenge which we resolve by appending to the transaction log locally in the catalog, and then there is a service that syncs the catalog to S3 every five minutes. So this poses an interesting challenge, right. If you were to destroy or delete an instance abruptly, you could lose the commits that happened in the last five minutes. And I'll speak to this more in the subsequent slides. Now, finally let's look at, drops or truncates in Eon. Now a drop or a truncate is really a combination of the first two things that we spoke about, when you drop a table, you are making, a drop operation, you are making a metadata change. You are telling Vertica that this table no longer exists, so we go into the transaction log, and append into the transaction log, that this table has been removed. This log of course, will be synced every five minutes to S3, like we spoke. There is also the secondary operation of deleting all the files that were associated with data in this table. Now these files are on S3. And we can go about deleting them synchronously, but that would take a lot of time. And we do not want to hold up the client for this duration. So at this point, we do not synchronously delete the files, we put the files that need to be removed in a reaper queue. And return the control back to the client. And this has the performance benefit as to the drops appear to occur really fast. This also has a cost benefit, batching deletes, in big batches, is more performant, and less costly. For example, on Amazon, you could delete 1,000 files at a time in a single cost. So if you batched your deletes, you could delete them very quickly. The disadvantage of this is if you were to terminate a Vertica customer abruptly, you could leak files in S3, because the reaper queue would not have had the chance to delete these files. Okay, so let's, let's go into best practices after speaking, after understanding some technical details. So, as I said, reading and writing to S3 is slow and costly. So, the first thing you can do is, avoid as many round trips to S3 as possible. The bigger the batches of data you load, the better. The better performance you get, per commit. The fact thing is, don't read and write from S3 if you can avoid it. A lot of our customers have intermediate data processing which they think temporarily they will transform the data before finally committing it. There is no reason to use regular tables for this kind of intermediate data. We recommend using local temporary tables, and local temporary tables have the benefit of not having to upload data to S3. Finally, there is another optimization you can make. Vertica has the concept of active partitions and inactive partitions. 
Active partitions are the ones where you have recently loaded data, and Vertica is lazy about merging these partitions into a single ROS container. Inactive partitions are historical partitions, like, consider last year's data, or the year before that data. Those partitions are aggressively merging into a single container. And how do we know how many partitions are active and inactive? Well that's based on the configuration parameter. If you load into an inactive partition, Vertica is very aggressive about merging these containers, so we download the entire partition, merge the records that you loaded into it, and upload it back again. This creates a lot of network traffic, and I said, accessing data is, from S3, slow and costly. So we recommend you not load into inactive partitions. You should load into the most recent or active partitions, and if you happen to load into inactive partitions, set your active partition count correctly. Okay, let's talk about the reaper queue. Depending on the velocity of your ETL, you can pile up a lot of files that need to be deleted asynchronously. If you were were to terminate a Vertica customer without allowing enough time for these files to get deleted, you could leak files in S3. Now, of course if you use local temporary tables this problem does not occur because the files were never created in S3, but if you are using regular tables, you must allow Vertica enough time to delete these files, and you can change the interval at which we delete, and how much time we allow to delete and shut down, by exiting some configuration parameters that I have mentioned here. And, yeah. Okay, so let's talk a little bit about a catalog at this point. So, the catalog is synced every five minutes onto S3 for persistence. And, the catalog truncation version is the minimum, minimal viable version of the catalog to which we can revive. So, for instance, if somebody destroyed a Vertica cluster, the entire Vertica cluster, the catalog truncation version is the mimimum viable version that you will be able to revive to. Now, in order to make sure that the catalog truncation version is up to date, you must always shut down your Vertica cluster cleanly. This allows the catalog to be synced to S3. Now here are some SQL commands that you can use to see what the catalog truncation version is on S3. For the most part, you don't have to worry about this if you're shutting down cleanly, so, this is only in cases of disaster or some event where all nodes were terminated, without... without the user's permission. And... And finally let's talk about backups, so one more time, we highly recommend you take backups, you know, S3 is designed for 99.9% availability, so there could be a, maybe an occasional down-time, making sure you have backups will help you if you accidentally drop a table. S3 will not protect you against data that was deleted by accident, so, having a backup helps you there. And why not backup, right, storage is cheap. You can replicate the entire bucket and have that as a backup, or have DR plus, you're running in a different region, which also sources a backup. So, we highly recommend that you make backups. So, so with this I would like to, end my presentation, and we're ready for any questions if you have it. Thank you very much. Thank you very much.
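(To tie a few of these recommendations together, here is a rough sketch in SQL. The object names are invented, and the pinning, active-partition, and Data Collector details are written as I understand them, so verify the exact syntax for your release before using it.)

-- Keep intermediate ETL results in a local temporary table so this scratch
-- data is never written to, or deleted from, communal storage.
CREATE LOCAL TEMPORARY TABLE stage_events ON COMMIT PRESERVE ROWS AS
  SELECT device_id, event_time, reading
  FROM raw_events
  WHERE event_time >= '2020-03-01';

-- Pin a hot table (or one of its partitions) so it is not evicted from the depot.
SELECT SET_DEPOT_PIN_POLICY_TABLE('public.fact_events');

-- If loads must touch older partitions, raise the table's active-partition count
-- rather than repeatedly reorganizing inactive partitions.
ALTER TABLE public.fact_events SET ACTIVEPARTITIONCOUNT 2;

-- Spot-check whether recent reads were served from the depot or went out to
-- communal storage.
SELECT * FROM dc_file_reads ORDER BY time DESC LIMIT 20;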
Tapping Vertica's Integration with TensorFlow for Advanced Machine Learning
>> Paige: Hello, everybody, and thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled "Tapping Vertica's Integration with TensorFlow for Advanced Machine Learning." I'm Paige Roberts, Opensource Relations Manager at Vertica, and I'll be your host for this session. Joining me is Vertica Software Engineer, George Larionov. >> George: Hi. >> Paige: (chuckles) That's George. So, before we begin, I encourage you guys to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. So, as soon as a question occurs to you, go ahead and type it in, and there will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to get to during that time. Any questions we don't get to, we'll do our best to answer offline. Now, alternatively, you can visit Vertica Forum to post your questions there, after the session. Our engineering team is planning to join the forums to keep the conversation going, so you can ask an engineer afterwards, just as if it were a regular conference in person. Also, reminder, you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides. And, before you ask, yes, this virtual session is being recorded, and it will be available to view by the end this week. We'll send you a notification as soon as it's ready. Now, let's get started, over to you, George. >> George: Thank you, Paige. So, I've been introduced. I'm a Software Engineer at Vertica, and today I'm going to be talking about a new feature, Vertica's Integration with TensorFlow. So, first, I'm going to go over what is TensorFlow and what are neural networks. Then, I'm going to talk about why integrating with TensorFlow is a useful feature, and, finally, I am going to talk about the integration itself and give an example. So, as we get started here, what is TensorFlow? TensorFlow is an opensource machine learning library, developed by Google, and it's actually one of many such libraries. And, the whole point of libraries like TensorFlow is to simplify the whole process of working with neural networks, such as creating, training, and using them, so that it's available to everyone, as opposed to just a small subset of researchers. So, neural networks are computing systems that allow us to solve various tasks. Traditionally, computing algorithms were designed completely from the ground up by engineers like me, and we had to manually sift through the data and decide which parts are important for the task and which are not. Neural networks aim to solve this problem, a little bit, by sifting through the data themselves, automatically and finding traits and features which correlate to the right results. So, you can think of it as neural networks learning to solve a specific task by looking through the data without having human beings have to sit and sift through the data themselves. So, there's a couple necessary parts to getting a trained neural model, which is the final goal. By the way, a neural model is the same as a neural network. Those are synonymous. So, first, you need this light blue circle, an untrained neural model, which is pretty easy to get in TensorFlow, and, in edition to that, you need your training data. Now, this involves both training inputs and training labels, and I'll talk about exactly what those two things are on the next slide. 
But, basically, you need to train your model with the training data, and, once it is trained, you can use your trained model to predict on just the purple circle, so new inputs. And, it will predict the labels for you. You don't have to label it anymore. So, a neural network can be thought of as... Training a neural network can be thought of as teaching a person how to do something. For example, if I want to learn to speak a new language, let's say French, I would probably hire some sort of tutor to help me with that task, and I would need a lot of practice constructing and saying sentences in French, and a lot of feedback from my tutor on whether my pronunciation or grammar, et cetera, is correct. And, so, that would take me some time, but, finally, hopefully, I would be able to learn the language and speak it correctly without any sort of feedback. So, in a very similar manner, a neural network needs to practice on example training data first, and, along with that data, it needs labeled data. In this case, the labeled data is kind of analogous to the tutor. It is the correct answers, so that the network can learn what those look like. But, ultimately, the goal is to predict on unlabeled data, which is analogous to me knowing how to speak French. So, I went over most of the bullets. A neural network needs a lot of practice. To do that, it needs a lot of good labeled data, and, finally, since a neural network needs to iterate over the training data many, many times, it needs a powerful machine which can do that in a reasonable amount of time. So, here's a quick checklist of what you need if you have a specific task that you want to solve with a neural network. So, the first thing you need is a powerful machine for training. We discussed why this is important. Then, you need TensorFlow installed on the machine, of course, and you need a dataset and labels for your dataset. Now, this dataset can be hundreds of examples, thousands, sometimes even millions. I won't go into that because the dataset size really depends on the task at hand, but if you have these four things, you can train a good neural network that will predict whatever result you want it to predict at the end. So, we've talked about neural networks and TensorFlow, but the question is, if we already have a lot of built-in machine-learning algorithms in Vertica, then why do we need to use TensorFlow? And, to answer that question, let's look at this dataset. So, this is a pretty simple toy dataset with 20,000 points, but it simulates a more complex dataset with two different classes which are not related in a simple way. So, the existing machine-learning algorithms that Vertica already has mostly fail on this pretty simple dataset. Linear models can't really draw a good line separating the two types of points. Naïve Bayes also performs pretty badly, and even the Random Forest algorithm, which is a pretty powerful algorithm, with 300 trees gets only 80% accuracy. However, a neural network with only two hidden layers gets 99% accuracy in about ten minutes of training. So, I hope that's a pretty compelling reason to use neural networks, at least sometimes. So, as an aside, there are plenty of tasks that do fit the existing machine-learning algorithms in Vertica.
That's why they're there, and if one of the tasks that you want to solve fits one of the existing algorithms, well, then I would recommend using that algorithm, not TensorFlow, because, while neural networks have their place and are very powerful, it's often easier to use an existing algorithm, if possible. Okay, so, now that we've talked about why neural networks are needed, let's talk about integrating them with Vertica. So, neural networks are best trained using GPUs, which are Graphics Processing Units, and a GPU is, basically, just a different processing unit than a CPU. GPUs are good for training neural networks because they excel at doing many, many simple operations at the same time, which is needed for a neural network to be able to iterate through the training data many times. However, Vertica runs on CPUs and cannot run on GPUs at all, because that's not how it was designed. So, to train our neural networks, we have to go outside of Vertica, and exporting a small batch of training data is pretty simple. So, that's not really a problem, but, given this information, why do we even need Vertica? If we train outside, then why not do everything outside of Vertica? So, to answer that question, here is a slide that Philips was nice enough to let us use. This is an example of a production system at Philips. So, it consists of two branches. On the left, we have a branch with historical device log data, and this can kind of be thought of as a bunch of training data. And, all that data goes through some data integration and data analysis. Basically, this is where you train your models, whether or not they are neural networks, but, for the purpose of this talk, this is where you would train your neural network. And, on the right, we have a branch which has live device log data coming in from various MRI machines, CAT scan machines, et cetera, and this is a ton of data. So, these machines are constantly running. They're constantly on, and there's a bunch of them. So, data just keeps streaming in, and, so, we don't want this data to have to take any unnecessary detours, because that would greatly slow down the whole system. So, this data in the right branch goes through an already trained predictive model, which needs to be pretty fast, and, finally, it allows Philips to do some maintenance on these machines before they actually break, which helps Philips, obviously, and definitely the medical industry as well. So, I hope this slide helped explain the complexity of a live production system and why it might not be reasonable to train your neural networks directly in the system with the live device log data. So, a quick summary on just the neural networks section. So, neural networks are powerful, but they need a lot of processing power to train, which can't really be done well in a production pipeline. However, they are cheap and fast to predict with. Prediction with a neural network does not require a GPU anymore. And, they can be very useful in production, so we do want them there. We just don't want to train them there. So, the question is, now, how do we get neural networks into production? So, we have, basically, two options. The first option is to take the data and export it to our machine with TensorFlow, our powerful GPU machine, or we can take our TensorFlow model and put it where the data is. In this case, let's say that that is Vertica. So, I'm going to go through some pros and cons of these two approaches. The first one is bringing the data to the analytics.
The pros of this approach are that TensorFlow is already installed and running on this GPU machine, and we don't have to move the model at all. The cons, however, are that we have to transfer all the data to this machine, and if that data is big, if it's, I don't know, gigabytes, terabytes, et cetera, then that becomes a huge bottleneck, because you can only transfer in small quantities. GPU machines tend to not be that big. Furthermore, TensorFlow prediction doesn't actually need a GPU. So, you would end up paying for an expensive GPU for no reason. It's not parallelized because you just have one GPU machine. You can't put your production system on this GPU machine, as we discussed. And, so, you're left with good results, but not fast and not where you need them. So, now, let's look at the second option. So, the second option is bringing the analytics to the data. So, the pros of this approach are that we can integrate with our production system. It's low impact because prediction is not processor intensive. It's cheap, or, at least, it's pretty much as cheap as your system was before. It's parallelized because Vertica was always parallelized, which we'll talk about in the next slide. There's no extra data movement. You get the benefit of model management in Vertica, meaning, if you import multiple TensorFlow models, you can keep track of their various attributes, when they were imported, et cetera. And, the results are right where you need them, inside your production pipeline. So, the two cons are that TensorFlow is limited to just prediction inside Vertica, and, if you want to retrain your model, you need to do that outside of Vertica and, then, reimport. So, just as a recap of parallelization. Everything in Vertica is parallelized and distributed, and TensorFlow is no exception. So, when you import your TensorFlow model to your Vertica cluster, it gets copied to all the nodes, automatically, and TensorFlow will run in fenced mode, which means that if the TensorFlow process fails for whatever reason, even though it shouldn't, Vertica itself will not crash, which is obviously important. And, finally, prediction happens on each node. There are multiple threads of TensorFlow processes running, processing different little bits of data, which is faster, much faster, than processing the data line by line, because it happens all in a parallelized fashion. And, so, the result is fast prediction. So, here's an example which I hope is a little closer to what everyone is used to than the usual machine learning TensorFlow example. This is the Boston housing dataset, or, rather, a small subset of it. Now, on the left, we have the input data, to go back to, I think, the first slide, and, on the right, are the training labels. So, for the input data, each line is a plot of land in Boston, along with various attributes, such as the level of crime in that area, how much industry is in that area, whether it's on the Charles River, et cetera, and, on the right, we have, as the labels, the median house value for that plot of land. And, so, the goal is to put all this data into the neural network and, finally, get a model which can... which can predict on new incoming data and give a good housing value for that data. Now, I'm going to go through, step by step, how to actually use TensorFlow models in Vertica.
So, the first step I won't go into much detail on, because there are countless tutorials and resources online on how to use TensorFlow to train a neural network, so that's the first step. The second step is to save the model in TensorFlow's 'frozen graph' format. Again, this information is available online. The third step is to create a small, simple JSON file describing the inputs and outputs of the model, and what data type they are, et cetera. And, this is needed for Vertica to be able to translate from TensorFlow land into Vertica SQL land, so that it can use a SQL table instead of the inputs TensorFlow usually takes. So, once you have your model file and your JSON file, you want to put both of those files in a directory on a node, any node, in a Vertica cluster, and name that directory whatever you want your model to ultimately be called inside of Vertica. So, once you do that, you can go ahead and import that directory into Vertica. So, this import models function already exists in Vertica. All we added was a new category to be able to import. So, what you need to do is specify the path to your neural network directory and specify that the category of the model is TensorFlow. Once you successfully import, in order to predict, you run this brand new predict TensorFlow function, so, in this case, we're predicting on everything from the input table, which is what the star means. The model name is Boston housing net, which is the name of your directory, and, then, there's a little bit of boilerplate. And, the two names after the AS, ID and value, are just the names of the columns of your outputs, and, finally, the Boston housing data is whatever SQL table you want to predict on that fits the input type of your network. And, this will output a bunch of predictions. In this case, values of houses that the network thinks are appropriate for all the input data. So, just a quick summary. So, we talked about what TensorFlow is and what neural networks are, and, then, we discussed that TensorFlow works best on GPUs because it needs very specific characteristics. That is, TensorFlow works best for training on GPUs, while Vertica is designed to use CPUs, and it's really good at storing and accessing a lot of data quickly. But, it's not very well designed for having neural networks trained inside of it. Then, we talked about how neural models are powerful, and we want to use them in our production flow. And, since prediction is fast, we can go ahead and do that, but we just don't want to train there, and, finally, I presented Vertica's TensorFlow integration, which allows importing a trained neural model, a trained TensorFlow model, into Vertica and predicting on all the data that is inside Vertica with a few simple lines of SQL. So, thank you for listening. I'm going to take some questions, now.
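To pull those steps together, here is a rough, hedged sketch of what the Vertica side of the workflow can look like in SQL, from exporting a training sample through importing and predicting. The table names, column names, directory paths, and model name are invented for illustration, and the exact parameter spellings (for example, the category value or optional prediction parameters) should be checked against the Vertica TensorFlow integration documentation for your version:

-- 1. Export a labeled training sample out of Vertica so it can be copied
--    to the GPU machine where TensorFlow training happens
--    (illustrative table and column names).
EXPORT TO PARQUET (directory = '/tmp/boston_training_export')
AS SELECT crime_rate, industry_pct, on_charles_river, median_house_value
   FROM boston_housing_training;

-- 2. After training and freezing the model outside Vertica, import the
--    model directory (its name becomes the model name inside Vertica).
SELECT IMPORT_MODELS('/home/dbadmin/boston_housing_net'
                     USING PARAMETERS category = 'TENSORFLOW');

-- 3. Predict on a SQL table whose columns match the model's declared inputs.
SELECT PREDICT_TENSORFLOW(* USING PARAMETERS model_name = 'boston_housing_net')
       OVER(PARTITION BEST) AS (id, value)
FROM boston_housing_data;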
Autonomous Log Monitoring
>> Sue: Hi everybody, thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Autonomous Monitoring Using Machine Learning". My name is Sue LeClaire, director of marketing at Vertica, and I'll be your host for this session. Joining me is Larry Lancaster, founder and CTO at Zebrium. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slide and click submit. There will be a Q&A session at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer them offline. Alternatively, you can also go and visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, just a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available for you to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Larry, over to you. >> Larry: Hey, thanks so much. So hi, my name's Larry Lancaster and I'm here to talk to you today about something whose time, I think, has come, and that's autonomous monitoring. So, with that, let's get into it. So, machine data is my life. I know that's a sad life, but it's true. So I've spent most of my career kind of taking telemetry data from products, either in the field, as we used to call it, or nowadays, deployed, and bringing that data back, like log files and stats, and then building stuff on top of it. So, tools to run the business or services to sell back to users and customers. And so, after doing that a few times, it kind of got to the point where I was really sort of sick of building the same kind of thing from scratch every time, so I figured, why not go start a company and do it so that we don't have to do it manually ever again. So, it's interesting to note, I've put a little sentence here saying, "companies where I got to use Vertica." So I've been actually kind of working with Vertica for a long time now, pretty much since they came out of alpha. And I've really been enjoying their technology ever since. So, our vision is basically that I want a system that will characterize incidents before I notice. So an incident is, you know, we used to call it a support case or a ticket in IT, or a support case in support. Nowadays, you may have a DevOps team, or a set of SREs who are monitoring a production sort of deployment. And so they'll call it an incident. So I'm looking for something that will notice and characterize an incident before I notice and have to go digging into log files and stats to figure out what happened. And so that's a pretty heady goal. And so I'm going to talk a little bit today about how we do that. So, if we look at logs in particular. Logs today, if you look at log monitoring. So monitoring is kind of that whole umbrella term that we use to talk about how we monitor systems in the field that we've shipped, or how we monitor production deployments in a more modern stack. And so basically there are log monitoring tools. But they have a number of drawbacks.
For one thing, they're kind of slow in the sense that if something breaks and I need to go to a log file, actually chances are really good that if you have a new issue, if it's an unknown unknown problem, you're going to end up in a log file. So the problem then becomes basically you're searching around looking for what's the root cause of the incident, right? And so that's kind of time-consuming. So, they're also fragile and this is largely because log data is completely unstructured, right? So there's no formal grammar for a log file. So you have this situation where, if I write a parser today, and that parser is going to do something, it's going to execute some automation, it's going to open or update a ticket, it's going to maybe restart a service, or whatever it is that I want to happen. What'll happen is later upstream, someone who's writing the code that produces that log message, they might do something really useful for me, or for users. And they might go fix a spelling mistake in that log message. And then the next thing you know, all the automation breaks. So it's a very fragile source for automation. And finally, because of that, people will set alerts on, "Oh, well tell me how many thousands of errors are happening every hour." Or some horrible metric like that. And then that becomes the only visibility you have into the data. So because of all this, it's a very human-driven, slow, fragile process. So basically, we've set out to kind of up-level that a bit. So I touched on this already, right? The truth is if you do have an incident, you're going to end up in log files to do root cause. It's almost always the case. And so you have to wonder, if that's the case, why do most people use metrics only for monitoring? And the reason is related to the problems I just described. They're already structured, right? So for logs, you've got this mess of stuff, so you only want to dig in there when you absolutely have to. But ironically, it's where a lot of the information that you need actually is. So we have a model today, and this model used to work pretty well. And that model is called "index and search". And it basically means you treat log files like they're text documents. And so you index them and when there's some issue you have to drill into, then you go searching, right? So let's look at that model. So 20 years ago, we had sort of a shrink-wrap software delivery model. You had an incident. With that incident, maybe you had one customer and you had a monolithic application and a handful of log files. So it's perfectly natural, in fact, usually you could just vi the log file, and search that way. Or if there's a lot of them, you could index them and search them that way. And that all worked very well because the developer or the support engineer had to be an expert in those few things, in those few log files, and understand what they meant. But today, everything has changed completely. So we live in a software as a service world. What that means is, for a given incident, first of all you're going to be affecting thousands of users. You're going to have, potentially, 100 services that are deployed in your environment. You're going to have 1,000 log streams to sift through. And yet, you're still kind of stuck in the situation where to go find out what's the matter, you're going to have to search through the log files. So this is kind of the unacceptable sort of position we're in today. So for us, the future will not be index and search. And that's simply because it cannot scale.
And the reason I say that it can't scale is because it all kind of is bottlenecked by a person and their eyeball. So, you continue to drive up the amount of data that has to be sifted through, the complexity of the stack that has to be understood, and you still, at the end of the day, for MTTR purposes, you still have the same bottleneck, which is the eyeball. So this model, I believe, is fundamentally broken. And that's why, I believe in five years you're going to be in a situation where most monitoring of unknown unknown problems is going to be done autonomously. And those issues will be characterized autonomously because there's no other way it can happen. So now I'm going to talk a little bit about autonomous monitoring itself. So, autonomous monitoring basically means, if you can imagine in a monitoring platform and you watch the monitoring platform, maybe you watch the alerts coming from it or more importantly, you kind of watch the dashboards and try to see if something looks weird. So autonomous monitoring is the notion that the platform should do the watching for you and only let you know when something is going wrong and should kind of give you a window into what happened. So if you look at this example I have on screen, just to take it really slow and absorb the concept of autonomous monitoring. So here in this example, we've stopped the database. And as a result, down below you can see there were a bunch of fallout. This is an Atlassian Stack, so you can imagine you've got a Postgres database. And then you've got sort of Bitbucket, and Confluence, and Jira, and these various other components that need the database operating in order to function. So what this is doing is it's calling out, "Hey, the root cause is the database stopped and here's the symptoms." Now, you might be wondering, so what. I mean I could go write a script to do this sort of thing. Here's what's interesting about this very particular example, and I'll show a couple more examples that are a little more involved. But here's the interesting thing. So, in the software that came up with this incident and opened this incident and put this root cause and symptoms in there, there's no code that knows anything about timestamp formats, severities, Atlassian, Postgres, databases, Bitbucket, Confluence, there's no regexes that talk about starting, stopped, RDBMS, swallowed exception, and so on and so forth. So you might wonder how it's possible then, that something which is completely ignorant of the stack, could come up with this description, which is exactly what a human would have had to do, to figure out what happened. And I'm going to get into how we do that. But that's what autonomous monitoring is about. It's about getting into a set of telemetry from a stack with no prior information, and understanding when something breaks. And I could give you the punchline right now, which is there are fundamental ways that software behaves when it's breaking. And by looking at hundreds of data sets that people have generously allowed us to use containing incidents, we've been able to characterize that and now generalize it to apply it to any new data set and stack. So here's an interesting one right here. So there's a fella, David Gill, he's just a genius in the monitoring space. He's been working with us for the last couple of months. So he said, "You know what I'm going to do, is I'm going to run some chaos experiments." So for those of you who don't know what chaos engineering is, here's the idea. 
So basically, let's say I'm running a Kubernetes cluster and what I'll do is I'll use sort of a chaos injection test, something like Litmus. And basically it will inject issues, it'll break things in my application randomly to see if my monitoring picks it up. And so this is what chaos engineering is built around. It's built around sort of generating lots of random problems and seeing how the stack responds. So in this particular case, David went in and, basically, one of the tests that was presented through Litmus did a pod delete. And so that's going to basically take out some containers that are part of the service layer. And so then you'll see all kinds of things break. And so what you're seeing here, which is interesting, this is why I like to use this example. Because it's actually kind of eye-opening. So the chaos tool itself generates logs. And of course, through Kubernetes, all the log file locations that are on the host, and the container logs, are known. And those are all pulled back to us automatically. So one of the log files we have is actually from the chaos tool that's doing the breaking, right? And so what the tool said here, when it went to determine what the root cause was, was it noticed that there was this process that had these messages happen, initializing deletion lists, selecting a pod to kill, blah blah blah. It's saying that the root cause is the chaos test. And it's absolutely right, that is the root cause. But usually chaos tests don't get picked up themselves. You're supposed to be just kind of picking up the symptoms. But this is what happens when you're able to kind of tease out root cause from symptoms autonomously, is you end up getting a much more meaningful answer, right? So here's another example. So essentially, we collect the log files, but we also have a Prometheus scraper. So if you export Prometheus metrics, we'll scrape those and we'll collect those as well. And so we'll use those for our autonomous monitoring as well. So what you're seeing here is an issue where, I believe this is where we ran something out of disk space. So it opened an incident, but what's also interesting here is, you see that it pulled that metric to say that the spike in this metric was a symptom of this running out of space. So again, there's nothing that knows anything about file system usage, memory, CPU, any of that stuff. There's no actual hard-coded logic anywhere to explain any of this. And so the concept of autonomous monitoring is looking at a stack the way a human being would. If you can imagine how you would walk in and monitor something, how you would think about it. You'd go looking around for rare things. Things that are not normal. And you would look for indicators of breakage, and you would see, do those seem to be correlated in some dimension? That is how the system works. So as I mentioned a moment ago, metrics really do kind of complete the picture for us. We end up in a situation where we have a one-stop shop for incident root cause. So, how does that work? Well, we ingest and we structure the log files. So if we're getting the logs, we'll ingest them and we'll structure them, and I'm going to show a little bit of what that structure looks like and how that goes into the database in a moment. And then of course we ingest and structure the Prometheus metrics. But here, structure really should have an asterisk next to it, because metrics are mostly structured already. They have names.
If you have your own scraper, as opposed to going into the time series Prometheus database and pulling metrics from there, you can keep a lot more metadata about those metrics from the exporter's perspective. So we keep all of that too. Then we do our anomaly detection on both of those sets of data. And then we cross-correlate metrics and log anomalies. And then we create incidents. So this is at a high level, kind of what's happening without any sort of stack-specific logic built in. So we had some exciting recent validation. So MayaData's a pretty big player in the Kubernetes space. Essentially, they do Kubernetes as a managed service. They have tens of thousands of customers that they manage Kubernetes clusters for. And then they're also involved, both in the OpenEBS project, as well as in the Litmus project I mentioned a moment ago. That's their tool for chaos engineering. So they're a pretty big player in the Kubernetes space. So essentially, they said, "Oh okay, let's see if this is real." So what they did was they set up our collectors, which took three minutes in Kubernetes. And then they went and, using Litmus, they reproduced eight incidents that their actual, real-world customers had hit. And they were trying to remember the ones that were the hardest to figure out the root cause of at the time. And we picked up and put in a root cause indicator that was correct in 100% of these incidents, with no training, configuration, or metadata required. So this is kind of what autonomous monitoring is all about. So now I'm going to talk a little bit about how it works. So, like I said, there's no information included or required about the stack. So if you imagine a log file, for example. Now, commonly, over to the left-hand side of every line, there will be some sort of a prefix. And what I mean by that is you'll see like a timestamp, or a severity, and maybe there's a PID, and maybe there's a function name, and maybe there's some other stuff there. So basically those are kind of common data elements for a large portion of the lines in a given log file. But you know, of course, the contents change. So basically today, like if you look at a typical log manager, they'll talk about connectors. And what a connector means is, for an application, it describes the prefix format it'll generate in a log. That means what's the format of the timestamp, and what else is in the prefix. And this lets the tool pick it up. And so if you have an app that doesn't have a connector, you're out of luck. Well, what we do is we learn those prefixes dynamically with machine learning. You do not have to have a connector, right? And what that means is that if you come in with your own application, the system will just work for it from day one. You don't have to have connectors, you don't have to describe the prefix format. That's so yesterday, right? So really what we want to be doing is up-leveling what the system is doing to the point where it's kind of working like a human would. You look at a log line, you know what's a timestamp. You know what's a PID. You know what's a function name. You know where the prefix ends and where the variable parts begin. You know what's a parameter over there in the variable parts. And sometimes you may need to see a couple examples to know what was a variable, but you'll figure it out as quickly as possible, and that's exactly how the system goes about it. As a result, we kind of embrace free-text logs, right?
So if you look at a typical stack, most of the logs generated in a typical stack are usually free-text. Even structured logging typically will have a message attribute, which then inside of it has the free-text message. For us, that's not a bad thing. That's okay. The purpose of a log is to inform people. And so there's no need to go rewrite the whole logging stack just because you want a machine to handle it. They'll figure it out for themselves, right? So, you give us the logs and we'll figure out the grammar, not only for the prefix but also for the variable message part. So I already went into this, but there's more that's usually required for configuring a log manager with alerts. You have to give it keywords. You have to give it application behaviors. You have to tell it some prior knowledge. And of course the problem with all of that is that the most important events that you'll ever see in a log file are the rarest. Those are the ones that are one out of a billion. And so you may not know what's going to be the right keyword in advance to pick up the next breakage, right? So we don't want that information from you. We'll figure that out for ourselves. As the data comes in, essentially we parse it and we categorize it, as I've mentioned. And when I say categorize, what I mean is, if you look at a certain given log file, you'll notice that some of the lines are kind of the same thing. So this one will say "X happened five times" and then maybe a few lines below it'll say "X happened six times" but that's basically the same event type. It's just a different instance of that event type. And it has a different value for one of the parameters, right? So when I say categorization, what I mean is figuring out those unique types, and I'll show an example of that next. Anomaly detection, we do on top of that. So anomaly detection on metrics, in a very sort of time series by time series manner with lots of tunables, is a well-understood problem. So we also do this on the event type occurrences. So you can think of each event type occurring in time as sort of a point process. And then you can develop statistics and distributions on that, and you can do anomaly detection on those. Once we have all of that, we have extracted features, essentially, from metrics and from logs. We do pattern recognition on the correlations across different channels of information, so different event types, different log types, different hosts, different containers, and then of course across to the metrics. Based on all of this cross-correlation, we end up with a root cause identification. So that's essentially, at a high level, how it works. What's interesting, from the perspective of this call particularly, is that incident detection needs relationally structured data. It really does. You need to have all the instances of a certain event type that you've ever seen easily accessible. You need to have the values for a given sort of parameter easily and quickly available, so you can figure out what's the distribution of this over time, how often does this event type happen. You can run analytical queries against that information so that you can quickly, in real-time, do anomaly detection against new data. So here's an example of what this looks like. And this is kind of part of the work that we've done. At the top you see some examples of log lines, right? So that's kind of a snippet, it's three lines out of a log file. And you see one in the middle there that's kind of highlighted with colors, right?
I mean, it's a little messy, but it's not atypical of the log file that you'll see pretty much anywhere. So there, you've got a timestamp, and a severity, and a function name. And then you've got some other information. And then finally, you have the variable part. And that's going to have sort of this checkpoint for memory scrubbers, probably something that's written in English, just so that the person who's reading the log file can understand. And then there's some parameters that are put in, right? So now, if you look at how we structure that, the way it looks is there's going to be three tables that correspond to the three event types that we see above. And so we're going to look at the one that corresponds to the one in the middle. So if we look at that table, there you'll see a table with columns, one for severity, for function name, for time zone, and so on. And date, and PID. And then you see over to the right, with the colored columns, there's the parameters that were pulled out from the variable part of that message. And so they're put in, they're typed, and they're in integer columns. So this is the way structuring needs to work with logs to be able to do efficient and effective anomaly detection. And as far as I know, we're the first people to do this inline. All right, so let's talk now about Vertica and why we take those tables and put them in Vertica. So Vertica really is an MPP column store, but it's more than that, because nowadays when you say "column store", people sort of think, like, for example Cassandra's a column store, whatever, but it's not. Cassandra's not a column store in the sense that Vertica is. So Vertica was kind of built from the ground up to be... So it's the original column store. So back in the C-Store project at Berkeley that Stonebraker was involved in, he said let's explore what kind of efficiencies we can get out of a real columnar database. And what he and his grad students, who started Vertica, found was that they could build a database that gives orders of magnitude better query performance for the kinds of analytics I'm talking about here today, with orders of magnitude less data storage underneath. So building on top of machine data, as I mentioned, is hard, because it doesn't have any defined schemas. But we can use an RDBMS like Vertica once we've structured the data to do the analytics that we need to do. So I talked a little bit about this, but if you think about machine data in general, it's perfectly suited for a columnar store. Because, if you imagine laying out sort of all the attributes of an event type, right? So you can imagine that each occurrence is going to have- So there may be, say, three or four function names that are going to occur for all the instances of a given event type. And so if you were to sort all of those event instances by function name, what you would find is that you have sort of long, million-long runs of the same function name over and over. So what you have, in general, in machine data, is lots and lots of slowly varying attributes, lots of low-cardinality data that's almost completely compressed out when you use a real column store. So you end up with a massive footprint reduction on disk. And that also propagates through the analytical pipeline, because Vertica does late materialization, which means it tries to carry that data through memory with that same efficiency, right?
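To make the structuring and anomaly detection ideas described above a bit more concrete, here is a hedged SQL sketch of what one event-type table and a simple occurrence-count check could look like. The schema, names, and the fixed threshold are purely illustrative of the idea, not Zebrium's actual implementation, and the real statistics are more sophisticated than a simple multiple-of-average rule:

-- One table per discovered event type: prefix fields plus typed parameters
-- pulled from the variable part of the message (illustrative names and types).
CREATE TABLE event_type_1042 (
    event_ts       TIMESTAMPTZ,   -- timestamp parsed from the prefix
    severity       VARCHAR(16),
    function_name  VARCHAR(128),
    pid            INTEGER,
    host           VARCHAR(255),
    param_1        INTEGER,       -- e.g. a count from the message body
    param_2        INTEGER
);

-- Treat event-type occurrences as a point process: count them per hour
-- and flag hours far above that event type's historical average.
-- (all_structured_events stands in for a combined view of occurrences.)
WITH hourly AS (
    SELECT event_type_id,
           DATE_TRUNC('hour', event_ts) AS hour_bucket,
           COUNT(*) AS occurrences
    FROM all_structured_events
    GROUP BY 1, 2
)
SELECT h.event_type_id, h.hour_bucket, h.occurrences
FROM hourly h
JOIN (SELECT event_type_id, AVG(occurrences) AS avg_occ
      FROM hourly
      GROUP BY 1) stats
  ON h.event_type_id = stats.event_type_id
WHERE h.occurrences > 10 * stats.avg_occ;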
So the scale-out architecture, of course, is really suitable for petascale workloads. Also, I should point out, I was going to mention it in another slide or two, but we use the Vertica Eon architecture, and we have had no problems scaling that in the cloud. It's a beautiful sort of rewrite of the entire data layer of Vertica. The performance and flexibility of Eon is just unbelievable. And so I've really been enjoying using it. I was skeptical, you could get a real column store to run in the cloud effectively, but I was completely wrong. So finally, I should mention that if you look at column stores, to me, Vertica is the one that has the full SQL support, it has the ODBC drivers, it has the ACID compliance. Which means I don't need to worry about these things as an application developer. So I'm laying out the reasons that I like to use Vertica. So I touched on this already, but essentially what's amazing is that Vertica Eon is basically using S3 as an object store. And of course, there are other offerings, like the one that Vertica does with pure storage that doesn't use S3. But what I find amazing is how well the system performs using S3 as an object store, and how they manage to keep an actual consistent database. And they do. We've had issues where we've gone and shut down hosts, or hosts have been shut down on us, and we have to restart the database and we don't have any consistency issues. It's unbelievable, the work that they've done. Essentially, another thing that's great about the way it works is you can use the S3 as a shared object store. You can have query nodes kind of querying from that set of files largely independently of the nodes that are writing to them. So you avoid this sort of bottleneck issue where you've got contention over who's writing what, and who's reading what, and so on. So I've found the performance using separate subclusters for our UI and for the ingest has been amazing. Another couple of things that they have is they have a lot of in-database machine learning libraries. There's actually some cool stuff on their GitHub that we've used. One thing that we make a lot of use of is the sequence and time series analytics. For example, in our product, even though we do all of this stuff autonomously, you can also go create alerts for yourself. And one of the kinds of alerts you can do, you can say, "Okay, if this kind of event happens within so much time, and then this kind of an event happens, but not this one," Then you can be alerted. So you can have these kind of sequences that you define of events that would indicate a problem. And we use their sequence analytics for that. So it kind of gives you really good performance on some of these queries where you're wanting to pull out sequences of events from a fact table. And timeseries analytics is really useful if you want to do analytics on the metrics and you want to do gap filling interpolation on that. It's actually really fast in performance. And it's easy to use through SQL. So those are a couple of Vertica extensions that we use. So finally, I would like to encourage everybody, hey, come try us out. Should be up and running in a few minutes if you're using Kubernetes. If not, it's however long it takes you to run an installer. So you can just come to our website, pick it up and try out autonomous monitoring. And I want to thank everybody for your time. And we can open it up for Q and A.
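As a footnote to the sequence and time series analytics Larry mentioned, here is a hedged sketch of Vertica's TIMESERIES clause doing gap filling and interpolation on scraped metrics. The table, column, and metric names are invented for the example, and the exact interpolation options are worth confirming in the Vertica analytics documentation:

-- Resample a scraped metric to one-minute slices per host,
-- interpolating gaps linearly (illustrative table and column names).
SELECT host,
       slice_time,
       TS_FIRST_VALUE(metric_value, 'LINEAR') AS metric_value
FROM prometheus_metrics
WHERE metric_name = 'node_filesystem_avail_bytes'
TIMESERIES slice_time AS '1 minute' OVER (PARTITION BY host ORDER BY scrape_ts);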
End-to-End Security
>> Paige: Hello everybody and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled End-to-End Security in Vertica. I'm Paige Roberts, Open Source Relations Manager at Vertica. I'll be your host for this session. Joining me are Vertica Software Engineers Fenic Fawkes and Chris Morris. Before we begin, I encourage you to submit your questions or comments during the virtual session. You don't have to wait until the end. Just type your question or comment in the question box below the slide as it occurs to you and click submit. There will be a Q&A session at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Also, you can visit the Vertica forums to post your questions there after the session. Our team is planning to join the forums to keep the conversation going, so it'll be just like being at a conference and talking to the engineers after the presentation. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And before you ask, yes, this whole session is being recorded and it will be available to view on-demand this week. We'll send you a notification as soon as it's ready. I think we're ready to get started. Over to you, Fen. >> Fenic: Hi, welcome everyone. My name is Fen. My pronouns are fae/faer and Chris will be presenting the second half, and his pronouns are he/him. So to get started, let's kind of go over what the goals of this presentation are. First off, no two deployments are the same. So we can't give you an exact "here's the right way to secure Vertica," because how your deployment is set up is a factor. But the biggest one is, what is your threat model? So, if you don't know what a threat model is, let's take an example. We're all working from home because of the coronavirus and that introduces certain new risks. Our source code is on our laptops at home, that kind of thing. But really our threat model isn't that people will read our code and copy it, like, over our shoulders. So we've encrypted our hard disks and that kind of thing to make sure that no one can get them. So basically, what we're going to give you are building blocks and you can pick and choose the pieces that you need to secure your Vertica deployment. We hope that this gives you a good foundation for how to secure Vertica. And now, what we're going to talk about. So we're going to start off by going over encryption, just how to secure your data from attackers. And then authentication, which is kind of how to log in. Identity, which is who are you? Authorization, which is now that we know who you are, what can you do? Delegation is about how Vertica talks to other systems. And then auditing and monitoring. So, how do you protect your data in transit? Vertica makes a lot of network connections. Here are the important ones, basically. Clients talk to the Vertica cluster. The Vertica cluster talks to itself. And it can also talk to other Vertica clusters, and it can make connections to a bunch of external services. So first off, let's talk about client-server TLS, which is how you secure data between Vertica and clients. It prevents an attacker from sniffing network traffic and, say, picking out sensitive data. Clients have a way to configure how strict the authentication of the server cert is.
It's called the Client SSLMode and we'll talk about this more in a bit but authentication methods can disable non-TLS connections, which is a pretty cool feature. Okay, so Vertica also makes a lot of network connections within itself. So if Vertica is running behind a strict firewall, you have really good network, both physical and software security, then it's probably not super important that you encrypt all traffic between nodes. But if you're on a public cloud, you can set up AWS' firewall to prevent connections, but if there's a vulnerability in that, then your data's all totally vulnerable. So it's a good idea to set up inter-node encryption in less secure situations. Next, import/export is a good way to move data between clusters. So for instance, say you have an on-premises cluster and you're looking to move to AWS. Import/Export is a great way to move your data from your on-prem cluster to AWS, but that means that the data is going over the open internet. And that is another case where an attacker could try to sniff network traffic and pull out credit card numbers or whatever you have stored in Vertica that's sensitive. So it's a good idea to secure data in that case. And then we also connect to a lot of external services. Kafka, Hadoop, S3 are three of them. Voltage SecureData, which we'll talk about more in a sec, is another. And because of how each service deals with authentication, how to configure your authentication to them differs. So, see our docs. And then I'd like to talk a little bit about where we're going next. Our main goal at this point is making Vertica easier to use. Our first objective was security, was to make sure everything could be secure, so we built relatively low-level building blocks. Now that we've done that, we can identify common use cases and automate them. And that's where our attention is going. Okay, so we've talked about how to secure your data over the network, but what about when it's on disk? There are several different encryption approaches, each depends on kind of what your use case is. RAID controllers and disk encryption are mostly for on-prem clusters and they protect against media theft. They're invisible to Vertica. S3 and GCP are kind of the equivalent in the cloud. They also invisible to Vertica. And then there's field-level encryption, which we accomplish using Voltage SecureData, which is format-preserving encryption. So how does Voltage work? Well, it, the, yeah. It encrypts values to things that look like the same format. So for instance, you can see date of birth encrypted to something that looks like a date of birth but it is not in fact the same thing. You could do cool stuff like with a credit card number, you can encrypt only the first 12 digits, allowing the user to, you know, validate the last four. The benefits of format-preserving encryption are that it doesn't increase database size, you don't need to alter your schema or anything. And because of referential integrity, it means that you can do analytics without unencrypting the data. So again, a little diagram of how you could work Voltage into your use case. And you could even work with Vertica's row and column access policies, which Chris will talk about a bit later, for even more customized access control. Depending on your use case and your Voltage integration. We are enhancing our Voltage integration in several ways in 10.0 and if you're interested in Voltage, you can go see their virtual BDC talk. 
And then again, talking about the roadmap a little, we're working on in-database encryption at rest. What this means is kind of a Vertica solution to encryption at rest that doesn't depend on the platform that you're running on. Encryption at rest is hard. (laughs) Encrypting, say, 10 petabytes of data is a lot of work. And once again, the theme of this talk is everyone has a different key management strategy, a different threat model, so we're working on designing a solution that fits everyone. If you're interested, we'd love to hear from you. Contact us on the Vertica forums. All right, next up we're going to talk a little bit about access control. So first off is how do I prove who I am? How do I log in? So, Vertica has several authentication methods. Which one is best depends on your deployment size and use case. Again, the theme of this talk is that what you should use depends on your use case. You can order authentication methods by priority and origin. So for instance, you can only allow connections from within your internal network, or you can enforce TLS on connections from external networks but relax that for connections from your internal network. That kind of thing. So we have a bunch of built-in authentication methods. They're all password-based. User profiles allow you to set complexity requirements for passwords and you can even reject non-TLS connections, say, or reject certain kinds of connections. These should only be used by small deployments, because if you're a larger deployment, you probably have an LDAP server where you manage users, and rather than duplicating passwords and users, you should use LDAP auth, where Vertica still has to keep track of users, but each user can then use LDAP authentication. So Vertica doesn't store the password at all. The client gives Vertica a username and password and Vertica then asks the LDAP server whether this is a correct username and password. And the benefits of this are, well, manifold, but if, say, you delete a user from LDAP, you don't need to remember to also delete their Vertica credentials. They just won't be able to log in anymore because they're not in LDAP anymore. If you like LDAP but you want something a little bit more secure, Kerberos is a good idea. So similar to LDAP, Vertica doesn't keep track of who's allowed to log in, it just keeps track of the Kerberos credentials, and Vertica never even touches the user's password. Users log in to Kerberos and then they pass Vertica a ticket that says "I can log in." It is more complex to set up, so if you're just getting started with security, LDAP is probably a better option. But Kerberos is, again, a little bit more secure. If you're looking for something that, you know, works well for applications, certificate auth is probably what you want. Rather than hardcoding a password, or storing a password in a script that you use to run an application, you can instead use a certificate. So, if you ever need to change it, you can just replace the certificate on disk and the next time the application starts, it just picks that up and logs in. And then, multi-factor auth is a feature request we've gotten in the past and it's not built in to Vertica, but you can do it using Kerberos. So, security is a whole-application concern and fitting MFA into your workflow is all about fitting it in at the right layer. And we believe that that layer is above Vertica. If you're interested in more about how MFA works and how to set it up, we wrote a blog on how to do it.
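Going back to the LDAP option for a moment, here is a hedged sketch of what setting it up can look like: create an authentication record, point it at an LDAP server, and grant it to everyone. The address range, server URL, and bind parameters are placeholders, and the exact parameter names should be confirmed in the Vertica security documentation for your version:

-- Create an LDAP authentication method for client connections
-- (illustrative address range and LDAP parameters).
CREATE AUTHENTICATION v_ldap METHOD 'ldap' HOST '0.0.0.0/0';
ALTER AUTHENTICATION v_ldap SET
    host = 'ldap://ldap.example.com',
    binddn_prefix = 'cn=',
    binddn_suffix = ',ou=users,dc=example,dc=com';
GRANT AUTHENTICATION v_ldap TO PUBLIC;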
And now, over to Chris, for more on identity and authorization. >> Chris: Thanks, Fen. Hi everyone, I'm Chris. So, we're a Vertica user and we've connected to Vertica, but once we're in the database, who are we? What are we? So in Vertica, the answer to those questions is principals. Users and roles, which are like groups in other systems. Since roles can be enabled and disabled at will and multiple roles can be active, they're a flexible way to use only the privileges you need in the moment. For example here, you've got Alice, who has DBADMIN as a role, and those are some elevated privileges. She probably doesn't want them active all the time, so she can set the role and add them to her identity set. All of this information is stored in the catalog, which is basically Vertica's metadata storage. How do we manage these principals? Well, depends on your use case, right? So, if you're a small organization or maybe only some people or services need Vertica access, the solution is just to manage it with Vertica. You can see some commands here that will let you do that. But what if we're a big organization and we want Vertica to reflect what's in our centralized user management system? Sort of a similar motivating use case as LDAP authentication, right? We want to avoid duplication hassles, we just want to centralize our management. In that case, we can use Vertica's LDAPLink feature. So with LDAPLink, principals are mirrored from LDAP. They're synced in a configurable fashion from LDAP into Vertica's catalog. What this does is it manages creating and dropping users and roles for you and then mapping the users to the roles. Once that's done, you can do any Vertica-specific configuration on the Vertica side. It's important to note that principals created in Vertica this way support multiple forms of authentication, not just LDAP. This is a separate feature from LDAP authentication, and if you created a user via LDAPLink, you could have them use a different form of authentication, Kerberos, for example. Up to you. Now of course this kind of system is pretty mission-critical, right? You want to make sure you get the right roles and the right users and the right mappings in Vertica. So you probably want to test it. And for that, we've got new and improved dry run functionality, from 9.3.1. And what this feature offers you is new metafunctions that let you test various parameters without breaking your real LDAPLink configuration. So you can mess around with parameters and the configuration as much as you want and you can be sure that all of that is strictly isolated from the live system. Everything's separated. And when you use this, you get some really nice output through a Data Collector table. You can see some example output here. It runs the same logic as the real LDAPLink and provides detailed information about what would happen. You can check the documentation for specifics. All right, so we've connected to the database, we know who we are, but now, what can we do? So for any given action, you want to control who can do that, right? So what's the question you have to ask? Sometimes the question is just who are you? It's a simple yes or no question. For example, if I want to upgrade a user, the question I have to ask is, am I the superuser? If I'm the superuser, I can do it, if I'm not, I can't. But sometimes the actions are more complex and the question you have to ask is more complex. Does the principal have the required privileges?
If you're familiar with SQL privileges, there are things like SELECT, INSERT, and Vertica has a few of its own, but the key thing here is that an action can require specific and maybe even multiple privileges on multiple objects. So for example, when selecting from a table, you need USAGE on the schema and SELECT on the table. And there are some other examples here. So where do these privileges come from? Well, if the action requires a privilege, these are the only places privileges can come from. The first source is implicit privileges, which could come from owning the object or from special roles, which we'll talk about in a sec. Explicit privileges are basically the SQL-standard GRANT system. So you can grant privileges to users or roles, and optionally, those users and roles can grant them further downstream. Discretionary access control. So those are explicit, and they come from the user and the active roles, so the whole identity set. And then we've got Vertica-specific inherited privileges, and those come from the schema, and we'll talk about that in a sec as well. So these are the special roles in Vertica. First role, DBADMIN. This isn't the dbadmin user, it's a role. And it has specific elevated privileges. You can check the documentation for those exact privileges, but it's less than the superuser. The PSEUDOSUPERUSER can do anything the real superuser can do, and you can grant this role to whomever you like. The DBDUSER is also a role, despite the name, and it can run Database Designer functions. SYSMONITOR gives you some elevated auditing permissions, and we'll talk about that later as well. And finally, PUBLIC is a role that everyone has all the time, so anything you want to be allowed for everyone, attach to PUBLIC. Imagine this scenario. I've got a really big schema with lots of relations. Those relations might be changing all the time. But for each principal that uses this schema, I want the privileges for all the tables and views there to be roughly the same. Even though the tables and views come and go, an analyst, for example, might need full access to all of them no matter how many there are or what they are at any given time. So to manage this, the first approach I could use is to remember to run grants every time a new table or view is created. And not just me, but everyone using this schema. Not only is it a pain, it's hard to enforce. The second approach is to use schema-inherited privileges. So in Vertica, schema grants can include relational privileges. For example, SELECT or INSERT, which normally don't mean anything for a schema, but they do for a table. If a relation is marked as inheriting, then the schema grants to a principal, for example salespeople, also apply to the relation. And you can see on the diagram here how USAGE and SELECT are granted on the schema, and because the Sales.foo table inherits, SELECT applies there too. So now, instead of lots of GRANT statements from multiple object owners, we only have to run one ALTER SCHEMA statement and three GRANT statements, and from then on, any time you grant or revoke privileges on the schema, to or from a principal, all your new tables and views will get them automatically. So it's dynamically calculated. Now of course, part of setting this up securely is that you want to know what's happened here and what's going on. So to monitor the privileges, there are three system tables which you want to look at. The first is grants, which will show you privileges that are active for you.
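Before getting to those monitoring tables in detail, here is a rough sketch of the inherited-privileges setup just described, again run through vertica-python. The schema, role, and user names are hypothetical, and the exact ALTER SCHEMA clause should be verified against the documentation for your Vertica version.

    import vertica_python

    conn_info = {'host': 'vertica.example.com', 'port': 5433,
                 'user': 'dbadmin', 'password': 'example', 'database': 'vmart'}

    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        cur.execute("CREATE ROLE salespeople")
        # New relations in this schema will inherit the schema's grants.
        cur.execute("ALTER SCHEMA sales DEFAULT INCLUDE SCHEMA PRIVILEGES")
        # One place to manage privileges for every table and view in sales.
        cur.execute("GRANT USAGE ON SCHEMA sales TO salespeople")
        cur.execute("GRANT SELECT ON SCHEMA sales TO salespeople")
        cur.execute("GRANT salespeople TO alice")
        # From now on, new tables and views created in sales pick up SELECT
        # for salespeople automatically; the grants and inherited_privileges
        # system tables can be queried to confirm.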
So, coming back to those system tables: grants shows the privileges that are active for you, that is, your user and active roles, and theirs, and so on down the chain. Grants will show you the explicit privileges, and inherited_privileges will show you the inherited ones. And then there's one more, inheriting_objects, which will show all tables and views that inherit privileges, so that's useful not so much for seeing privileges themselves, but for managing inherited privileges in general. And finally, how do you see all privileges from all these sources, right? In one go, you want to see them together? Well, there's a metafunction added in 9.3.1, get_privileges_description, which, given an object, will sum up all the privileges the current user has on that object. I'll refer you to the documentation for usage and supported types. Now, the problem with SELECT. SELECT lets you see everything or nothing. You can either read the table or you can't. But what if you want some principals to see a subset or a transformed version of the data? So for example, I have a table with personnel data, and different principals, as you can see here, need different access levels to sensitive information, social security numbers. Well, one thing I could do is make a view for each principal. But I could also use access policies, and access policies can do this without introducing any new objects or dependencies. It centralizes your restriction logic and makes it easier to manage. So what do access policies do? Well, we've got row and column access policies. Row access policies will hide rows, and column access policies will transform data in the column, depending on who's doing the SELECTing. So it transforms the data, as we saw on the previous slide, to look as requested. Now, you can only still modify the data if the access policies let you see the raw data. And the implication of this is that when you're crafting access policies, you should only use them to refine access for principals that need read-only access. That is, if you want a principal to be able to modify the data, the access policies you craft should let through the raw data for that principal. So in our previous example, the loader service should be able to see every row, and it should be able to see untransformed data in every column. And as long as that's true, then it can continue to load into this table. All of this is of course monitorable via a system table, in this case access_policy. Check the docs for more information on how to implement these. All right, that's it for access control. Now on to delegation and impersonation. So what's the question here? Well, the question is: who is Vertica? And that might seem like a silly question, but here's what I mean by that. When Vertica's connecting to a downstream service, for example cloud storage, how should Vertica identify itself? Well, most of the time, we do the permissions check ourselves and then we connect as Vertica, like in this diagram here. But sometimes we can do better. And instead of connecting as Vertica, we connect with some kind of upstream user identity. And when we do that, we let the service decide who can do what, so Vertica isn't the only line of defense. And in addition to the defense-in-depth benefit, there are also benefits for auditing, because the external system can see who is really doing something. It's no longer just Vertica showing up in that external service's logs, it's somebody like Alice or Bob trying to do something. One system where this comes into play is with Voltage SecureData. So, let's look at a couple use cases.
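Before those use cases, here is a hedged sketch of the access-policy idea from a moment ago: a column policy that masks a social security number for everyone except an HR role, plus a row policy. The table, column, and role names are made up, and the expressions should be adapted from the Vertica documentation.

    import vertica_python

    conn_info = {'host': 'vertica.example.com', 'port': 5433,
                 'user': 'dbadmin', 'password': 'example', 'database': 'vmart'}

    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        # Column policy: HR sees the raw SSN, everyone else a masked version.
        cur.execute("""
            CREATE ACCESS POLICY ON public.personnel FOR COLUMN ssn
            CASE
                WHEN ENABLED_ROLE('hr_admin') THEN ssn
                ELSE '***-**-' || RIGHT(ssn, 4)
            END
            ENABLE
        """)
        # Row policy: managers see every row, others only their own record.
        cur.execute("""
            CREATE ACCESS POLICY ON public.personnel FOR ROWS
            WHERE ENABLED_ROLE('manager')
               OR employee_username = CURRENT_USER()
            ENABLE
        """)
        # The access_policy system table shows what is in force.
        cur.execute("SELECT * FROM access_policy")
        print(cur.fetchall())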
So the first use case: I'm just encrypting for compliance or anti-theft reasons. In this case, I'll just use one global identity to encrypt or decrypt with Voltage. But imagine another use case: I want to control which users can decrypt which data. Now I'm using Voltage for access control. So in this case, we want to delegate. The solution here is, on the Voltage side, give Voltage users access to appropriate identities, and these identities control encryption for sets of data. A Voltage user can access multiple identities, like groups. Then on the Vertica side, a Vertica user can set their Voltage username and password in a session, and Vertica will talk to Voltage as that Voltage user. So in the diagram here, you can see an example of how this is leveraged so that Alice can decrypt something but Bob cannot. Another place the delegation paradigm shows up is with storage. So Vertica can store and interact with data on non-local file systems, for example HDFS or S3. Sometimes Vertica's storing Vertica-managed data there. For example, in Eon mode, you might store your projections in communal storage in S3. But sometimes, Vertica is interacting with external data. For example, this usually maps to a user storage location on the Vertica side, and it might, on the external storage side, be something like Parquet files on Hadoop. And in that case, it's not really Vertica's data, and we don't want to give Vertica more power than it needs, so let's request the data on behalf of whoever needs it. Let's say I'm an analyst and I want to copy from or export to Parquet, using my own bucket. It's not Vertica's bucket, it's my data. But I want Vertica to manipulate data in it. So the first option I have is to give Vertica as a whole access to the bucket, and that's problematic because in that case, Vertica becomes kind of an AWS god. It can see any bucket any Vertica user might want to push or pull data to or from, any time Vertica wants. So it's not good for the principles of least access and zero trust. And we can do better than that. So in the second option, use an ID and secret key pair for an AWS IAM principal, if you're familiar with IAM, that does have access to the bucket. So I might use my own credentials, the analyst's, or I might use credentials for an AWS role that has even fewer privileges than I do, sort of a restricted subset of my privileges. And then I use that. I set it in Vertica at the session level, and Vertica will use those credentials for the COPY and export commands. And it gives more isolation. Something that's in the works is support for keyless delegation, using assumable IAM roles. So similar benefits to option two here, but also not having to manage keys at the user level. We can do basically the same thing with Hadoop and HDFS, with three different methods. So the first option is Kerberos delegation. I think it's the most secure. If access control is your primary concern here, this will definitely give you the tightest access control. The downside is it requires the most configuration outside of Vertica, with Kerberos and HDFS, but with this, you can really determine which Vertica users can talk to which HDFS locations. Then, you've got secure impersonation. If you've got a highly trusted Vertica userbase, or at least some subset of it is, and you're not worried about them doing things wrong, but auditing on the HDFS side is your primary concern, you can use this option. This diagram here gives you a visual overview of how that works. But I'll refer you to the docs for details.
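Stepping back to the S3 side for a second, here is a rough sketch of that second option, where an analyst sets scoped AWS credentials for just their own session before copying data. The bucket, keys, and parameter names here are illustrative and from memory of the Vertica documentation, so double-check them for your version.

    import vertica_python

    conn_info = {'host': 'vertica.example.com', 'port': 5433,
                 'user': 'analyst', 'password': 'example', 'database': 'vmart'}

    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        # Credentials for a restricted IAM principal that can only reach
        # this one bucket; they apply to this session only.
        cur.execute("ALTER SESSION SET AWSAuth = 'AKIAEXAMPLEKEY:examplesecret'")
        cur.execute("ALTER SESSION SET AWSRegion = 'us-east-1'")
        # Vertica now reads the analyst's own data with the analyst's keys,
        # not with some database-wide AWS identity.
        cur.execute("COPY analyst.trips FROM 's3://analyst-bucket/trips/*' PARQUET")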
And then finally, back on the HDFS side, option three is bringing your own delegation token. It's similar to what we do with AWS. We set something at the session level, so it's very flexible. The user can do it on an ad hoc basis, but it is manual, so that's the third option. Now on to auditing and monitoring. So of course, we want to know: what's happening in our database? It's important in general and important for incident response, of course. So your first stop to answer this question should be system tables. They're a collection of information about events, system state, performance, et cetera. They're SELECT-only tables, but they work in queries as usual; the data is just loaded differently. So there are two types, generally. There are metadata tables, which store persistent information, or rather reflect persistent information stored in the catalog, for example users or schemata. Then there are monitoring tables, which reflect more transient information, like events and system resources. Here you can see an example of output from the resource pools system table; even though it looks like system statistics, these are actually configurable parameters for resource usage. If you're interested in resource pools, a way to handle resource allocation for users and other principals, again, check that out in the docs. Then of course, there's the follow-up question: who can see all of this? Well, some system information is sensitive and we should only show it to those who need it. Principle of least privilege, right? So of course the superuser can see everything, but what about non-superusers? How do we give access to people that might need additional information about the system without giving them too much power? One option is SYSMONITOR; as I mentioned before, it's a special role. And this role can always read system tables, but not change things the way a superuser would be able to. Just reading. Another option is the RESTRICT and RELEASE metafunctions. Those grant and revoke access to a certain set of system tables, to and from the PUBLIC role. But the downside of those approaches is that they're inflexible. They're all or nothing, for a specific preset of tables, and you can't really configure it per table. So if you're willing to do a little more setup, then I'd recommend using your own grants and roles. System tables support GRANT and REVOKE statements just like any regular relations. And in that case, I wouldn't even bother with SYSMONITOR or the metafunctions. So to do this, just grant whatever privileges you see fit to roles that you create. Then go ahead and grant those roles to the users that you want. And revoke access to the system tables of your choice from PUBLIC. If you need even finer-grained access than this, you can create views on top of system tables. For example, you can create a view on top of the users system table which only shows the current user's information, using a built-in function as part of the view definition. And then, you can actually grant this view to PUBLIC, so that each user in Vertica can see their own information, and you never give access to the users system table as a whole, just that view. Now if you're a superuser, or if you have direct access to nodes in the cluster, filesystem/OS, et cetera, then you have more ways to see events. Vertica supports various methods of logging.
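Before looking at those logging methods, here is a small sketch of the finer-grained approach just described: a per-user view over the users system table, granted instead of the table itself. CURRENT_USER is the built-in function doing the filtering, while the connection details and view name are placeholders.

    import vertica_python

    conn_info = {'host': 'vertica.example.com', 'port': 5433,
                 'user': 'dbadmin', 'password': 'example', 'database': 'vmart'}

    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        # A view that only ever shows the caller's own row.
        cur.execute("""
            CREATE VIEW public.my_user_info AS
            SELECT *
            FROM v_catalog.users
            WHERE user_name = CURRENT_USER()
        """)
        # Everyone can read the view, while access to the full system
        # table stays restricted to the roles you choose.
        cur.execute("GRANT SELECT ON public.my_user_info TO PUBLIC")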
You can see a few methods here which generally sit outside of running Vertica; you'd interact with them in a different way, with the exception of active events, which is a system table. We've also got the data collector, which sorts events by subject, or component, as it's called in the documentation. What the data collector does is extend the logging and system table functionality, and it logs these events and information to rotating files. For example, AnalyzeStatistics is a function that users might run, and as a database administrator, you might want to monitor that, so you can use the data collector component for AnalyzeStatistics. The files that these create can also be exported into a monitoring database. One example of that is the Management Console Extended Monitoring. So check out their virtual BDC talk, the one on the Management Console. And that's it for the key points of security in Vertica. Many of these slides could spawn a talk on their own, so we encourage you to check out our blog, check out the documentation and the forum for further investigation and collaboration. Hopefully the information we provided today will inform your choices in securing your deployment of Vertica. Thanks for your time today. That concludes our presentation. Now, we're ready for Q&A.
UNLIST TILL 4/2 - Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives
>> Sue: Hello everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Tom Wall, a member of the Vertica engineering team. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't get to, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So let's get started. Tom, over to you. >> Tom: Hello everyone and thanks for joining us today for this talk. My name is Tom Wall and I am the leader of Vertica's ecosystem engineering team. We are the team that focuses on building out all the developer tools and third party integrations that enable the software ecosystem that surrounds Vertica to thrive. So today, we'll be talking about some of our new open source initiatives and how those can be really effective for you and make things easier for you to build and integrate Vertica with the rest of your technology stack. We've got several new libraries, integration projects and examples, all open source, to share, all being built out in the open on our GitHub page. Whether you use these open source projects or not, this is a very exciting new effort that will really help to grow the developer community and enable lots of exciting new use cases. So, every developer out there has probably had to deal with a problem like this. You have some business requirements, to maybe build some new Vertica-powered application. Maybe you have to build some new system to visualize some data that's managed by Vertica. In various circumstances, lots of choices might be made for you that constrain your approach to solving a particular problem. These requirements can come from all different places. Maybe your solution has to work with a specific visualization tool, or web framework, because the business has already invested in the licensing and the tooling to use it. Maybe it has to be implemented in a specific programming language, since that's what all the developers on the team know how to write code with. While Vertica has many different integrations with lots of different programming languages and systems, there's a lot of them out there, and we don't have integrations for all of them. So how do you make ends meet when you don't have all the tools you need? You have to get creative, using tools like PyODBC, for example, to bridge between programming languages and frameworks to solve the problems you need to solve. Most languages do have an ODBC-based database interface. ODBC is our C library, and most programming languages know how to call C code, somehow.
So that's doable, but it often requires lots of configuration and troubleshooting to make all those moving parts work well together. So that's enough to get the job done, but native integrations are usually a lot smoother and easier. So rather than, for example, fighting with PyODBC in Python to configure things, get Unicode working, and compile all the different pieces the right way to make it all work smoothly, it would be much better if you could just pip install a library and get to work. And with Vertica-Python, a new Python client library, you can actually do that. So that story, I assume, probably sounds pretty familiar to you. It probably sounds familiar to a lot of the audience here because we're all using Vertica. And our challenge, as Big Data practitioners, is to make sense of all this stuff, despite those technical and non-technical hurdles. Vertica powers lots of different businesses and use cases across all kinds of different industries and verticals. While there's a lot that's different about us, we're all here together right now for this talk because we do have some things in common. We're all using Vertica, and we're probably also using Vertica with other systems and tools too, because it's important to use the right tool for the right job. That's a founding principle of Vertica and it's true today too. In this constantly changing technology landscape, we need lots of good tools and well established patterns, approaches, and advice on how to combine them so that we can be successful doing our jobs. Luckily for us, Vertica has been designed to be easy to build with and extend in this fashion. Databases as a whole have had this goal from the very beginning. They solve the hard problems of managing data so that you don't have to worry about it. Instead of worrying about those hard problems, you can focus on what matters most to you and your domain. So implementing that business logic, solving that problem, without having to worry about all of these intense details about what it takes to manage a database at scale. With the declarative syntax of SQL, you tell Vertica what the answer is that you want. You don't tell Vertica how to get it. Vertica will figure out the right way to do it for you so that you don't have to worry about it. So this SQL abstraction is very nice because it's a well defined boundary where lots of developers know SQL, and it allows you to express what you need without having to worry about those details. So we can be the experts in data management while you worry about your problems. This goes beyond, though, what's accessible through SQL to Vertica. We've got well defined extension and integration points across the product that allow you to customize this experience even further. So if you want to do things like write your own SQL functions, or extend the database software with UDXs, you can do so. If you have a custom data format that might be a proprietary format, or some source system that Vertica doesn't natively support, we have extension points that allow you to use those. These make it very easy to do massive, parallel data movement, loading into Vertica but also exporting from Vertica to send data to other systems. And with these new features, over time we can also do the same kinds of things with Machine Learning models, importing and exporting to tools like TensorFlow.
And it's these integration points that have enabled Vertica to build out this open architecture and a rich ecosystem of tools, both open source and closed source, of different varieties that solve all the different problems that are common in this big data processing world. Whether it's open source streaming systems like Kafka or Spark, or more traditional ETL tools on the loading side, but also BI tools and visualizers and things like that to view and use the data that you keep in your database on the right side. And then of course, Vertica needs to be flexible enough to be able to run anywhere. So you can really take Vertica and use it the way you want it to solve the problems that you need to solve. So Vertica has always employed open standards, and integrated with all kinds of different open source systems. What we're really excited to talk about now is that we are taking our new integration projects and making those open source too. In particular, we've got two new open source client libraries that allow you to build Vertica applications for Python and Go. These libraries act as a foundation for all kinds of interesting applications and tools. Upon those libraries, we've also built some integrations ourselves. And we're using these new libraries to power some new integrations with some third party products. Finally, we've got lots of new examples and reference implementations out on our GitHub page that can show you how to combine all these moving parts in exciting ways to solve new problems. And the code for all these things is available now on our GitHub page. And so you can use it however you like, and even help us make it better too. So the first such project that we have is called Vertica-Python. Vertica-Python began at our customer, Uber. And then in late 2018, we collaborated with them and we took it over and made Vertica-Python the first official open source client for Vertica. You can use this to build your own Python applications, or you can use it via tools that were written in Python. Python has grown a lot in recent years and it's a very common language to solve lots of different problems and use cases in the Big Data space, from things like DevOps automation and Data Science or Machine Learning, to just homegrown applications. We use Python a lot internally for our own QA testing and automation needs. And with the Python 2 end of life that happened at the end of 2019, it was important that we had a robust Python solution to help migrate our internal stuff off of Python 2. And also to provide a nice migration path for all of you, our users, that might be worried about the same problems with your own Python code. So Vertica-Python is used already for lots of different tools, including Vertica's admintools, now starting with 9.3.1. It was also used by DataDog to build a Vertica-DataDog integration that allows you to monitor your Vertica infrastructure within DataDog. So here's a little example of how you might use the Python client to do some work. So here we open a connection, we run a query to find out what node we've connected to, and then we do a little data load by running a COPY statement. And this is designed to have a familiar look and feel if you've ever used a Python database client before. So we implement the DB API 2.0 standard and it feels like a Python package. So that includes things like being part of the centralized package manager, so you can just pip install this right now and go start using it. We also have our client for Golang.
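The slide example described here isn't reproduced in the transcript, but a minimal vertica-python sketch along the same lines might look like the following; the connection details, table, and CSV file are placeholders.

    import vertica_python

    conn_info = {'host': 'vertica.example.com', 'port': 5433,
                 'user': 'dbadmin', 'password': 'example', 'database': 'vmart'}

    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()

        # Which node did we land on?
        cur.execute("SELECT node_name FROM current_session")
        print(cur.fetchone())

        # A small data load via COPY ... FROM STDIN.
        cur.execute("CREATE TABLE IF NOT EXISTS public.events (id INT, label VARCHAR(64))")
        with open('events.csv', 'rb') as fh:
            cur.copy("COPY public.events (id, label) FROM STDIN DELIMITER ','", fh)
        conn.commit()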
So the Go client is called vertica-sql-go. And this is a very similar story, just in a different context, for a different programming language. So vertica-sql-go began as a collaboration with the Micro Focus SecOps Group, who build Micro Focus's security products, some of which use Vertica internally to provide some of those analytics. So you can use this to build your own apps in the Go programming language, but you can also use it via tools that are written in Go. So most notably, we have our Grafana integration, which we'll talk a little bit more about later, that leverages this new client to provide Grafana visualizations for Vertica data. And Go is another programming language rising in popularity, 'cause it offers an interesting balance of different programming design trade-offs. So it's got good performance, good concurrency, and memory safety. And we liked all those things and we're using it to power some internal monitoring stuff of our own. And here's an example of the code you can write with this client. So this is Go code that does a similar thing. It opens a connection, it runs a little test query, and then it iterates over those rows, processing them using Go data types. You get that native look and feel just like you do in Python, except this time in the Go language. And you can go get it the way you usually package things with Go, by running that command there to acquire this package. And it's important to note here, for these projects, we're really doing open source development. We're not just putting code out on our GitHub page. So if you go out there and look, you can see that you can ask questions, you can report bugs, you can submit pull requests yourselves, and you can collaborate directly with our engineering team and other Vertica users out on our GitHub page. Because it's out on our GitHub page, it allows us to be a little bit faster with the way we ship and deliver functionality compared to the core Vertica release cycle. So in 2019, for example, as we were building features to prepare for the Python 3 migration, we shipped 11 different releases with 40 customer reported issues filed on GitHub. That was done over 78 different pull requests and with lots of community engagement as we did so. So lots of people are using this already, as our GitHub badges show, with about 5,000 downloads a day of people using it in their software. And again, we want to make this easy, not just to use but also to contribute to, understand, and collaborate with us on. So all these projects are built using the Apache 2.0 license. The master branch is always available and stable with the latest functionality. And you can always build it and test it the way we do, so that it's easy for you to understand how it works and to submit contributions or bug fixes or even features. It uses automated testing, both locally and with pull requests. And for vertica-python, it's fully automated with Travis CI. So we're really excited about doing this and we're really excited about where it can go in the future. 'Cause this offers some exciting opportunities for us to collaborate with you more directly than we have ever before. You can contribute improvements and help us guide the direction of these projects, but you can also work with each other to share knowledge and implementation details and various best practices. And so maybe you think, "Well, I don't use Python, I don't use Go, so maybe it doesn't matter to me." But I would argue it really does matter.
Because even if you don't use these tools and languages, there are lots of amazing Vertica developers out there who do. And these clients do act as low level building blocks for all kinds of different interesting tools, both in these Python and Go worlds, but also well beyond that. Because these implementations and examples really generalize to lots of different use cases. And we're going to do a deeper dive now into some of these to understand exactly how that's the case and what you can do with these things. So let's take a deeper look at some of the details of what it takes to build one of these open source client libraries. So these database client interfaces, what are they exactly? Well, we all know SQL, but if you look at what SQL specifies, it really only talks about how to manipulate the data within the database. So once you're connected and in, you can run commands with SQL. But these database client interfaces address the rest of those needs. So what does the programmer need to do to actually process those SQL queries? So these interfaces are specific to a particular language or a technology stack. But the use cases and the architectures and design patterns are largely the same between different languages. They all have a need to do some networking and connect and authenticate and create a session. They all need to be able to run queries and load some data and deal with problems and errors. And then they also have a lot of metadata and type mapping, because you want to use these clients the way you use those programming languages, which might be different than the way that Vertica's data types and Vertica's semantics work. So some of these client interfaces are truly standards. And they are robust enough in terms of what they design and call for to support a truly pluggable driver model. Where you might write an application that codes directly against the standard interface, and you can then plug in a different database driver, like a JDBC driver, to have that application work with any database that has a JDBC driver. So most of these interfaces aren't as robust as JDBC or ODBC, but that's okay. 'Cause as good as a standard is, every database is unique for a reason. And so you can't really expose all of those unique properties of a database through these standard interfaces. So Vertica's unique in that it can scale to the petabytes and beyond. And you can run it anywhere in any environment, whether it's on-prem or on clouds. So surely there's something about Vertica that's unique, and we want to be able to take advantage of that fact in our solutions. So even though these standards might not cover everything, there's often a need, and common patterns arise to solve these problems in similar ways. When there isn't enough of a standard to define those common semantics that different databases might have, what you often see is tools will invent plug-in layers or glue code to compensate, by defining an application-wide standard to cover some of these same semantics. Later on, we'll get into some of those details and show off what exactly that means. So if you connect to a Vertica database, what's actually happening under the covers? You have an application, you have a need to run some queries, so what does that actually look like? Well, probably as you would imagine, your application is going to invoke some API calls in some client library or tool.
This library takes those API calls and implements them, usually by issuing some networking protocol operations, communicating over the network to ask Vertica to do the heavy lifting required for that particular API call. And so these APIs usually do the same kinds of things, although some of the details might differ between these different interfaces. But you do things like establish a connection, run a query, iterate over your rows, manage your transactions, that sort of thing. Here's an example from vertica-python, which just goes into some of the details of what actually happens during the Connect API call. And you can see all these details in our GitHub implementation of this. There's actually a lot of moving parts in what happens during a connection. So let's walk through some of that and see what actually goes on. I might have my API call like this where I say connect and I give it a DNS name, which is my entire cluster. And I give it my connection details, my username and password. And I tell the Python client to get me a session, give me a connection so I can start doing some work. Well, in order to implement this, what needs to happen? First, we need to do some TCP networking to establish our connection. So we need to understand what the request is, where you're going to connect to and why, by parsing the connection string. And Vertica being a distributed system, we want to provide high availability, so we might need to do some DNS look-ups to resolve that DNS name, which might be an entire cluster and not just a single machine. So that you don't have to change your connection string every time you add or remove nodes from the database. So we do some high availability and DNS lookup stuff. And then once we connect, we might do load balancing too, to balance the connections across the different initiator nodes in the cluster, or in a sub cluster, as needed. Once we land on the node we want to be at, we might do some TLS to secure our connections. And Vertica supports the industry standard TLS protocols, so this looks pretty familiar for everyone who's used TLS anywhere before. So you're going to do a certificate exchange, and the client might send the server a certificate too, and then you're going to verify that the server is who it says it is, so that you can know that you trust it. Once you've established that connection and secured it, then you can start actually beginning to request a session within Vertica. So you're going to send over your user information like, "Here's my username, here's the database I want to connect to." You might send some information about your application, like a session label, so that you can differentiate, on the database side with monitoring queries, what the different connections are and what their purpose is. And then you might also send over some session settings to do things like autocommit, to change the state of your session for the duration of this connection. So that you don't have to remember to do that with every query that you have. Once you've asked Vertica for a session, before Vertica will give you one, it has to authenticate you. And Vertica has lots of different authentication mechanisms. So there's a negotiation that happens there to decide how to authenticate you. Vertica decides based on who you are and where you're coming from on the network. And then you'll do an auth-specific exchange, depending on what the auth mechanism calls for, until you are authenticated.
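As a side note, most of the connection behaviors just walked through surface as ordinary options on vertica_python.connect. Here is a hedged sketch; the hosts and credentials are placeholders, and option names like connection_load_balance and backup_server_node reflect the client's documented options, so verify them against the version you install.

    import vertica_python

    conn_info = {
        'host': 'vertica-node1.example.com',
        'port': 5433,
        'user': 'alice',
        'password': 'example',
        'database': 'vmart',
        # Let the server redirect us to a less-loaded initiator node.
        'connection_load_balance': True,
        # Fallback addresses if the first host is down.
        'backup_server_node': ['vertica-node2.example.com', 'vertica-node3.example.com'],
        # Shows up in monitoring queries so DBAs can tell sessions apart.
        'session_label': 'nightly-report',
        # Session settings negotiated at connect time.
        'autocommit': True,
        # TLS for the wire; an ssl.SSLContext can be passed instead for
        # certificate verification.
        'ssl': True,
    }

    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        cur.execute("SELECT version()")
        print(cur.fetchone()[0])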
Finally, Vertica trusts you and lets you in, so you're going to establish a session in Vertica, and you might do some note-keeping on the client side just to know what happened. So you might log some information, you might record what the version of the database is, you might do some protocol feature negotiation. So if you connect to a version of the database that doesn't support all these protocol features, you might decide to turn some functionality off, and that sort of thing. But finally, after all that, you can return from this API call and then your connection is good to go. So that connection is just one example of many different APIs. And we're excited here because with vertica-python we're really opening up the Vertica client wire protocol for the first time. And so if you're a low level Vertica developer and you might have used Postgres before, you might know that some of Vertica's client protocol is derived from Postgres. But they do differ in many significant ways. And this is the first time we've ever revealed those details about how it works and why. So not all Postgres protocol features work with Vertica, because Vertica doesn't support all the features that Postgres does. Postgres, for example, has a large object interface that allows you to stream very wide data values over. Whereas Vertica doesn't really have very wide data values; you have long varchars, but that's about as wide as you can get. Similarly, the Vertica protocol supports lots of features not present in Postgres. So load balancing, for example, which we just went through an example of; Postgres is a single node system, it doesn't really make sense for Postgres to have load balancing. But load balancing is really important for Vertica because it is a distributed system. Vertica-python serves as an open reference implementation of this protocol, with all kinds of new details and extension points that we haven't revealed before. So if you look at these boxes below, all these different things are new protocol features that we've implemented since August 2019, out in the open on our GitHub page for Python. Now, the vertica-sql-go implementation of these things is still in progress, but the core protocols are there for basic query operations. There's more to do there, but we'll get there soon. So this is really cool, 'cause not only do you now have a Python client implementation, and a Go client implementation of this, but you can use this protocol reference to do lots of other things too. The obvious thing you could do is build more clients for other languages. So if you have a need for a client in some other language that Vertica doesn't support yet, now you have everything available to solve that problem and to go about doing so if you need to. But beyond clients, it's also useful for other things. So you might use it for mocking and testing things. So rather than connecting to a real Vertica database, you can simulate some of that. You can also use it to do things like query routing and proxies. So Uber, for example, this blog here in this link tells a great story of how they route different queries to different Vertica clusters by intercepting these protocol messages, parsing the queries in them, and deciding which clusters to send them to. So a lot of these things are just ideas today, but now that you have the source code, there's no limit in sight to what you can do with this thing.
And so we're very interested in hearing your ideas and requests, and we're happy to offer advice and collaborate on building some of these things together. So let's take a look now at some of the things we've already built that do these things. So here's a picture of Vertica's Grafana connector, with some data powered from an example that we have in this blog link here. So this has an internet of things use case to it, where we have lots of different sensors recording flight data, feeding into Kafka, which then gets loaded into Vertica. And then finally, it gets visualized nicely here with Grafana. And Grafana's visualizations make it really easy to analyze the data with your eyes and see when something happens. So in these highlighted sections here, you notice a drop in some of the activity; that's probably a problem worth looking into. It might be a lot harder to see that just by staring at a large table yourself. So how does a picture like that get generated with a tool like Grafana? Well, Grafana specializes in visualizing time series data. And time can be really tricky for computers to handle correctly. You've got time zones, daylight savings, leap seconds, negative infinity timestamps (please don't ever use those). And as if it wasn't hard enough just with those problems, what makes it harder is that every system does it slightly differently. So if you're querying some time data, how do we deal with these semantic differences as we cross these domain boundaries, from Vertica to Grafana's back end, which is implemented in Go, and its front end, which is implemented with JavaScript? Well, you read this from the bottom up in terms of the processing. First, you select the timestamp, and Vertica's timestamp has to be converted to a Go time object. And we have to reconcile the differences that there might be as we translate it. So Go time has a different time zone specifier format, and it also supports nanosecond precision, while Vertica only supports microsecond precision. So that's not too big of a deal when you're querying data, because you just see some extra zeros on the fractional seconds. But on the way in, if we're loading data, we have to find a way to resolve those things. Once it's in the Go process, it has to be converted further to render in the JavaScript UI. So there, the Go time object has to be converted to a JavaScript AngularJS Date object. And there too, we have to reconcile those differences. So a lot of these differences might just be presentation, and not so much the actual data changing, but you might want to choose to render the date in a more human readable format, like we've done in this example here. Here's another picture. This is another picture of some time series data, and this one shows you can actually write your own queries with Grafana to provide answers. So if you look closely here, you can see there are actually some functions that might not look too familiar to you if you know Vertica's functions. Vertica doesn't have a dollar underscore underscore time function or a time filter function. So what's actually happening there? How does this actually provide an answer if it's not really real Vertica syntax? Well, it's not sufficient to just know how to manipulate data, it's also really important that you know how to operate with metadata. So information about how the data works in the data source, Vertica in this case.
So Grafana needs to know how time works in detail for each data source, beyond doing that basic I/O that we just saw in the previous example. So it needs to know: how do you connect to the data source to get some time data? How do you know what time data types and functions there are and how they behave? How do you generate a query that references a time literal? And finally, once you've figured out how to do all that, how do you find the time in the database? How do you know which tables have time columns that might be worth rendering in this kind of UI? So Go's database standard doesn't actually really offer many metadata interfaces. Nevertheless, Grafana needs to know those answers. And so it has its own plugin layer that provides a standardizing layer, whereby every data source can implement the hints and metadata customization needed to have an extensible data source back end. So we have another open source project, the Vertica-Grafana data source, which is a plugin that uses Grafana's extension points, with JavaScript for the front end plugin and with Go for the back end plugin, to provide Vertica connectivity inside Grafana. So the way this works is that the plugin framework defines those standardized functions, like time and time filter, and it's our plugin that rewrites them in terms of Vertica syntax. So in this example, time gets rewritten to a Vertica cast, and time filter becomes a BETWEEN predicate. So that's one example of how you can use Grafana, but also how you might build any arbitrary visualization tool that works with data in Vertica. So let's now look at some other examples and reference architectures that we have out on our GitHub page. For some advanced integrations, there's clearly a need to go beyond these standards. So SQL and the surrounding standards, like JDBC and ODBC, were really critical in the early days of Vertica, because they really enabled a lot of generic database tools. And those will always continue to play a really important role, but the Big Data technology space moves a lot faster than these old database standards can keep up with. So there's all kinds of new advanced analytics and query pushdown logic that were never possible 10 or 20 years ago, that Vertica can do natively. There's also all kinds of data-oriented application workflows doing things like streaming data, or parallel loading, or Machine Learning. And all of these things we need to build software with, but we don't really have standards to go by. So what do we do there? Well, open source implementations make for easier integrations and applications all over the place. So even if you're not using Grafana, for example, other tools have similar challenges that you need to overcome. And it helps to have an example there to show you how to do it. Take Machine Learning, for example. There have been many excellent Machine Learning tools that have arisen over the years to make data science and the task of Machine Learning a lot easier. And a lot of those have basic database connectivity, but they generally only treat the database as a source of data. So they do lots of data I/O to extract data from a database like Vertica for processing in some other engine. We all know that's not the most efficient way to do it. It's much better if you can leverage Vertica's scale and bring the processing to the data. So a lot of these tools don't take full advantage of Vertica, because there's not really a uniform way to go do so with these standards.
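Back on the Grafana side for a moment, here is a toy Python illustration of the kind of macro rewriting the talk describes, turning Grafana-style time macros into Vertica SQL. This is not the plugin's actual code, and the macro names and output forms are simplified assumptions.

    import re

    def rewrite_grafana_macros(sql, time_from, time_to):
        """Toy rewrite of $__time(...) and $__timeFilter(...) macros."""
        # $__time(col) becomes a cast Grafana can treat as the time axis.
        sql = re.sub(r'\$__time\(([^)]+)\)',
                     r'\1::TIMESTAMP AS "time"', sql)
        # $__timeFilter(col) becomes a BETWEEN predicate over the dashboard range.
        sql = re.sub(r'\$__timeFilter\(([^)]+)\)',
                     r"\1 BETWEEN '%s' AND '%s'" % (time_from, time_to), sql)
        return sql

    query = """
        SELECT $__time(event_ts), AVG(altitude)
        FROM flight_metrics
        WHERE $__timeFilter(event_ts)
        GROUP BY 1 ORDER BY 1
    """
    print(rewrite_grafana_macros(query,
                                 '2020-03-30 00:00:00', '2020-03-31 00:00:00'))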
So for the Machine Learning case, we have a project called vertica-ml-python instead. And this serves as a reference architecture for how you can do scalable machine learning with Vertica. So this project establishes a familiar machine learning workflow that scales with Vertica. It feels similar to a scikit-learn project, except all the processing and aggregation and heavy lifting and data processing happens in Vertica. So this makes for a much more lightweight, scalable approach than you might otherwise be used to. So you can probably use vertica-ml-python yourself. But you could also see how it works. So if it doesn't meet all your needs, you could still see the code and customize it to build your own approach. We've also got lots of examples of our UDX framework. And so this is an older GitHub project. We've actually had this for a couple of years, but it is really useful and important, so I wanted to plug it here. Our User Defined eXtensions framework, or UDXs, allows you to extend the operators that Vertica executes when it does a database load or a database query. So with UDXs, you can write your own domain logic in C++, Java, Python, or R, and you can call it within the context of a SQL query. And Vertica brings your logic to that data, and makes it fast and scalable and fault tolerant and correct for you. So you don't have to worry about all those hard problems. So our UDX examples demonstrate how you can use our SDK to solve interesting problems. And some of these examples are complete, totally usable packages or libraries. So for example, we have a curl source that allows you to extract data from any curlable endpoint and load it into Vertica. We've got things like an ODBC connector that allows you to access data in an external database via an ODBC driver within the context of a Vertica query, all kinds of parsers and string processors and things like that. We also have more exciting and interesting things you might not really think of Vertica being able to do, like a heat map generator, which takes some XY coordinates and renders them on top of an image to show you the hotspots in it. So the image on the right was actually generated from one of our intern gaming sessions a few years back. So all these things are great examples that show you not just how you can solve problems, but also how you can use this SDK to solve neat things that maybe no one else has to solve, or maybe that are unique to your business and your needs. Another exciting benefit is with testing. So the test automation strategy that we have in vertica-python and these clients really generalizes well beyond the needs of a database client. Anyone that's ever built a Vertica integration or an application probably has a need to write some integration tests. And that can be hard to do with all the moving parts in a big data solution. But with our code being open source, you can see in vertica-python, in particular, how we've structured our tests to facilitate smooth testing that's fast, deterministic, and easy to use. So we've automated the download process and the installation and deployment process of a Vertica Community Edition. And with a single click, you can run through the tests locally and as part of the PR workflow via Travis CI. We also do this for multiple different Python environments. So for all Python versions from 2.7 up to 3.8, for different Python interpreters, and for different Linux distros, we're running through all of them very quickly with ease, thanks to all this automation.
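Going back to the UDX examples for a moment, here is a rough Python scalar-function sketch following the pattern in Vertica's Python SDK documentation. The class and library names are invented, and the exact SDK method names should be checked against the SDK docs for your Vertica version.

    import vertica_sdk

    class add_one(vertica_sdk.ScalarFunction):
        """Adds 1 to each integer in the input block."""
        def processBlock(self, server_interface, arg_reader, res_writer):
            while True:
                res_writer.setInt(arg_reader.getInt(0) + 1)
                res_writer.next()
                if not arg_reader.next():
                    break

    class add_one_factory(vertica_sdk.ScalarFunctionFactory):
        def createScalarFunction(self, srv):
            return add_one()
        def getPrototype(self, srv_interface, arg_types, return_type):
            arg_types.addInt()
            return_type.addInt()
        def getReturnType(self, srv_interface, arg_types, return_type):
            return_type.addInt()

    # Loaded into Vertica with something along the lines of:
    #   CREATE LIBRARY py_examples AS '/home/dbadmin/add_one.py' LANGUAGE 'Python';
    #   CREATE FUNCTION add_one AS LANGUAGE 'Python'
    #       NAME 'add_one_factory' LIBRARY py_examples;
    #   SELECT add_one(id) FROM some_table;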
So today, you can see how we do that testing in vertica-python; in the future, we might want to spin it out into its own stand-alone testbed starter project, so that if you're starting any new Vertica integration, this might be a good starting point for you to get going quickly. So that brings us to some of the future work we want to do here in the open source space. Well, there's a lot of it. So in terms of the client stuff, for Python, we are marching towards our 1.0 release, which is when we aim to be protocol complete, to support all of Vertica's unique protocols, including COPY LOCAL and some new protocols invented to support complex types, which is our new feature in Vertica 10. We have some cursor enhancements to do things like better streaming and improved performance. Beyond that, we want to take it where you want to bring it. So send us your requests. On the Go client front, it's just about a year behind Python in terms of its protocol implementation, but the basic operations are there. We still have more work to do to implement things like load balancing, some of the advanced auth methods, and other things. But there too, we want to work with you, and we want to focus on what's important to you, so that we can continue to grow and be more useful and more powerful over time. Finally, there's this question of, "Well, what about beyond database clients? What else might we want to do with open source?" If you're building a very deep or robust Vertica integration, you probably need to do a lot more exciting things than just run SQL queries and process the answers. Especially if you're an OEM, or you're a vendor that resells Vertica packaged as a black box piece of a larger solution, you might have to manage the whole operational lifecycle of Vertica. There are even fewer standards for doing all these different things compared to the SQL clients. So we started with the SQL clients, 'cause that's a well established pattern and there's lots of downstream work that it can enable. But there's also clearly a need for lots of other open source protocols, architectures, and examples to show you how to do these things where real standards don't exist. So we talked a little bit about how you could do UDXs or testing or Machine Learning, but there are all sorts of other use cases too. That's why we're excited to announce awesome-vertica, which is a new collection of open source resources available on our GitHub page. So if you haven't heard of this awesome manifesto before, I highly recommend you check out this GitHub page on the right. We're not unique here; there are lots of awesome projects for all kinds of different tools and systems out there. And it's a great way to establish a community and share different resources, whether they're open source projects, blogs, examples, references, community resources, and all that. And this tool is an open source project, so it's an open source wiki, and you can contribute to it by submitting a PR yourself. So we've seeded it with some of our favorite tools and projects out there, but there's plenty more out there, and we hope to see it grow over time. So definitely check this out and help us make it better. So with that, I'm going to wrap up. I wanted to thank you all. Special thanks to Siting Ren and Roger Huebner, who are the project leads for the Python and Go clients respectively. And also, thanks to all the customers out there who've already been contributing stuff.
This has already been going on for a long time and we hope to keep it going and keep it growing with your help. So if you want to talk to us, you can find us at this email address here. But of course, you can also find us on the Vertica forums, or you could talk to us on GitHub too. And there you can find links to all the different projects I talked about today. And so with that, I think we're going to wrap up and now we're going to hand it off for some Q&A.
Keep Data Private: Prepare and Analyze Without Unencrypting With Voltage SecureData for Vertica
>> Paige: Hello everybody and thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Keep Data Private: Prepare and Analyze Without Unencrypting With Voltage SecureData for Vertica. I'm Paige Roberts, Open Source Relations Manager at Vertica, and I'll be your host for this session. Joining me is Rich Gaston, Global Solutions Architect, Security, Risk, and Governance at Voltage. And before we begin, I encourage you to submit your questions or comments during the virtual session, you don't have to wait till the end. Just type your question as it occurs to you, or comment, in the question box below the slide and then click Submit. There'll be a Q&A session at the end of the presentation where we'll try to answer as many of your questions as we're able to get to during the time. Any questions that we don't address we'll do our best to answer offline. Now, if you want, you can visit the Vertica Forum to post your questions there after the session. Now, that's going to take the place of the Developer Lounge, and our engineering team is planning to join the Forum to keep the conversation going. So as a reminder, you can also maximize your screen by clicking the double arrow button in the lower-right corner of the slides. That'll allow you to see the slides better. And before you ask, yes, this virtual session is being recorded and it will be available to view on-demand this week. We'll send you a notification as soon as it's ready. All right, let's get started. Over to you, Rich. >> Rich: Hey, thank you very much, Paige, and appreciate the opportunity to discuss this topic with the audience. My name is Rich Gaston and I'm a Global Solutions Architect within the Micro Focus team, and I work on global Data privacy and protection efforts for many different organizations looking to take that journey toward breach defense and regulatory compliance, from platforms ranging from mobile to mainframe, everything in between, cloud, you name it, we're there in terms of our solution sets. Vertica is one of our major partners in this space, and I'm very excited to talk with you today about our solutions on the Vertica platform. First, let's talk a little bit about what you're not going to learn today, and that is, on screen you'll see just part of the mathematics that goes into the format-preserving encryption algorithm. We are the originators and authors and patent holders on that algorithm. It came out of research from Stanford University back in the '90s, and we are very proud to take that out into the market through the NIST standard process, and license that to others. So we are the originators and maintainers of both standards, and a thought leader in the industry. We try to make this easy and you don't have to learn any of this tough math. Behind this there are also many other layers of technology. They are part of the security platform, such as stateless key management. That's a really complex area, and we make it very simple for you. We have very mature and powerful products in that space that really make your job quite easy when you want to implement our technology within Vertica. So today, our goal is to make Data protection easy for you: to be able to understand the basics of Voltage Secure Data, you're going to be learning how the Vertica UDx can help you get started quickly, and we're going to see some examples of how Vertica plus Voltage Secure Data are going to be working together in our customer cases out in the field.
First, let's take you through a quick introduction to Voltage Secure Data. The business drivers and what's this all about. First of all, we started off with Breach Defense. We see that despite continued investments, in personal perimeter and platform security, Data breaches continue to occur. Voltage Secure Data plus Vertica, provides defense in depth for sensitive Data, and that's a key concept that we're going to be referring to. in the security field defense in depth, is a standard approach to be able to provide, more layers of protection around sensitive assets, such as your Data, and that's exactly what Secure Data is designed to do. Now that we've come through many of these breach examples, and big ticket items, getting the news around breaches and their impact, the business regulators have stepped up, and regulatory compliance, is now a hot topic in Data privacy. Regulations such as GDPR came online in 2018 for the EU. CCPA came online just this year, a couple months ago for California, and is the de-facto standard for the United States now, as organizations are trying to look at, the best practices for providing, regulatory compliance around Data privacy and protection. These gives massive new rights to consumers, but also obligations to organizations, to protect that personal Data. Secure Data Plus Vertica provides, fine grained authorization around sensitive Data, And we're going to show you exactly how that works, within the Vertica platform. At the bottom, you'll see some of the snippets there, of the news articles that just keep racking up, and our goal is to keep you off the news, to keep your company safe, so that you can have the assurance, that even if there is an unintentional, or intentional breach of Data out of the corporation, if it is protected by voltage Secure Data, it will be of no value to those hackers, and then you have no impact, in terms of risk to the organization. What do we mean by defense in depth? Let's take a look first at the encryption types, and the benefits that they provide, and we see our customers implementing, all kinds of different protection mechanisms, within the organization. You could be looking at disk level protection, file system protection, protection on the files themselves. You could protect the entire Database, you could protect our transmissions, as they go from the client to the server via TLS, or other protected tunnels. And then we look at Field-level Encryption, and that's what we're talking about today. That's all the above protections, at the perimeter level at the platform level. Plus, we're giving you granular access control, to your sensitive Data. Our main message is, keep the Data protected for at the earliest possible point, and only access it, when you have a valid business need to do so. That's a really critical aspect as we see Vertica customers, loading terabytes, petabytes of Data, into clusters of Vertica console, Vertica Database being able to give access to that Data, out to a wide variety of end users. We started off with organizations having, four people in an office doing Data science, or analytics, or Data warehousing, or whatever it's called within an organization, and that's now ballooned out, to a new customer coming in and telling us, we're going to have 1000 people accessing it, plus service accounts accessing Vertica, we need to be able to provide fine level access control, and be able to understand what are folks doing with that sensitive Data? And how can we Secure it, the best practices possible. 
In very simple state, voltage protect Data at rest and in motion. The encryption of Data facilitates compliance, and it reduces your risk of breach. So if you take a look at what we mean by feel level, we could take a name, that name might not just be in US ASCII. Here we have a sort of Latin one extended, example of Harold Potter, and we could take a look at the example protected Data. Notice that we're taking a character set approach, to protecting it, meaning, I've got an alphanumeric option here for the format, that I'm applying to that name. That gives me a mix of alpha and numeric, and plus, I've got some of that Latin one extended alphabet in there as well, and that's really controllable by the end customer. They can have this be just US ASCII, they can have it be numbers for numbers, you can have a wide variety, of different protection mechanisms, including ignoring some characters in the alphabet, in case you want to maintain formatting. We've got all the bells and whistles, that you would ever want, to put on top of format preserving encryption, and we continue to add more to that platform, as we go forward. Taking a look at tax ID, there's an example of numbers for numbers, pretty basic, but it gives us the sort of idea, that we can very quickly and easily keep the Data protected, while maintaining the format. No schema changes are going to be required, when you want to protect that Data. If you look at credit card number, really popular example, and the same concept can be applied to tax ID, often the last four digits will be used in a tax ID, to verify someone's identity. That could be on an automated telephone system, it could be a customer service representative, just trying to validate the security of the customer, and we can keep that Data in the clear for that purpose, while protecting the entire string from breach. Dates are another critical area of concern, for a lot of medical use cases. But we're seeing Date of Birth, being included in a lot of Data privacy conversations, and we can protect dates with dates, they're going to be a valid date, and we have some really nifty tools, to maintain offsets between dates. So again, we've got the real depth of capability, within our encryption, that's not just saying, here's a one size fits all approach, GPS location, customer ID, IP address, all of those kinds of Data strings, can be protected by voltage Secure Data within Vertica. Let's take a look at the UDx basics. So what are we doing, when we add Voltage to Vertica? Vertica stays as is in the center. In fact, if you get the Vertical distribution, you're getting the Secure Data UDx onboard, you just need to enable it, and have Secure Data virtual appliance, that's the box there on the middle right. That's what we come in and add to the mix, as we start to be able to add those capabilities to Vertica. On the left hand side, you'll see that your users, your service accounts, your analytics, are still typically doing Select, Update, Insert, Delete, type of functionality within Vertica. And they're going to come into Vertica's access control layer, they're going to also access those services via SQL, and we simply extend SQL for Vertica. So when you add the UDx, you get additional syntax that we can provide, and we're going to show you examples of that. You can also integrate that with concepts, like Views within Vertica. 
So that we can say, let's give a view of Data that gives the Data in the clear, using the UDx to decrypt that Data, and let's give everybody else access to the raw Data which is protected. Third parties could be brought in, folks like contractors or folks that aren't vetted as closely as a security team might do for internal sensitive Data access, could be given access to the Vertica cluster without risk of them breaching and going into some area they're not supposed to take a look at. Vertica has excellent control for access, down even to the column level, which is phenomenal, and really provides you with world class security around the Vertica solution itself. Secure Data adds another layer of protection, like we're mentioning, so that we can have Data protected in use, Data protected at rest, and then we can have the ability to share that protected Data throughout the organization. And that's really where Secure Data shines, is the ability to protect that Data on mainframe, on mobile, and open systems, in the cloud, everywhere you want to have that Data move to and from Vertica, then you can have Secure Data integrated with those endpoints as well. That's an additional solution on top of the Secure Data Plus Vertica solution that is bundled together today for a sales purpose. But we can also have that conversation with you about those wider Secure Data use cases, we'd be happy to talk to you about that. Secure Data's virtual appliance is a lightweight appliance, it sits on something like eight cores, 16 gigs of RAM, 100 gig of disk or 200 gig of disk, really a lightweight appliance, and you can have one or many. Most customers have four in production, just for redundancy, they don't need them for scale. But we have some customers with 16 or more in production, because they're running such high volumes of transaction load. They're running a lot of web service transactions, and they're running Vertica as well. So we're going to have those virtual appliances co-located around the globe, hooked up to all kinds of systems, like Syslog, LDAP, load balancers, we've got a lot of capability within the appliance to fit into your enterprise IT landscape. So let me get you directly into the meat of what the UDx does. If you're technical and you know SQL, this is probably going to be pretty straightforward to you, you'll see the copy command, used widely in Vertica to get Data into Vertica. So let's try to protect that Data when we're ingesting it. Let's grab it from maybe a CSV file and put it straight into Vertica, but protected on the way, and that's what the UDx does. We have Voltage Secure Protect, an added syntax, like I mentioned, to the Vertica SQL. And that allows us to say, we're going to protect the customer first name, using the parameters of hyper alphanumeric. That's our internal lingo for a format within Secure Data. This is part of our API, and the API requires very few inputs. The format is the one that you as a developer will be supplying, and you'll have different ones for maybe SSN, you'll have different formats for street address, but you can reuse a lot of your formats across a lot of your PII, PHI Data types. Protecting after ingest is also common. So I've got some Data that's already been put into a staging area, perhaps I've got a landing zone, a sandbox of some sort, now I want to be able to move that into a different zone in Vertica, a different area of the schema, and I want to have that Data protected.
We can do that with the update command, and simply again, you'll notice Voltage Secure protect, nothing too wild there, basically the same syntax. We're going to query unprotected Data. How do we search once I've encrypted all my Data? Well, actually, there's a pretty nifty trick to do so. If you want to be able to query unprotected Data, and we have the search string, like a phone number there in this example, simply call Voltage Secure protect on that, now you'll have the cipher text, and you'll be able to search the stored cipher text. Again, we're just format preserving encrypting the Data, and it's just a string, and we can always compare those strings, using standard syntax and SQL. Using views to decrypt Data, again a powerful concept, in terms of how to make this work, within the Vertica Landscape, when you have a lot of different groups of users. Views are very powerful, to be able to point a BI tool, for instance, business intelligence tools, Cognos, Tableau, etc, might be accessing Data from Vertica with simple queries. Well, let's point them to a view that does the hard work, and uses the Vertical nodes, and its horsepower of CPU and RAM, to actually run that Udx, and do the decryption of the Data in use, temporarily in memory, and then throw that away, so that it can't be breached. That's a nice way to keep your users active and working and going forward, with their Data access and Data analytics, while also keeping the Data Secure in the process. And then we might want to export some Data, and push it out to someone in a clear text manner. We've got a third party, needs to take the tax ID along with some Data, to do some processing, all we need to do is call Voltage Secure Access, again, very similar to the protect call, and you're writing the parameter again, and boom, we have decrypted the Data and used again, the Vertical resources of RAM and CPU and horsepower, to do the work. All we're doing with Voltage Secure Data Appliance, is a real simple little key fetch, across a protected tunnel, that's a tiny atomic transaction, gets done very quick, and you're good to go. This is it in terms of the UDx, you have a couple of calls, and one parameter to pass, everything else is config driven, and really, you're up and running very quickly. We can even do demos and samples of this Vertical Udx, using hosted appliances, that we put up for pre sales purposes. So folks want to get up and get a demo going. We could take that Udx, configure it to point to our, appliance sitting on the internet, and within a couple of minutes, we're up and running with some simple use cases. Of course, for on-prem deployment, or deployment in the cloud, you'll want your own appliance in your own crypto district, you have your own security, but it just shows, that we can easily connect to any appliance, and get this working in a matter of minutes. Let's take a look deeper at the voltage plus Vertica solution, and we'll describe some of the use cases and path to success. First of all your steps to, implementing Data-centric security and Vertica. Want to note there on the left hand side, identify sensitive Data. How do we do this? I have one customer, where they look at me and say, Rich, we know exactly what our sensitive Data is, we develop the schema, it's our own App, we have a customer table, we don't need any help in this. 
We've got other customers that say, Rich, we have a very complex Database environment, with multiple Databases, multiple schemas, thousands of tables, hundreds of thousands of columns, it's really, really complex, help us, and we don't know what people have been doing exactly with some of that Data. We've got various teams that share this resource. There, we do have additional tools, and I wanted to give a shout out to another Micro Focus product, which is called Structured Data Manager. It's a great tool that helps you identify sensitive Data, with some really amazing technology under the hood, that can go into a Vertica repository, scan those tables, take a sample of rows or a full table scan, and give you back some really good reports on, we think this is sensitive, let's go confirm it, and move forward with Data protection. So if you need help on that, we've got the tools to do it. Once you identify that sensitive Data, you're going to want to understand your Data flows and your use cases. Take a look at what analytics you're doing today. What analytics do you want to do on sensitive Data in the future? Let's start designing our analytics to work with sensitive Data, and there are some tips and tricks that we can provide to help you mitigate any kind of concerns around performance, or any kind of concerns around rewriting your SQL. As you've noted, you can just simply insert our SQL additions into your code and you're off and running. You want to install and configure the UDx and the Secure Data software appliance. Well, the UDx is pretty darn simple. The documentation on Vertica is publicly available, you can see how that works and what you need to configure it, one file here, and you're ready to go. So that's a pretty straightforward process. Then you grant some access to the UDx, and that's really up to the customer, because there are many different ways to handle access control in Vertica; we're going to be flexible to fit within your model of access control when adding the UDx to your mix. Each customer is a little different there, so you might want to talk with us a little bit about the best practices for your use cases. But in general, that's going to be up and running in just a minute. The Secure Data software appliance is a hardened Linux appliance today, and it sits on-prem or in the cloud. And you can deploy that. I've seen it done in 15 minutes, but that's when you have the right technical access to be able to generate what you need and do all of this, so that you're able to set the firewall and all the DNS entries, basically the blocking and tackling of a software appliance. Corporations can take care of that in just a couple of weeks, they get it all done, because they have to wait on other teams, but the software appliances are really fast to get stood up, and they're very simple to administer with our web-based GUI. Then finally, you're going to implement your UDx use cases. Once the software appliance is up and running, we can set authentication methods, we can set up the format that you're going to use in Vertica, and then those two start talking together. And it should be going in dev and test in about half a day, and then you're running toward production in just a matter of days, in most cases. We've got other customers that say, Hey, this is going to be a bigger migration project for us. We might want to split this up into chunks.
Let's do the real sensitive and scary Data, like tax ID first, as our sort of toe in the water approach, and then we'll come back and protect other Data elements. That's one way to slice and dice and implement your solution in a planned manner. Another way is schema based. Let's take a look at this section of the schema and implement protection on these Data elements. Now let's take a look at a different schema, and we'll repeat the process, so you can iteratively move forward with your deployment. So what's the added value when you add the full Vertica plus Voltage solution? I want to highlight this distinction because Vertica contains world class security controls around their Database. I'm an old time DBA from a different product, competing against Vertica in the past, and I'm really aware of the granular access controls that are provided within various platforms. Vertica would rank at the very top of the list in terms of being able to give me very tight control, and a lot of different access methods, being able to protect the Data in a lot of different use cases. So Vertica can handle a lot of your Data protection needs right out of the box. Voltage Secure Data, as we keep mentioning, adds that defense in depth, and it's going to enable those enterprise wide use cases as well. So first off, I mentioned this, the standard of FF1, that is format preserving encryption, we're the authors of it, we continue to maintain that, and we want to emphasize that customers really ought to be very, very careful in terms of choosing a NIST standard when implementing any kind of encryption within the organization. So AES was one of the first, and a hallmark, benchmark encryption algorithm, and in 2016, we were added to that mix, as FF1 with CS online. If you search NIST and Voltage Security, you'll see us right there as the author of the standard, and all the processes that went along with that approval. We have centralized policy for key management, authentication, audit and compliance. We can now see that Vertica selected or fetched the key to be able to protect some Data at this date and time. We can track that and be able to give you audit and compliance reporting against that Data. You can move protected Data into and out of Vertica. So we can ingest via Kafka, or via NiFi and Kafka, or ingest on StreamSets. There are a variety of different ingestion methods and streaming methods that can get Data into Vertica. We can integrate Secure Data with all of those components. We're very well suited to integrate with any Hadoop technology or any big Data technology, as we have APIs in a variety of languages, bitnesses, and platforms. So we've got that all out of the box, ready to go for you, if you need it. When you're moving Data out of Vertica, you might move it into an open systems platform, you might move it to the cloud, we can also operate and do the decryption there, you're going to get the same plaintext back, and if you protect Data over in the cloud and move it into Vertica, you're going to be able to decrypt it in Vertica. That's our cross platform promise. We've been delivering on that for many, many years, and we now have many, many endpoints that do that, in production for the world's largest organizations. We're going to preserve your Data format and referential integrity.
So if I protect my social security number today, I can protect another batch of Data tomorrow, and that same ciphertext will be generated, when I put that into Vertica, I can have absolute referential integrity on that Data, to be able to allow for analytics to occur, without even decrypting Data in many cases. And we have decrypt access for authorized users only, with the ability to add LDAP authentication authorization, for UDx users. So you can really have a number of different approaches, and flavors of how you implement voltage within Vertica, but what you're getting is the additional ability, to have that confidence, that we've got the Data protected at rest, even if I have a DBA that's not vetted or someone new, or I don't know where this person is from a third party, and being provided access as a DBA level privilege. They could select star from all day long, and they're going to get ciphertext, they're going to have nothing of any value, and if they want to use the UDF to decrypt it, they're going to be tracked and traced, as to their utilization of that. So it allows us to have that control, and additional layer of security on your sensitive Data. This may be required by regulatory agencies, and it's seeming that we're seeing compliance audits, get more and more strict every year. GDPR was kind of funny, because they said in 2016, hey, this is coming, they said in 2018, it's here, and now they're saying in 2020, hey, we're serious about this, and the fines are mounting. And let's give you some examples to kind of, help you understand, that these regulations are real, the fines are real, and your reputational damage can be significant, if you were to be in breach, of a regulatory compliance requirements. We're finding so many different use cases now, popping up around regional protection of Data. I need to protect this Data so that it cannot go offshore. I need to protect this Data, so that people from another region cannot see it. That's all the kind of capability that we have, within secure Data that we can add to Vertica. We have that broad platform support, and I mentioned NiFi and Kafka, those would be on the left hand side, as we start to ingest Data from applications into Vertica. We can have landing zone approaches, where we provide some automated scripting at an OS level, to be able to protect ETL batch transactions coming in. We could protect within the Vertica UDx, as I mentioned, with the copy command, directly using Vertica. Everything inside that dot dash line, is the Vertical Plus Voltage Secure Data combo, that's sold together as a single package. Additionally, we'd love to talk with you, about the stuff that's outside the dash box, because we have dozens and dozens of endpoints, that could protect and access Data, on many different platforms. And this is where you really start to leverage, some of the extensive power of secure Data, to go across platform to handle your web based apps, to handle apps in the cloud, and to handle all of this at scale, with hundreds of thousands of transactions per second, of format preserving encryption. That may not sound like much, but when you take a look at the algorithm, what we're doing on the mathematics side, when you look at everything that goes into that transaction, to me, that's an amazing accomplishment, that we're trying to reach those kinds of levels of scale, and with Vertica, it scales horizontally. So the more nodes you add, the more power you get, the more throughput you're going to get, from voltage secure Data. 
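To ground the UDx calls Rich has been describing, here is a rough sketch that drives them through the vertica-python client from the earlier session. The table, column, and format names are placeholders rather than anything from this talk, and the exact signatures of VoltageSecureProtect and VoltageSecureAccess should be confirmed against the Vertica and Voltage Secure Data documentation for your release.

```python
# Hypothetical sketch: protecting and accessing data with the Voltage Secure Data
# UDx functions described above, driven through vertica-python.
# Table, column, and format names are placeholders.
import vertica_python

conn_info = {'host': 'vertica.example.com', 'port': 5433,
             'user': 'dbadmin', 'password': 'example-password',
             'database': 'analytics'}

with vertica_python.connect(**conn_info) as connection:
    cur = connection.cursor()

    # Protect data that already landed in a staging table (the "protect after
    # ingest" case) by rewriting the column in place as ciphertext.
    cur.execute("""
        UPDATE staging.customers
        SET tax_id = VoltageSecureProtect(tax_id USING PARAMETERS format='ssn')
    """)

    # Query protected data: protect the search value first, then compare
    # ciphertext to ciphertext with ordinary SQL.
    cur.execute("""
        SELECT customer_id
        FROM staging.customers
        WHERE phone = VoltageSecureProtect('555-867-5309'
                                           USING PARAMETERS format='phone')
    """)
    print(cur.fetchall())

    # Expose clear text only to authorized users through a view that
    # decrypts on the fly with VoltageSecureAccess.
    cur.execute("""
        CREATE OR REPLACE VIEW reporting.customers_clear AS
        SELECT customer_id,
               VoltageSecureAccess(tax_id USING PARAMETERS format='ssn') AS tax_id
        FROM staging.customers
    """)
    connection.commit()
```

The same statements can of course be run from vsql or any other client; the pattern is simply protect on write, compare ciphertext for lookups, and decrypt through a view only for authorized users.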
I want to highlight the next steps on how we can continue to move forward. Our Secure Data team is available to you to talk about the landscape, your use cases, your Data. We really love the concept that we've got so many different organizations out there using Secure Data in so many different and unique ways. We have vehicle manufacturers who are protecting not just the VIN, not just their customer Data, but in fact they're protecting sensor Data from the vehicles, which is sent over the network down to the home base every 15 minutes for every vehicle that's on the road, and every vehicle of this customer of ours, since 2017, has included that capability. So now we're talking about an additional millions and millions of units coming online as those cars are sold and distributed and used by customers. That sensor Data is critical to the customer, and they cannot let that be exfiltrated in the clear. So they protect that Data with Secure Data, and we have a great track record of being able to meet a variety of different unique requirements, whether it's IoT, whether it's web based Apps, E-commerce, healthcare, all kinds of different industries, we would love to help move the conversations forward, and we do find that it's really a three party discussion, the customer, Secure Data experts in some cases, and the Vertica team. We have great enablement within the Vertica team to be able to explain and present our Secure Data solution to you. But we also have that other ability to add other experts in, to keep that conversation going into a broader perspective of how can I protect my Data across all my platforms, not just in Vertica. I want to give a shout out to our friends at Vertica Academy. They're building out great demo and training facilities to be able to help you learn more about these UDx's and how they're implemented. The Academy is a terrific reference and resource for your teams to be able to learn more about the solution in a self guided way, and then we'd love to have your feedback on that. How can we help you more? What are the topics you'd like to learn more about? How can we look to the future in protecting unstructured Data? How can we look to the future of being able to protect Data at scale? What are the requirements that we need to be meeting? Help us through the learning processes, and through feedback to the team, get better, and then we'll help you deliver more solutions out to those endpoints and protect that Data, so that we're not having Data breaches, we're not having regulatory compliance concerns. And then lastly, learn more about the UDx. I mentioned that all of our content there is online and available to the public. So at vertica.com/secureData, you're going to be able to walk through the basics of the UDx. You're going to see how simple it is to set up, what the UDx syntax looks like, how to grant access to it, and then you'll start to be able to figure out, hey, how can I start to put this into a POC in my own environment? Like I mentioned before, we have a publicly available hosted appliance for demo purposes that we can make available to you if you want to POC this. Reach out to us. Let's get a conversation going, and we'll get you the address and get you some instructions, and we can have a quick enablement session.
We really want to make this accessible to you, and help demystify the concept of encryption, because when you see it as a developer, and you start to get your hands on it and put it to use, you can very quickly see, huh, I could use this in a variety of different cases, and I could use this to protect my Data without impacting my analytics. Those are some of the really big concerns that folks have, and once we start to get through that learning process, and playing around with it in a POC way, we can start to really put it into practice and into production, to say, with confidence, we're going to move forward toward Data encryption, and have a very good result at the end of the day. This is one of the things I find with customers that's really interesting. Their biggest stress is not around the timeframe or the resource, it's really around, this is my Data, I have been working on collecting this Data, and making it available in a very high quality way, for many years. This is my job and I'm responsible for this Data, and now you're telling me you're going to encrypt that Data? It makes me nervous, and that's common, everybody feels that. So we want to have that conversation, and that sort of trial and error process to say, hey, let's get your feet wet with it, and see how you like it in a sandbox environment. Let's now take that into analytics, and take a look at how we can make this go for a quick 1.0 release, and let's then take a look at future expansions to that, where we start adding Kafka on the ingest side. We start sending Data off into other machine learning and analytics platforms that we might want to utilize outside of Vertica, for certain purposes, in certain industries. Let's take a look at those use cases together, and through that journey, we can really chart a path toward the future, where we can really help you protect that Data, at rest, in use, and keep you safe from both the hackers and the regulators, and that I think at the end of the day is really what it's all about in terms of protecting our Data within Vertica. We're going to have a couple of minutes for Q&A, and we would encourage you to ask any questions here, and we'd love to follow up with you more about any questions you might have about Vertica Plus Voltage Secure Data. Thank you very much for your time today.
A Deep Dive into the Vertica Management Console Enhancements and Roadmap
>> Jeff: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "A Deep Dive "into the Vertica Mangement Console Enhancements and Roadmap." I'm Jeff Healey of Vertica Marketing. I'll be your host for this breakout session. Joining me are Bhavik Gandhi and Natalia Stavisky from Vertica engineering. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer them offline. Alternatively visit Vertica Forums at forum.vertica.com. Post your question there after the session. Our engineering team is planning to join the forums to keep the conversation going well after the event. Also, a reminder that you can maximize the screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to you on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Bhavik. >> Bhavik: All right. So hello, and welcome, everybody doing this presentation of "Deep Dive into the Vertica Management Console Enhancements and Roadmap." Myself, Bhavik, and my team member, Natalia Stavisky, will go over a few useful announcements on Vertica Management Console, discussing a few real scenarios. All right. So today we will go forward with the brief introduction about the Management Console, then we will discuss the benefits of using Management Console by going over a couple of user scenarios for the query taking too long to run and receiving email alerts from Management Console. Then we will go over a few MC features for what we call Eon Mode databases, like provisioning and reviving the Eon Mode databases from MC, managing the subcluster and understanding the Depot. Then we will go over some of the future announcements on MC that we are planning. All right, so let's get started. All right. So, do you want to know about how to provision a new Vertica cluster from MC? How to analyze and understand a database workload by monitoring the queries on the database? How do you balance the resource pools and use alerts and thresholds on MC? So, the Management Console is basically our answer and we'll talk about its capabilities and new announcements in this presentation. So just to give a brief overview of the Management Console, who uses Management Console, it's generally used by IT administrators and DB admins. Management Console can be used to monitor both Eon Mode and Enterprise Mode databases. Why to use Management Console? You can use Management Console for provisioning Vertica databases and cluster. You can manage the already existing Vertica databases and cluster you have, and you can use various tools on Management Console like query execution, Database Designer, Workload Analyzer, and set up alerts and thresholds to get notified by some of your activities on the MC. So let's go over a few benefits of using Management Console. Okay. So using Management Console, you can view and optimize resource pool usage. Management Console helps you to identify some critical conditions on your Vertica cluster. 
Additionally, you can set up various thresholds in MC and get alerted if those thresholds are triggered on the database. So now let's dig into a couple of scenarios. For the first scenario, we will discuss queries taking too long and using Workload Analyzer to possibly help solve the problem. In the second scenario, we will go over an alert email that you received from your Management Console, analyze the problem, and take the required actions to solve the problem. So let's go over the scenario where queries are taking too long to run. So in this example, we have this one query that we are running using the query execution on MC. And for some reason we notice that it's taking about 14.8 seconds to execute this query, which is higher than the expected run time of the query. The query that we are running happens to be the query used by MC during the extended monitoring. Notice the table name and the schema name: ds_requests_issued is in the schema used for extended monitoring. Now in 10.0 MC we have redesigned the Workload Analyzer and Recommendations feature to show the recommendations and allow you to execute those recommendations. In our example, we have taken the table name and filtered the tuning descriptions to see if there are any tuning recommendations related to this table. As we see over here, there are three tuning recommendations available for that table. So now in 10.0 MC, you can select those recommendations and then run them. So let's run the recommendations. All right. So once recommendations are run successfully, you can go and see all the processed recommendations that you have run previously. Over here we see that the three recommendations that we had selected earlier have successfully processed. Now we take the same query and run it on the query execution on MC and hey, it's running much faster, and we see that it takes only 0.3 seconds to run the query, which is about a 98% decrease in the original runtime of the query. So in this example we saw that using the Workload Analyzer tool on MC you can possibly triage and solve issues for your queries which are taking too long to execute. All right. So now let's go over another user scenario where a DB admin received some alert email messages from MC and would like to understand and analyze the problem. So to know more about what's going on on the database and proactively react to the problems, DB admins using the Management Console can create a set of thresholds and get alerted about the conditions on the database if the threshold values are reached, and then respond to the problem thereafter. Now as a DB admin, I see some email message notifications from MC and upon checking the emails, I see that there are a couple of email alerts received from MC on my email. So one of the messages that I received was for Query Resource Rejections greater than 5 for pool midpool7. And then around the same time, I received another email from the MC for the Failed Queries greater than 5, and in this case I see there are 80 failed queries. So now let's go on the MC and investigate the problem. So before going into the deep investigation about failures, let's review the threshold settings on MC. So as we see, we have set up the thresholds under the database settings page for failed queries in the last 10 minutes greater than 5, and MC should send an email to the individual if the threshold is triggered.
And also we have a threshold set up for queries and resource rejections in the last five minutes for midpool7 set to greater than 5. There are various other thresholds on this page that you can set if you desire to. Now let's go and triage those email alerts about the failed queries and resource rejections that we had received. To analyze the failed queries, let's take a look at the query statistics page on the database Overview page on MC. Let's take a look at the Resource Pools graph and especially for the failed queries for each resource pools. And over to the right under the failed query section, I see about like, in the last 24 hours, there are about 6,000 failed queries for midpool7. And now I switch to view to see the statistics for each user and on this page I see for User MaryLee on the right hand side there are a high number of failed queries in last 24 hours. And to know more about the failed queries for this user, I can click on the graph for this user and get the reasons behind it. So let's click on the graph and see what's going on. And so clicking on this graph, it takes me to the failed queries view on the Query Monitoring page for database, on Database activities tab. And over here, I see there are a high number of failed queries for this user, MaryLee, with the reasons stated as, exceeding high limit. To drill down more and to know more reasons behind it, I can click on the plus icon on the left hand side for each failed queries to get the failure reason for each node on the database. So let's do that. And clicking the plus icon, I see for the two nodes that are listed, over here it says there are insufficient resources like memory and file handles for midpool7. Now let's go and analyze the midpool7 configurations and activities on it. So to do so, I will go over to the Resource Pool Monitoring view and select midpool7. I see the resource allocations for this resource pool is very low. For example, the max memory is just 1MB and the max concurrency is set to 0. Hmm, that's very odd configuration for this resource pool. Also in the bottom right graph for the resource rejections for midpool7, the graph shows very high values for resource rejection. All right. So since we saw some odd configurations and odd resource allocations for midpool7, I would like to see when this resource, when the settings were changed on the resource pools. So to do this, I can preview the audit logs on, are available on the Management Console. So I can go onto the Vertica Audit Logs and see the logs for the resource pool. So I just (mumbles) for the logs and figuring the logs for midpool7. I see on February 17th, the memory and other attributes for midpool7 were modified. So now let's analyze the resource activity for midpool7 around the time when the configurations were changed. So in our case we are using extended monitoring on MC for this database, so we can go back in time and see the statistics over the larger time range for midpool7. So viewing the activities for midpool7 around February 17th, around the time when these configurations were changed, we see a decrease in resource pool usage. Also, on the bottom right, we see the resource rejections for this midpool7 have an increase, linear increase, after the configurations were changed. I can select a point on the graph to get the more details about the resource rejections. Now to analyze the effects of the modifications on midpool7. Let's go over to the Query Monitoring page. 
All right, I will adjust the time range around the time when the configurations were changed for midpool7 and look at completed queries for user MaryLee. And I see there are no completed queries for this user. Now I'm taking a look at the Failed Queries tab and adjusting the time range around the time when the configurations were changed. I can do so because we are using extended monitoring. So again, adjusting the time, I can see there are a high number of failed queries for this user. There are about 10,000 failed queries for this user after the configurations were changed on this resource pool. So now let's go and modify the settings, since we know that after the configurations were changed, this user was not able to run the queries. So you can change the resource pool settings using Management Console's database settings page, under the Resource Pools tab. So selecting midpool7, I see the same odd configurations for this resource pool that we saw earlier. So now let's go and modify the settings. So I will increase the max memory and modify the settings for midpool7 so that it has adequate resources to run the queries for the user. Hit Apply at the top right to save the settings. Now let's do the validation after we change the resource pool attributes. So let's go over to the same query monitoring page and see if the user MaryLee is able to run the queries for midpool7. We see that now, after we changed the configuration for midpool7, the user can run the queries successfully, and the count for Completed Queries has increased after we modified the settings for this midpool7 resource pool. And also viewing the resource pool monitoring page, we can validate that the new configurations for midpool7 have been applied, and also that the resource pool usage has increased after the configuration change. And also on the bottom right graph, we can see that the resource rejections for midpool7 have decreased over time after we modified the settings. And since we are using extended monitoring for this database, I can see the trend in the data for this resource pool, the before and after effects of modifying the settings. So initially when the settings were changed, there were high resource rejections, and after we again modified the settings, the resource rejections went down. Right. So now let's go work with provisioning and reviving the Eon Mode Vertica database cluster using the Management Console on different platforms. So Management Console supports provisioning and reviving of Eon Mode databases on various cloud environments like AWS, the Google Cloud Platform, and Pure Storage. So for Google, for provisioning the Vertica Management Console on Google Cloud Platform you can use a launch template. Or in an AWS environment you can use the CloudFormation templates available for different OS's. Once you have provisioned Vertica Management Console, you can provision the Vertica cluster and databases from MC itself. So to provision a Vertica cluster, you can select the Create new database button available on the homepage. This will open up the wizard to create a new database and cluster. In this example, we are using the Google Cloud Platform. So the wizard will ask me for various authentication parameters for the Google Cloud Platform. And if you're on AWS, it'll ask you for the authentication parameters for the AWS environment. And going forward on the wizard, it'll ask me to select the instance type.
I will select for the new Vertica cluster. And also provide the communal location url for my Eon Mode database and all the other preferences related to the new cluster. Once I have selected all the preferences for my new cluster I can preview the settings and I can hit, if I am, I can hit Create if all looks okay. So if I hit Create, this will create a new, MC will create a new GCP instances because we are on the GCP environment in this example. It will create a cluster on this instance, it'll create a Vertica Eon Mode Database on this cluster. And it will, additionally, you can load the test data on it if you like to. Now let's go over and revive the existing Eon Mode database from the communal location. So you can do it the same using the Management Console by selecting the Revive Eon Mode database button on the homepage. This will again open up the wizard for reviving the Eon Mode database. Again, in this example, since we are using GCP Platform, it will ask me for the Google Cloud storage authentication attributes. And for reviving, it will ask me for the communal location so I can enter the Google Storage bucket and my folder and it will discover all the Eon Mode databases located under this folder. And I can select one of the databases that I would like to revive. And it will ask me for other Vertica preferences and for this video, for this database reviving. And once I enter all the preferences and review all the preferences I can hit Revive the database button on the Wizard. So after I hit Revive database it will create the GCP instances. The number of GCP instances that I created would be seen as the number of hosts on the original Vertica cluster. It will install the Vertica cluster on this data, on this instances and it will revive the database and it will start the database. And after starting the database, it will be imported on the MC so you can start monitoring on it. So in this example, we saw you can provision and revive the Vertica database on the GCP Platform. Additionally, you can use AWS environment to provision and revive. So now since we have the Eon Mode database on MC, Natalia will go over some Eon Mode features on MC like managing subcluster and Depot activity monitoring. Over to you, Natalia. >> Natalia: Okay, thank you. Hello, my name is Natalia Stavisky. I am also a member of Vertica Management Console Team. And I will talk today about the work I did to allow users to manage subclusters using the Management Console, and also the work I did to help users understand what's going on in their Depot in the Vertica Eon Mode database. So let's look at the picture of the subclusters. On the Manage page of Vertica Management Console, you can see here is a page that has blue tabs, and the tab that's active is Subclusters. You can see that there are two subclusters are available in this database. And for each of the subclusters, you can see subcluster properties, whether this is the primary subcluster or secondary. In this case, primary is the default subcluster. It's indicated by a star. You can see what nodes belong to each subcluster. You can see the node state and node statistics. You can also easily add a new subcluster. And we're quickly going to do this. So once you click on the button, you'll launch the wizard that'll take you through the steps. You'll enter the name of the subcluster, indicate whether this is secondary or primary subcluster. I should mention that Vertica recommends having only one primary subcluster. But we have both options here available. 
You will enter the number of nodes for your subcluster. And once the subcluster has been created, you can manage the subcluster. What other options for managing subcluster we have here? You can scale up an existing subcluster and that's a similar approach, you launch the wizard and (mumbles) nodes. You want to add to your existing subcluster. You can scale down a subcluster. And MC validates requirements for maintaining minimal number of nodes to prevent database shutdown. So if you can not remove any nodes from a subcluster, this option will not be available. You can stop a subcluster. And depending on whether this is a primary subcluster or secondary subcluster, this option may be available or not available. Like in this picture, we can see that for the default subcluster this option is not available. And this is because shutting down the default subcluster will cause the database to shut down as well. You can terminate a subcluster. And again, the MC warns you not to terminate the primary subcluster and validates requirements for maintaining minimal number of nodes to prevent database shutdown. So now we are going to talk a little more about how the MC helps you to understand what's going on in your Depot. So Depot is one of the core of Eon Mode database. And what are the frequently asked questions about the Depot? Is the Depot size sufficient? Are a subset of users putting a high load on the database? What tables are fetched and evicted repeatedly, we call it "re-fetched," in Depot? So here in the Depot Activity Monitoring page, we now have four tabs that allow you to answer those questions. And we'll go a little more in detail through each of them, but I'll just mention what they are for now. At a Glance shows you basic Depot configuration and also shows you query executing. Depot Efficiency, we'll talk more about that and other tabs. Depot Content, that shows you what tables are currently in your Depot. And Depot Pinning allows you to see what pinning policies have been created and to create new pinning policies. Now let's go through a scenario. Monitoring performance of workloads on one subcluster. As you know, Eon Mode database allows you to have multiple subclusters and we'll explore how this feature is useful and how we can use the Management Console to make decisions regarding whether you would like to have multiple subclusters. So here we have, in my setup, a single subcluster called default_subcluster. It has two users that are running queries that are accessing tables, mostly in schema public. So the query started executing and we can see that after fetching tables from Communal, which is the red line, the rest of the time the queries are executing in Depot. The green line is indicating queries running in Depot. The all nodes Depot is about 88% full, a steady flow, and the depot size seems to be sufficient for query executions from Depot only. That's the good case scenario. Now at around 17 :15, user Sherry got an urgent request to generate a report. And at, she started running her queries. We can see that picture is quite different now. The tables Sherry is querying are in a different schema and are much larger. Now we can see multiple lines in different colors. We can see a bunch of fetches and evictions which are indicated by blue and purple bars, and a lot of queries are now spilling into Communal. This is the red and orange lines. Orange line is an indicator of a query running partially in Depot and partially getting fetched from Communal. 
And the red line is data fetched from Communal storage. Let's click on one of the lines. Each data point, each point on the line, will take you to the Query Details page where you can see more about what's going on. So this is the page that shows us what queries have been run in this particular time interval, which is on top of this page in orange color. So that's about a one-minute time interval, and now we can see user Sherry among the users that are running queries. Sherry's queries involve large tables and are running against a different schema. We can see the clickstream schema in part of the query request. So what is happening is there is not enough Depot space for both the schema that's already in use and the one Sherry needs. As a result, evictions and fetches have started occurring. What other questions can we ask ourselves to help us understand what's going on? So how about, what tables are most frequently re-fetched? So for that, we will go to the Depot Efficiency page and look at the middle chart here. We can see the larger version of this chart if we expand it. So now we have 10 tables listed that are most frequently being re-fetched. We can see that there is a clickstream schema and there are other schemas, so all of those tables are being used in the queries and fetched, and then there is not enough space in the Depot, so they get evicted and then re-fetched again. So what can be done to enable all queries to run in Depot? Option one can be to increase the Depot size. So we can do this by running the following query, which (mumbles) which nodes and storage location and the new Depot size. And I should mention that we can run this query from the Management Console from the query execution page. So this would have helped us to increase the Depot size. What other options do we have, for example, when increasing Depot size is not an option? We can also provision a second subcluster to isolate workloads like Sherry's. So we are going to do this now and we will provision a second subcluster using the Manage page. Here we're creating a subcluster for Sherry or for workloads like hers. And we're going to create a (mumbles). So Sherry's subcluster has been created. We can see it here, added to the list of the subclusters. It's a secondary subcluster. Sherry has been instructed to use the new SherrySubcluster for her work. Now let's see what happened. We'll go again to the Depot Activity page and we'll look at the At a Glance tab. We can see that around 18:07, Sherry switched to running her queries on SherrySubcluster. At the top of this page, you can see the subcluster selected. So we currently have two subclusters, and I'm looking at what happened to SherrySubcluster once it was provisioned. So Sherry started using it, and the lines, after the initial fetching into Depot from Communal, which was the red line, after that, all Sherry's queries fit in Depot, which is indicated by the green line. Also the Depot is pretty full on those nodes, about 90% full. But the queries are processed efficiently, there is no spilling into Communal. So that's a good case scenario. Let's now go back and take a look at the original subcluster, the default subcluster. So on the left portion of the chart we can see multiple lines, that was activity before Sherry switched to her own designated subcluster. At around 18:07, after Sherry switched from this subcluster to using her designated subcluster, she is no longer using the subcluster, she is not putting a load on it.
So the lines after that turn green, which means the queries that are still running in the default subcluster are all running in the Depot. We can also see that the Depot fetches and evictions bars, those purple and blue bars, are no longer showing significant numbers. We can also check the second chart, which shows Communal Storage Access, and we can see that those bars have also dropped, so there is no significant access to Communal Storage. So this problem has been solved. Each of the subclusters is serving queries from the Depot, and that's our most efficient scenario. Let's also look at the other tabs that we have for Depot monitoring. Let's look at the Depot Efficiency tab. It has six charts, and I'll go through each one of them quickly. File Reads by Location gives an indicator of where the majority of query execution took place, in the Depot or in Communal. Top 10 Re-Fetches into Depot, as in the chart earlier in our use case, shows tables that are most frequently fetched, evicted, and then fetched again. These are good candidates to get pinned if increasing the Depot size is not an option. Note that both of these charts have an option to select a time interval using a calendar widget, so you can get information about the activity that happened during that time interval. Depot Pinning shows what portion of your Depot is pinned, both by byte count and by table count. And the three tables at the bottom show Depot structure: how long tables stay in the Depot, and we would like tables to be fetched into the Depot and stay there for a long time; how often they are accessed, and again, we would like to see the tables in the Depot accessed frequently; and the size range of tables in the Depot. Depot Content: this tab allows us to search for tables that are currently in the Depot and also to see stats like table size in the Depot, how often tables are accessed, and when they were last accessed. And the same information that's available for tables in the Depot is also available at the projection and partition levels for those tables. Depot Pinning: this tab allows users to see what policies currently exist, and you can do this by clicking on the first little button and clicking search. This will show you all existing policies that have already been created. The second option allows you to search for a table and create a policy. You can also use the action column to modify existing policies or delete them. And the third option provides details about the most frequently re-fetched tables, including fetch count, total access count, and number of re-fetched bytes. So all this information can help you make decisions about pinning specific tables. So that's about it for the Depot. And I should mention that the server team also has a very good webinar presentation on Eon Mode database Depot management and subcluster management; I strongly recommend attending it or downloading the slide presentation. Let's talk quickly about the Management Console roadmap, what we are planning to do in the future. We are going to continue focusing on subcluster management; there are still a lot of things we can do here: promoting and demoting subclusters, load balancing across subclusters, scheduling subcluster actions, and support for large cluster mode. We'll continue working on Workload Analyzer enhancements and recommendations, on backup and restore from the MC, on building custom thresholds, and on Eon on HDFS support. Okay, so we are ready now to take any questions you may have. Thank you.
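The two remedies described in this session, growing the Depot and pinning frequently re-fetched tables, can be issued as SQL from the MC's query page or from any client. Below is a minimal sketch using the open-source vertica-python client; the host, node name, size, table, and system-table query are illustrative assumptions about a recent Eon Mode release that provides the ALTER_LOCATION_SIZE and SET_DEPOT_PIN_POLICY_TABLE meta-functions, not commands taken from the talk.

```python
# Hedged sketch: resize one node's depot and pin a hot table.
# All names (host, node, table) are hypothetical placeholders.
import vertica_python

conn_info = {
    "host": "vertica.example.com",   # hypothetical host
    "port": 5433,
    "user": "dbadmin",
    "password": "xxxx",
    "database": "analytics",
    "autocommit": True,
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Grow the depot on one node so more of the working set stays local.
    cur.execute("SELECT ALTER_LOCATION_SIZE('depot', 'v_analytics_node0001', '80%')")
    # Pin a frequently re-fetched table so it is not evicted from the depot.
    cur.execute("SELECT SET_DEPOT_PIN_POLICY_TABLE('clickstream.events')")
    # Check which pinning policies now exist (system table name assumed).
    cur.execute("SELECT * FROM depot_pin_policies")
    for row in cur.fetchall():
        print(row)
```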
UNLIST TILL 4/2 - The Next-Generation Data Underlying Architecture
>> Paige: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled, Vertica next generation architecture. I'm Paige Roberts, open social relationship Manager at Vertica, I'll be your host for this session. And joining me is Vertica Chief Architect, Chuck Bear, before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment, in the question box that's below the slides and click submit. So as you think about it, go ahead and type it in, there'll be a Q&A session at the end of the presentation, where we'll answer as many questions, as we're able to during the time. Any questions that we don't get a chance to address, we'll do our best to answer offline. Or alternatively, you can visit the Vertica forums to post your questions there, after the session. Our engineering team is planning to join the forum and keep the conversation going, so you can, it's just sort of like the developers lounge would be in delight conference. It gives you a chance to talk to our engineering team. Also, as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And before you ask, yes, this virtual session is being recorded, and it will be available to view on demand this week, we'll send you a notification, as soon as it's ready. Okay, now, let's get started, over to you, Chuck. >> Chuck: Thanks for the introduction, Paige, Vertica vision is to help customers, get value from structured data. This vision is simple, it doesn't matter what vertical the customer is in. They're all analytics companies, it doesn't matter what the customers environment is, as data is generated everywhere. We also can't do this alone, we know that you need other tools and people to build a complete solution. You know our database is key to delivering on the vision because we need a database that scales. When you start a new database company, you aren't going to win against 30 year old products on features. But from day one, we had something else, an architecture built for analytics performance. This architecture was inspired by the C-store project, combining the best design ideas from academics and industry veterans like Dr. Mike Stonebreaker. Our storage is optimized for performance, we use many computers in parallel. After over 10 years of refinements against various customer workloads, much of the design held up and serendipitously, the fact that we don't store in place updates set Vertica up for success in the cloud as well. These days, there are other tools that embody some of these design ideas. But we have other strengths that are more important than the storage format, where the only good analytics database that runs both on premise and in the cloud, giving customers the option to migrate their workloads, in most convenient and economical environment, or a full data management solution, not just the query tool. Unlike some other choices, ours comes with integration with a sequel ecosystem and full professional support. We organize our product roadmap into four key pillars, plus the cross cutting concerns of open integration and performance and scale. We have big plans to strengthen Vertica, while staying true to our core. 
This presentation is primarily about the separation pillar, and about performance and scale. I'll cover our plans for Eon, our data management architecture, smart analytic clusters, our fifth-generation query executor, and our data storage layer. Let's start with how Vertica manages data. One of the central design points for Vertica was shared nothing, a design that didn't utilize dedicated shared-disk hardware. This quote here is how Mike put it politely, but around the Vertica office, it was shared disk over Mike's dead body. And we did get some early field experience with shared disk; customers will, in fact, run on anything if you let them. There were misconfigurations that required certified experts, and obscure bugs. Another thing about the shared-nothing design for commodity hardware, though, and this was in the papers, is that all the data management features like fault tolerance, backup, and elasticity have to be done in software. And no matter how much you do, procuring, configuring, and maintaining the machines with disks is harder. The software configuration process to add more servers may be simple, but capacity planning, racking, and stacking is not. The original allure of shared storage returned; this time, though, the complexity and economics are different. It's cheaper: you can provision storage with a few clicks and only pay for what you need. It expands, it contracts, and it moves the maintenance of the storage to a team that is good at it. But there's a key difference: it's an object store, and object stores don't support the APIs and access patterns used by most database software. So another Vertica visionary, Ben, set out to exploit Vertica's storage organization, which turns out to be a natural fit for modern cloud shared storage. Because Vertica data files are written once and not updated, they match the object storage model perfectly. And so today we have Eon. Eon uses shared storage to hold Vertica data, with local disk depots that act as caches, ensuring that we can get the performance that our customers have come to expect. Essentially, Eon and Enterprise behave similarly, but we have the benefit of flexible storage. Today Eon has the features our customers expect, it's been developed and tuned for years, and we have successful customers such as Redpharma. If you'd like to know more about how Eon has helped them succeed in the Amazon cloud, I highly suggest reading their case study, which you can find on vertica.com. Eon provides high availability and flexible scaling; sometimes on-premise customers with local disks get a little jealous of how recovery and sub-clusters work in Eon. Eon can also operate on premise, particularly on Pure Storage. But Enterprise also has strengths, the most obvious being that you don't need shared storage to run it. So naturally, our vision is to converge the two modes back into a single Vertica. A Vertica that runs any combination of local disks and shared storage, with full flexibility and portability. This is easy to say, but over the next releases, here's what we'll do. First, we realize that the query executor, optimizer, client drivers, and so on are already the same. Just the transaction handling and data management is different. But there's already more going on: we have peer-to-peer depot operations and other internode transfers. And Enterprise also has a network; we could just get files from remote nodes over that network, essentially mimicking the behavior and benefits of shared storage with a layer of software.
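To make the depot idea above concrete: conceptually it is a byte-capped local cache sitting in front of write-once files in communal storage. The sketch below is not Vertica code, just a toy illustration of that caching pattern; `fetch_from_communal` stands in for whatever object-store read would actually be used.

```python
# Conceptual sketch only (not Vertica internals): a local "depot" acting as a
# size-capped LRU cache over write-once files held in communal storage.
from collections import OrderedDict

class Depot:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.files = OrderedDict()          # path -> bytes, in LRU order

    def read(self, path, fetch_from_communal):
        if path in self.files:              # depot hit: serve from local disk
            self.files.move_to_end(path)
            return self.files[path]
        data = fetch_from_communal(path)    # depot miss: fetch, then cache
        self.files[path] = data
        while sum(len(v) for v in self.files.values()) > self.capacity:
            self.files.popitem(last=False)  # evict the least recently used file
        return data
```

Because the files are immutable, eviction never loses the master copy; the communal store always has it, which is the property the talk leans on.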
The only difference at the end of it, will be which storage hold the master copy. In enterprise, the nodes can't drop the files because they're the master copy. Whereas in Eon they can be evicted because it's just the cache, the masters, then shared storage. And in keeping with versus current support for multiple storage locations, we can intermix these approaches at the table level. Getting there as a journey, and we've already taken the first steps. One of the interesting design ideas of the C-store paper is the idea that redundant copies, don't have to have the same physical organization. Different copies can be optimized for different queries, sorted in different ways. Of course, Mike also said to keep the recovery system simple, because it's hard to debug, whenever the recovery system is being used, it's always in a high pressure situation. This turns out to be a contradiction, and the latter idea was better. No down performing stuff, if you don't keep the storage the same. Recovery hardware if you have, to reorganize data in the process. Even query optimization is more complicated. So over the past couple releases, we got rid of non identical buddies. But the storage files can still diverge at the fifth level, because tuple mover operations are synchronized. The same record can end up in different files than different nodes. The next step in our journey, is to make sure both copies are identical. This will help with backup and restore as well, because the second copy doesn't need backed up, or if it is backed up, it appears identical to the deduplication that is going to look present in both backup systems. Simultaneously, we're improving the Vertica networking service to support this new access pattern. In conjunction with identical storage files, we will converge to a recovery system that instantaneous nodes can process queries immediately, by retrieving data they need over the network from the redundant copies as they do in Eon day with even higher performance. The final step then is to unify the catalog and transaction model. Related concepts such as segment and shard, local catalog and shard catalog will be coalesced, as they're really represented the same concepts all along, just in different modes. In the catalog, we'll make slight changes to the definition of a projection, which represents the physical storage organization. The new definition simplifies segmentation and introduces valuable granularities of sharding to support evolution over time, and offers a straightforward migration path for both Eon and enterprise. There's a lot more to our Eon story than just the architectural roadmap. If you missed yesterday's Vertica, in Eon mode presentation about supported cloud, on premise storage option, replays are available. Be sure to catch the upcoming presentation on sizing and configuring vertica and in beyond doors. As we've seen with Eon, Vertica can separate data storage from the compute nodes, allowing machines to quickly fill in for each other, to rebuild fault tolerance. But separating compute and storage is used for much, much more. We now offer powerful, flexible ways for Vertica to add servers and increase access to the data. Vertica nine, this feature is called sub-clusters. It allows computing capacity to be added quickly and incrementally, and isolates workloads from each other. 
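Backing up to the projection definition mentioned above, before the sub-cluster discussion continues: a projection is the DDL-level handle on physical organization, i.e. column selection, sort order, segmentation, and the K-safe buddy copy. The snippet below is a hypothetical example with invented table and column names, shown as a SQL string runnable through any Vertica client; it is not a definition taken from the talk.

```python
# Hypothetical projection DDL illustrating the physical-design knobs discussed
# above: sort order, hash segmentation across all nodes, and a K-safe buddy.
CREATE_TRIPS_PROJECTION = """
CREATE PROJECTION public.trips_by_city
AS SELECT trip_id, city_id, fare
   FROM public.trips
   ORDER BY city_id, trip_id
   SEGMENTED BY HASH(trip_id) ALL NODES KSAFE 1;
"""

if __name__ == "__main__":
    import vertica_python  # pip install vertica-python
    with vertica_python.connect(host="vertica.example.com", port=5433,
                                user="dbadmin", password="xxxx",
                                database="analytics") as conn:
        conn.cursor().execute(CREATE_TRIPS_PROJECTION)
```

The KSAFE 1 clause requests the redundant buddy copy; the roadmap point above is that those buddies converge to identical physical layouts.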
If your exploratory analytics team needs direct access to the source data, they need a lot of machines and not the same number all the time, and you don't 100% trust the kind of queries and user defined functions, they might be using sub-clusters as the solution. While there's much more expensive information available in our other presentation. I'd like to point out the highlights of our latest sub-cluster best practices. We suggest having a primary sub-cluster, this is the one that runs all the time, if you're loading data around the clock. It should be sized for the ETL workloads and also determines the natural shard count. Additional read oriented secondary sub-clusters can be added for real time dashboards, reports and analytics. That way, subclusters can be added or deep provisioned, without disruption to other users. The sub-cluster features of Vertica 9.3 are working well for customers. Yesterday, the Trade Desk presented their use case for Vertica over 300,000 in 5 sub clusters running in the cloud. If you missed a presentation, check out the replay. But we have plans beyond sub-clusters, we're extending sub-clusters to real clusters. For the Vertica savvy, this means the clusters bump, share the same spread ring network. This will provide further isolation, allowing clusters to control their own independent data sets. While replicating all are part of the data from other clusters using a publish subscribe mechanism. Synchronizing data between clusters is a feature customers want to understand the real business for themselves. This vision effects are designed for ancillary aspects, how we should assign resource pools, security policies and balance client connection. We will be simplifying our data segmentation strategy, so that when data that originate in the different clusters meet, they'll still get fully optimized joins, even if those clusters weren't positioned with the same number of nodes per shard. Having a broad vision for data management is a key component to political success. But we also take pride in our execution strategy, when you start a new database from scratch as we did 15 years ago, you won't compete on features. Our key competitive points where speed and scale of analytics, we set a target of 100 x better query performance in traditional databases with path loads. Our storage architecture provides a solid foundation on which to build toward these goals. Every query starts with data retrieval, keeping data sorted, organized by column and compressed by using adaptive caching, to keep the data retrieval time in IO to the bare minimum theoretically required. We also keep the data close to where it will be processed, and you clusters the machines to increase throughput. We have partition pruning a robust optimizer evaluate active use segmentation as part of the physical database designed to keep records close to the other relevant records. So the solid foundation, but we also need optimal execution strategies and tactics. One execution strategy which we built for a long time, but it's still a source of pride, it's how we process expressions. Databases and other systems with general purpose expression evaluators, write a compound expression into a tree. Here I'm using A plus one times B as an example, during execution, if your CPU traverses the tree and compute sub-parts from the whole. Tree traversal often takes more compute cycles than the actual work to be done. Especially in evaluation is a very common operation, so something worth optimizing. 
One instinct that engineers have is to use what we call, just-in-time or JIT compilation, which means generating code form the CPU into the specific activity expression, and add them. This replaces the tree of boxes that are custom made box for the query. This approach has complexity bugs, but it can be made to work. It has other drawbacks though, it adds a lot to query setup time, especially for short queries. And it pretty much eliminate the ability of mere models, mere mortals to develop user defined functions. If you go back to the problem we're trying to solve, the source of the overhead is the tree traversal. If you increase the batch of records processed in each traversal step, this overhead is amortized until it becomes negligible. It's a perfect match for a columnar storage engine. This also sets the CPU up for efficiency. The CPUs look particularly good, at following the same small sequence of instructions in a tight loop. In some cases, the CPU may even be able to vectorize, and apply the same processing to multiple records to the same instruction. This approach is easy to implement and debug, user defined functions are possible, then generally aligned with the other complexities of implementing and improving a large system. More importantly, the performance, both in terms of query setup and record throughput is dramatically improved. You'll hear me say that we look at research and industry for inspiration. In this case, our findings in line with academic binding. If you'd like to read papers, I recommend everything you always wanted to know about compiled and vectorized queries, don't afraid to ask, so we did have this idea before we read that paper. However, not every decision we made in the Vertica executer that the test of time as well as the expression evaluator. For example, sorting and grouping aren't susceptible to vectorization because sort decisions interrupt the flow. We have used JIT compiling on that for years, and Vertica 401, and it provides modest setups, but we know we can do even better. But who we've embarked on a new design for execution engine, which I call EE five, because it's our best. It's really designed especially for the cloud, now I know what you're thinking, you're thinking, I just put up a slide with an old engine, a new engine, and a sleek play headed up into the clouds. But this isn't just marketing hype, here's what I mean, when I say we've learned lessons over the years, and then we're redesigning the executer for the cloud. And of course, you'll see that the new design works well on premises as well. These changes are just more important for the cloud. Starting with the network layer in the cloud, we can't count on all nodes being connected to the same switch. Multicast doesn't work like it does in a custom data center, so as I mentioned earlier, we're redesigning the network transfer layer for the cloud. Storage in the cloud is different, and I'm not referring here to the storage of persistent data, but to the storage of temporary data used only once during the course of query execution. Our new pattern is designed to take into account the strengths and weaknesses of cloud object storage, where we can't easily do a path. Moving on to memory, many of our access patterns are reasonably effective on bare metal machines, that aren't the best choice on cloud hyperbug that have overheads, page faults or big gap. Here again, we found we can improve performance, a bit on dedicated hardware, and even more in the cloud. 
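The batch-at-a-time argument made above is easy to reproduce outside Vertica. The toy sketch below (reading the spoken "A plus one times B" as a + 1 * b) contrasts per-row dispatch with one dispatch per operator per batch using NumPy; the point is the amortized interpretation overhead, not the absolute numbers, and nothing here is the actual executor code.

```python
# Illustrative only: per-record "tree traversal" versus batched evaluation.
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

def eval_row_at_a_time(a, b):
    # Dispatch overhead is paid once per record; it dominates the real work.
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] + 1.0 * b[i]
    return out

def eval_batch_at_a_time(a, b):
    # One dispatch per operator per batch; the inner loops stay tight and
    # can be vectorized by the CPU.
    return a + 1.0 * b

assert np.allclose(eval_row_at_a_time(a[:1000], b[:1000]),
                   eval_batch_at_a_time(a[:1000], b[:1000]))
```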
Finally, and this is true in all environments, core counts have gone up. And not all of our algorithms take full advantage, there's a lot of ground to cover here. But I think sorting in the perfect example to illustrate these points, I mentioned that we use JIT in sorting. We're getting rid of JIT in favor of a data format that can be treated efficiently, independent of what the data types are. We've drawn on the best, most modern technology from academia and industry. We've got our own analysis and testing, you know what we chose, we chose parallel merge sort, anyone wants to take a guess when merge sort was invented. It was invented in 1948, or at least documented that way, like computing context. If you've heard me talk before, you know that I'm fascinated by how all the things I worked with as an engineer, were invented before I was born. And in Vertica , we don't use the newest technologies, we use the best ones. And what is noble about Vertica is the way we've combined the best ideas together into a cohesive package. So all kidding about the 1940s aside, or he redesigned is actually state of the art. How do we know the sort routine is state of the art? It turns out, there's a pretty credible benchmark or at the appropriately named historic sortbenchmark.org. Anyone with resources looking for fame for their product or academic paper can try to set the record. Record is last set in 2016 with Tencent Sort, 100 terabytes in 99 seconds. Setting the records it's hard, you have to come up with hundreds of machines on a dedicated high speed switching fabric. There's a lot to a distributed sort, there all have core sorting algorithms. The authors of the paper conveniently broke out of the time spent in their sort, 67 out of 99 seconds want to know local sorting. If we break this out, divided by two CPUs and each of 512 nodes, we find that each CPU so there's almost a gig and a half per second. This is for what's called an indy sort, like an Indy race car, is in general purpose. It only handles fixed hundred five records with 10 byte key. There is a record length can vary, then it's called daytona sort, a 10 set daytona sort, is a little slower. One point is 10 gigabytes per second per CPU, now for Verrtica, We have a wide variety ability in record sizes, and more interesting data types, but still no harm in setting us like phone numbers, comfortable to the world record. On my 2017 era AMD desktop CPU, the Vertica EE5 sort to store about two and a half gigabytes per second. Obviously, this test isn't apply to apples because they use their own open power chip. But the number of DRM channels is the same, so it's pretty close the number that says we've hit on the right approach. And it performs this way on premise, in the cloud, and we can adapt it to cloud temp space. So what's our roadmap for integrating EE5 into the product and compare replacing the query executed the database to replacing the crankshaft and other parts of the engine of a car while it's been driven. We've actually done it before, between Vertica three and a half and five, and then we never really stopped changing it, now we'll do it again. The first part in replacing with algorithm called storage merge, which combines sorted data from disk. The first time has was two that are in vertical in incoming 10.0 patch that will be EE5 or resegmented storage merge, and then convert sorting and grouping into do out. 
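Merge sort's appeal for the workload described above is that independent runs can be sorted in parallel and then combined in a single streaming pass. The toy Python sketch below shows that shape only; it is in no way Vertica's EE5 implementation, and the worker count and data are arbitrary.

```python
# Toy parallel merge sort: sort chunks in parallel, then k-way merge the runs.
import heapq
import random
from multiprocessing import Pool

def parallel_merge_sort(values, workers=4):
    chunk = (len(values) + workers - 1) // workers
    pieces = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with Pool(workers) as pool:
        runs = pool.map(sorted, pieces)     # sort each run in parallel
    return list(heapq.merge(*runs))         # merge sorted runs in one pass

if __name__ == "__main__":
    data = [random.random() for _ in range(1_000_000)]
    assert parallel_merge_sort(data) == sorted(data)
```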
Here are the performance results so far. In cases where the Vertica executor is doing well today, simple environments with simple data patterns, such as this simple query, there's a lot of speedup, and more to come when we ship the resegmentation code, which didn't quite make the freeze for this release. Longer term, as we move grouping into the storage merge operation, we'll get to where we think we ought to be, given the theoretical minimum work the CPUs need to do. Now if we look at a case where the current executor isn't doing as well, we see there's a much stronger benefit from the code shipping in Vertica 10. In fact, it turns the bar chart sideways, to try to help you see the difference better. This case will also benefit from the improvements in the 10.x point releases and beyond. There's a lot more happening in the Vertica query executor; that was just a taste. But now I'd like to switch to the roadmap for our data storage layer. I'll start with a story about how our storage access layer evolved. If you go back to the academic ideas, to the C-store paper that persuaded investors to fund Vertica, the read optimized store was the part that had substantiation in the form of performance data. Much of the paper was speculative, but we tried to follow it anyway. That paper talked about the WS and the RS, the writeable store and the read store, and how they worked together for transaction processing. In all honesty, Vertica engineers couldn't figure out from the paper what to do next, in case you want to try. We asked, and we never got enough clarification to build it that way. But here's what we built instead. We built the ROS, the read optimized store. It's sorted, columnar, and compressed, it follows the table partitioning, and it worked even better than the RS as described in the paper. We also built the WOS, the write optimized store; we built four versions of this over the years, actually. This was the best one: it's not a set of interrelated trees, it's just an append-only, insertion-order, in-memory store. No compression, no sorting, no partitioning. There is, however, a tuple mover, which does what we call moveout: it moves the data from WOS to ROS, sorting and compressing it. Let's take a moment to compare how they behave. When you load data directly to the ROS, there's a data parsing operation, then we finish the sorting, and then we compress and write out the columnar data files to stable storage. The next query that comes through executes against the ROS, and it runs as it should, because the ROS is read optimized. Let's repeat the exercise for the WOS. The load operation returns before the sorting and compressing, and before the data is written to persistent storage. Now it's possible for a query to come along, and that query could be responsible for sorting the WOS data in addition to its other processing. The effect on queries isn't predictable until the tuple mover comes along and writes the data to the ROS. Over the years, we've done a lot of comparisons between ROS and WOS. ROS has always been better for sustained load throughput; it achieves much higher records per second without pushing back against the client, and it has ever since we developed the first usable mergeout algorithm. ROS has always been better for predictable query performance, and the ROS has never had the same management complexity and limitations as the WOS. You don't have to pick a memory size and figure out which transactions get to use the pool.
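The behavioral difference walked through above can be summarized in a few lines of conceptual Python (purely an illustration, not Vertica internals): a ROS-style load pays for sorting and compression up front, while a WOS-style load defers that cost to a later moveout, which is exactly when query performance becomes unpredictable.

```python
# Conceptual contrast of the two load paths described above.
import gzip
import json

def load_ros_style(records, sort_key):
    ordered = sorted(records, key=sort_key)             # sort up front
    return gzip.compress(json.dumps(ordered).encode())  # compress, then persist

class WosStyleBuffer:
    def __init__(self):
        self.rows = []                                  # unsorted, uncompressed, in memory

    def load(self, records):
        self.rows.extend(records)                       # fast append-only write

    def moveout(self, sort_key):
        run = load_ros_style(self.rows, sort_key)       # pay sort + compress later
        self.rows.clear()
        return run
```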
The non-persistent nature of the WOS always caused headaches when there were unexpected cluster shutdowns. We also looked at field usage data, and we found that few customers were using the WOS much, especially among those that had studied the issue carefully. So we set out on a mission to improve the ROS to the point where it was always better than both the WOS and the ROS of the past. And now it's true: the ROS is better than both the WOS and the ROS of a couple of years ago. We implemented storage bundling, better catalog object storage, and better tuple mover mergeouts. And now, after extensive QA and customer testing, we've succeeded, and in Vertica 10, we've removed the WOS. Let's talk for a moment about simplicity. One of the best things Mike Stonebreaker said is "no knobs." Anyone want to guess how many knobs we got rid of when we took the WOS out of the product? Twenty-two. There were five knobs to control whether data went to the WOS or the ROS, six controlling the WOS itself, six more to set policies for the tuple mover moveout, and so on. And in my honest opinion, it still wasn't enough control to achieve success in a multi-tenant environment. So the big reason to get rid of the WOS is simplicity, to make the lives of DBAs and users better. We have a long way to go, but we're doing it. On my desk, I keep a jar with a knob in it for each knob in Vertica. When developers add a knob to the product, they have to add a knob to the jar. When they remove a knob, they get to choose one to take out. We have a lot of work to do, but I'm thrilled to report that, in 15 years, 10 is the first release where the number of knobs ticked downward. Getting back to the WOS, I've saved the most important reason to get rid of it for last. We're getting rid of it so we can deliver our vision of the future to our customers. Remember how I said that with Eon and sub-clusters we got all these benefits from shared storage? Guess what can't live in shared storage: the WOS. Remember how a big part of the future was keeping the copies identical to the primary copy? Independent actions of the WOS were at the root of the divergence between copies of the data. You have to admit it when you're wrong: that part of the original design held up as a selling point for a time, but we held onto the idea of a separate ROS and WOS for too long. In Vertica 10, we can finally bid it good riddance. I've covered a lot of ground, so let's put all the pieces together. I've talked a lot about our vision and how we're achieving it, but we also still pay attention to tactical detail. We've been fine-tuning our memory management model to enhance performance. That involves revisiting tens of thousands of lines of code, much like painting the inside of a large building with small paintbrushes. We're getting results, as shown in the chart: in Vertica nine, concurrent monitoring queries use memory from the global catalog pool, and in Vertica 10, they don't. This is only one example of an important detail we're improving. We've also reworked the monitoring tables and the network messages behind them, increased the data we're collecting and analyzing, and improved our quality assurance processes; we're improving on everything. As the story goes, I still have my grandfather's axe; of course, my father had to replace the handle, and I had to replace the head. Along the same lines, we still have Mike Stonebreaker's Vertica. We've replaced the query optimizer twice, and the database designer and storage layer four times each.
The query executed is and it's a free design, like charted out how our code has changed over the years. I found that we don't have much from a long time ago, I did some digging, and you know what we have left in 2007. We have the original curly braces, and a little bit of percent code for handling dates and times. To deliver on our mission to help customers get value from their structured data, with high performance of scale, and in diverse deployment environments. We have the sound architecture roadmap, reviews the best execution strategy and solid tactics. On the architectural front, we're converging in an enterprise, we're extending smart analytic clusters. In query processing, we're redesigning the execution engine for the cloud, as I've told you. There's a lot more than just the fast engine. that you want to learn about our new data support for complex data types, improvements to the query optimizer statistics, or extension to live aggregate projections and flatten tables. You should check out some of the other engineering talk that the big data conference. We continue to stay on top of the details from low level CPU and memory too, to the monitoring management, developing tighter feedback cycles between development, Q&A and customers. And don't forget to check out the rest of the pillars of our roadmap. We have new easier ways to get started with Vertica in the cloud. Engineers have been hard at work on machine learning and security. It's easier than ever to use Vertica with third Party product, as a variety of tools integrations continues to increase. Finally, the most important thing we can do, is to help people get value from structured data to help people learn more about Vertica. So hopefully I left plenty of time for Q&A at the end of this presentation. I hope to hear your questions soon.
UNLIST TILL 4/2 - Vertica @ Uber Scale
>> Sue: Hi, everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. This breakout session is entitled "Vertica @ Uber Scale." My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Girish Baliga, Director of Engineering for Big Data at Uber. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can also visit the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. And as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded, and you'll be able to view it on demand this week. We'll send you a notification as soon as it's ready. So let's get started. Girish, over to you. >> Girish: Thanks a lot, Sue. Good afternoon, everyone. Thanks a lot for joining this session. My name is Girish Baliga, and as Sue mentioned, I manage the interactive and real-time analytics teams at Uber. Vertica is one of the main platforms that we support, and Vertica powers a lot of core business use cases. In today's talk, I wanted to cover two main things. First, how Vertica is powering critical business use cases across a variety of orgs in the company. And second, how we are able to do this at scale and with reliability, using some of the additional functionalities and systems that we have built into the Vertica ecosystem at Uber. And towards the end, I also have a little extra bonus for all of you. I will be sharing an easy way for you to take advantage of many of the ideas and solutions that I'm going to present today, that you can apply to your own Vertica deployments in your companies. So stick around and put on your seat belts, and let's get started on the ride. At Uber, our mission is to ignite opportunity by setting the world in motion. So we are focused on solving mobility problems, and enabling people all over the world to solve their local problems, their local needs, their local issues, in a manner that's efficient, fast, and reliable. As our CEO Dara has said, we want to become the mobile operating system of local cities and communities throughout the world. As of today, Uber is operational in over 10,000 cities around the world. Across our various business lines, we have over 110 million monthly users, who use our Rides services or Eats services, and a whole bunch of other services that we provide at Uber. And just to give you a sense of the scale of our daily operations: in the Rides business, we have over 20 million trips per day, and the Eats business is also catching up, particularly during the recent times that we've been having. So I hope these numbers give you a sense of the scale of the amount of data that we process each and every day to support our users in their analytical and business reporting needs. So who are these users at Uber? Let's take a quick look. Uber, to describe it very briefly, is a lot like Amazon. We are largely an operations and logistics company, and our employee base reflects that.
So over 70% of our employees work in teams, which come under the umbrella of Community Operations and Centers of Excellence. So these are all folks working in various cities and towns that we operate around the world, and run the Uber businesses, as somewhat local businesses responding to local needs, local market conditions, local regulation and so forth. And Vertica is one of the most important tools, that these folks use in their day to day business activities. So they use Vertica to get insights into how their businesses are going, to deeply into any issues that they want to triage , to generate reports, to plan for the future, a whole lot of use cases. The second big class of users, are in our marketplace team. So marketplace is the engineering team, that backs our ride shared business. And as part of this, running this business, a key problem that they have to solve, is how to determine what prices to set, for particular rides, so that we have a good match between supply and demand. So obviously the real time pricing decisions they're made by serving systems, with very detailed and well crafted machine learning models. However, the training data that goes into this models, the historical trends, the insights that go into building these models, a lot of these things are powered by the data that we store, and serve out of Vertica. Similarly, in each business, we have use cases spanning all the way from engineering and back-end systems, to support operations, incentives, growth, and a whole bunch of other domains. So the big class of applications that we support across a lot of these business lines, is dashboards and reporting. So we have a lot of dashboards, which are built by core data analysts teams and shared with a whole bunch of our operations and other teams. So these are dashboards and reports that run, periodically say once a week or once a day even, depending on the frequency of data that they need. And many of these are powered by the data, and the analytics support that we provide on our Vertica platform. Another big category of use cases is for growth marketing. So this is to understand historical trends, figure out what are various business lines, various customer segments, various geographical areas, doing in terms of growth, where it is necessary for us to reinvest or provide some additional incentives, or marketing support, and so forth. So the analysis that backs a lot of these decisions, is powered by queries running on Vertica. And finally, the heart and soul of Uber is data science. So data science is, how we provide best in class algorithms, pricing, and matching. And a lot of the analysis that goes into, figuring out how to build these systems, how to build the models, how to build the various coefficients and parameters that go into making real time decisions, are based on analysis that data scientists run on Vertica systems. So as you can see, Vertica usage spans a whole bunch of organizations and users, all across the different Uber teams and ecosystems. Just to give you some quick numbers, we have over 5000 weekly active, people who run queries at least once a week, to do some critical business role or problem to solve, that they have in their day to day operations. So next, let's see how Vertica fits into the Uber data ecosystem. So when users open up their apps, and request for a ride or order food delivery on each platform, the apps are talking to our serving systems. 
And the serving systems use online storage systems to store the data as the trips and Eats orders are getting processed in real time. For this, we primarily use an in-house built key-value storage system called Schemaless, and an open source system called Cassandra. We also have other systems like MySQL and Redis, which we use for storing various bits of data to support the serving systems. So all of these operations generate a lot of data that we then want to process and analyze, and use for our operational improvements. We have ingestion systems that periodically pull in data from our serving systems and land it in our data lake. At Uber, the data lake is powered by Hadoop, with files stored on HDFS clusters. Once the raw data lands in the data lake, we then have ETL jobs that process these raw datasets and generate modeled and customized datasets, which we then use for further analysis. Once these modeled datasets are available, we load them into our data warehouse, which is entirely powered by Vertica. Then we have a business intelligence layer, with internal tools like QueryBuilder, which is a UI for writing queries and looking at results in the browser, and Dashbuilder, which is a dashboard building tool and report management tool. These are all tools that we have built within Uber, and they can talk to Vertica and run SQL queries to power whatever dashboards and reports they are supporting. So this is what the data ecosystem looks like at Uber. So why Vertica, and what does it really do for us? It powers the insights that we show on dashboards as folks use them, and it also powers the reports that we run periodically. But more importantly, there are some core properties and core feature sets that Vertica provides which allow us to support many of these use cases very well and at scale. So let me take a brief tour of what these are. As I mentioned, Vertica powers Uber's data warehouse. What this means is that we load our core fact and dimension tables onto Vertica. The core fact tables are all the trips, all the Eats orders, and all the other line items for Uber's various businesses, stored as partitioned tables. So think of having one partition per day, as well as dimension tables like cities, users, riders, courier partners, and so forth. So we have both these kinds of datasets, which we load into Vertica. And we have full historical data, all the way from when we launched these businesses to today, so that folks can do deeper longitudinal analysis. They can look at patterns, like how the business has grown from month to month, year to year, the same month over a year, over multiple years, and so forth. And the really powerful thing about Vertica is that most of these queries, even the deep longitudinal queries, run very, very fast. And that's really why we love Vertica: we see query latency P90s, that is, the 90th percentile of all queries that we run on our platform, typically finish in under a minute. That's very important for us because Vertica is used primarily for interactive analytics use cases, and providing SQL query execution times under a minute is critical for our users and business owners to get the most out of analytics and Big Data platforms. Vertica also provides a few advanced features that we use very heavily. As you might imagine, at Uber, one of the most important sets of use cases we have is around geospatial analytics.
In particular, we have some critical internal dashboards, that rely very heavily on being able to restrict datasets by geographic areas, cities, source destination pairs, heat maps, and so forth. And Vertica has a rich array of functions that we use very heavily. We also have, support for custom projections in Vertica. And this really helps us, have very good performance for critical datasets. So for instance, in some of our core fact tables, we have done a lot of query and analysis to figure out, how users run their queries, what kind of columns they use, what combination of columns they use, and what joints they do for typical queries. And then we have laid out our custom projections to maximize performance on these particular dimensions. And the ability to do that through Vertica, is very valuable for us. So we've also had some very successful collaborations, with the Vertica engineering team. About a year and a half back, we had open-sourced a Python Client, that we had built in house to talk to Vertica. We were using this Python Client in our business intelligence layer that I'd shown on the previous slide. And we had open-sourced it after working closely with Eng team. And now Vertica formally supports the Python Client as an open-source project, which you can download to and integrate into your systems. Another more recent example of collaboration is the Vertica Eon mode on GCP. So as most of or at least some of you know, Vertica Eon mode is formally supported on AWS. And at Uber, we were also looking to see if we could run our data infrastructure on GCP. So Vertica team hustled on this, and provided us early preview version, which we've been testing out to see how performance, is impacted by running on the Cloud, and on GCP. And so far, I think things are going pretty well, but we should have some numbers about this very soon. So here I have a visualization of an internal dashboard, that is powered solely by data and queries running on Vertica. So this GIF has sequence have different visualizations supported by this tool. So for instance, here you see a heat map, downgrading heat map of source of traffic demand for ride shares. And then you will see a bunch of arrows here about source destination pairs and the trip lines. And then you can see how demand moves around. So, as the cycles through the various animations, you can basically see all the different kinds of insights, and query shapes that we send to Vertica, which powers this critical business dashboard for our operations teams. All right, so now how do we do all of this at scale? So, we started off with a single Vertica cluster, a few years back. So we had our data lake, the data would land into Vertica. So these are the core fact and dimension tables that I just spoke about. And then Vertica powers queries at our business intelligence layer, right? So this is a very simple, and effective architecture for most use cases. But at Uber scale, we ran into a few problems. So the first issue that we have is that, Uber is a pretty big company at this point, with a lot of users sending almost millions of queries every week. And at that scale, what we began to see was that a single cluster was not able to handle all the query traffic. So for those of you who have done an introductory course, on queueing theory, you will realize that basically, even though you could have all the query is processed through a single serving system. You will tend to see larger and larger queue wait times, as the number of queries pile up. 
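The Python client mentioned above is published as the open-source vertica-python package. A minimal, hypothetical example of using it the way a BI tool would (host, credentials, schema, and query are all invented placeholders); the scaling discussion picks up again right after this.

```python
# Minimal use of the open-source vertica-python client; all names are placeholders.
import vertica_python

conn_info = {
    "host": "vertica-proxy.example.com",
    "port": 5433,
    "user": "analyst",
    "password": "xxxx",
    "database": "analytics_dw",
}

query = """
    SELECT city_id, COUNT(*) AS trips
    FROM trips
    WHERE trip_date = DATE '2020-03-01'
    GROUP BY city_id
    ORDER BY trips DESC
    LIMIT 10;
"""

with vertica_python.connect(**conn_info) as connection:
    cursor = connection.cursor()
    cursor.execute(query)
    for city_id, trips in cursor.fetchall():
        print(city_id, trips)
```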
And what this means in practice for end users, is that they are basically just seeing longer and longer query latencies. But even though the actual query execution time on Vertica itself, is probably less than a minute, their query sitting in the queue for a bunch of minutes, and that's the end user perceived latency. So this was a huge problem for us. The second problem we had was that the cluster becomes a single point of failure. Now Vertica can handle single node failures very gracefully, and it can probably also handle like two or three node failures depending on your cluster size and your application. But very soon, you will see that, when you basically have beyond a certain number of failures or nodes in maintenance, then your cluster will probably need to be restarted or you will start seeing some down times due to other issues. So another example of why you would have to have a downtime, is when you're upgrading software in your clusters. So, essentially we're a global company, and we have users all around the world, we really cannot afford to have downtime, even for one hour slot. So that turned out to be a big problem for us. And as I mentioned, we could have hardware issues. So we we might need to upgrade our machines, or we might need to replace storage or memory due to issues with the hardware in there, due to normal wear and tear, or due to abnormal issues. And so because of all of these things, having a single point of failure, having a single cluster was not really practical for us. So the next thing we did, was we set up multiple clusters, right? So we had a bunch of identities clusters, all of which have the same datasets. So then we would basically load data using ingestion pipelines from our data lake, onto each of these clusters. And then the business intelligence layer would be able to query any of these clusters. So this actually solved most of the issues that I pointed out in the previous slide. So we no longer had a single point of failure. Anytime we had to do version upgrades, we would just take off one cluster offline, upgrade the software on it. If we had node failures, we would probably just take out one cluster, if we had to, or we would just have some spare nodes, which would rotate into our production clusters and so forth. However, having multiple clusters, led to a new set of issues. So the first problem was that since we have multiple clusters, you would end up with inconsistent schema. So one of the things to understand about our platform, is that we are an infrastructure team. So we don't actually own or manage any of the data that is served on Vertica clusters. So we have dataset owners and publishers, who manage their own datasets. Now exposing multiple clusters to these dataset owners. Turns out, it's not a great idea, right? Because they are not really aware of, the importance of having consistency of schemas and datasets across different clusters. So over time, what we saw was that the schema for the same tables would basically get out of order, because they were all the updates are not consistently applied on all clusters. Or maybe they were just experimenting some new columns or some new tables in one cluster, but they forgot to delete it, whatever the case might be. We basically ended up in a situation where, we saw a lot of inconsistent schemas, even across some of our core tables in our different clusters. 
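Stepping back to the queueing point made above: it has a simple textbook illustration. Treating a cluster as a single M/M/1 server is a big simplification of a real Vertica deployment, and the rates below are invented, but it shows how mean wait time blows up as arrivals approach the service rate, which is exactly the "queries pile up" effect.

```python
# Textbook M/M/1 queueing illustration; lam = arrival rate, mu = service rate
# (queries per minute). The numbers are invented for illustration.
def mm1_wait_minutes(lam, mu):
    """Mean time a query waits in queue (not executing): Wq = rho / (mu - lam)."""
    if lam >= mu:
        return float("inf")          # overloaded: the queue grows without bound
    rho = lam / mu                   # utilization
    return rho / (mu - lam)

mu = 60.0                            # cluster completes ~60 queries per minute
for lam in (30, 45, 55, 59):         # increasing query traffic
    print(f"arrival={lam}/min  mean wait={mm1_wait_minutes(lam, mu):.2f} min")
```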
A second issue was, since we had ingestion pipelines that were ingesting data independently into all these clusters, these pipelines could fail independently as well. So what this meant is that if, for instance, the ingestion pipeline into cluster B failed, then the data there would be older than clusters A and C. So, when a query comes in from the BI layer, and if it happens to hit B, you would probably see different results, than you would if you went to a or C. And this was obviously not an ideal situation for our end users, because they would end up seeing slightly inconsistent, slightly different counts. But then that would lead to a bad situation for them where they would not able to fully trust the data that was, and the results and insights that were being returned by the SQL queries and Vertica systems. And then the third problem was, we had a lot of extra replication. So the 20/80 Rule, or maybe even the 90/10 Rule, applies to datasets on our clusters as well. So less than 10% of our datasets, for instance, in 90% of the queries, right? And so it doesn't really make sense for us to replicate all of our data on all the clusters. And so having this set up where we had to do that, was obviously very suboptimal for us. So then what we did, was we basically built some additional systems to solve these problems. So this brings us to our Vertica ecosystem that we have in production today. So on the ingestion side, we built a system called Vertica Data Manager, which basically manages all the ingestion into various clusters. So at this point, people who are managing datasets or dataset owners and publishers, they no longer have to be aware of individual clusters. They just set up their ingestion pipelines with an endpoint in Vertica Data Manager. And the Vertica Data Manager ensures that, all the schemas and data is consistent across all our clusters. And on the query side, we built a proxy layer. So what this ensures is that, when queries come in from the BI layer, the query was forwarded, smartly and with knowledge and data about which cluster up, which clusters are down, which clusters are available, which clusters are loaded, and so forth. So with these two layers of abstraction between our ingestion and our query, we were able to have a very consistent, almost single system view of our entire Vertica deployment. And the third bit, we had put in place, was the data manifest, which were the communication mechanism between ingestion and proxy. So the data manifest basically is a listing of, which tables are available on which clusters, which clusters are up to date, and so forth. So with this ecosystem in place, we were also able to solve the extra replication problem. So now we basically have some big clusters, where all the core tables, and all the tables, in fact, are served. So any query that hits 90%, less so tables, goes to the big clusters. And most of the queries which hit 10% heavily queried important tables, can also be served by many other small clusters, so much more efficient use of resources. So this basically is the view that we have today, of Vertica within Uber, so external to our team, folks, just have an endpoint, where they basically set up their ingestion jobs, and another endpoint where they can forward their Vertica SQL queries. And they are so to a proxy layer. So let's get a little more into details, about each of these layers. So, on the data management side, as I mentioned, we have two kinds of tables. So we have dimension tables. 
So, on the data management side, as I mentioned, we have two kinds of tables. So we have dimension tables. These tables are updated every cycle, so the list of cities, the list of drivers, the list of users, and so forth. These change not so frequently, maybe once a day or so. And since these datasets are not very big, we basically swap them out on every single cycle. Whereas the fact tables, these are tables which have information about our trips or orders and so forth. So these are partitioned. We have one partition roughly per day for the last couple of years, and then we have more of a hierarchical partition setup for older data. So what we do is we load the partitions for the last three days on every cycle. The reason we do that is because not all our data comes in at the same time. So we have updates for trips going over the past two or three days, for instance, where people add ratings to their trips, or provide feedback for drivers and so forth. So we want to capture them all in the row corresponding to that particular trip. And so we reload partitions for the last few days to make sure we capture all those updates. And we also update older partitions if, for instance, records were deleted for retention purposes, or GDPR purposes, for instance, or other regulatory reasons. So we do this less frequently, but these are also updated if necessary. So there are endpoints which allow dataset owners to specify what partitions they want to update. And as I mentioned, data is typically managed using a hierarchical partitioning scheme. So in this way, we are able to make sure that we take advantage of the data being clustered by day, so that we don't have to update all the data at once. So when we are recovering from a cluster event, like a version upgrade or software upgrade, or hardware fix or failure handling, or even when we are adding a new cluster to the system, the data manager takes care of updating the tables and copying all the new partitions, making sure the schemas are all right. And then we verify the data and schema consistency and make sure everything is up to date before we add this cluster to our serving pool and the proxy starts sending traffic to it. The second thing that the data manager provides is consistency. So the main thing we do here is we do atomic updates of our tables and partitions for fact tables, using a two-phase commit scheme. So what we do is we load all the new data into temp tables, in all the clusters, in phase one. And then when all the clusters give us success signals, then we basically promote them to primary and set them as the main serving tables for incoming queries. We also optimize the load using Vertica Data Copy. So what this means is, earlier, in a parallel pipelines scheme, we had to ingest data individually from HDFS clusters into each of the Vertica clusters. That took a lot of HDFS bandwidth. But using this nice feature that Vertica provides called Vertica Data Copy, we just load the data into one cluster and then much more efficiently copy it to the other clusters. So this has significantly reduced our ingestion overheads and sped up our load process. And as I mentioned, as the second phase of the commit, all data is promoted at the same time. Finally, we make sure that all the data is up to date by doing some checks around the number of rows and various other key signals for freshness and correctness, which we compare with the data in the data lake. 
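As a rough illustration of the staged-load-and-promote pattern described above, here is what it might look like in plain Vertica SQL. The table name, file path, and partition key are assumptions for the sketch; Uber's actual Vertica Data Manager tooling is not shown in the talk.

```sql
-- Phase one: land the new day's data in a staging table on each cluster.
CREATE TABLE trips_staging LIKE trips INCLUDING PROJECTIONS;

COPY trips_staging FROM '/data/trips/2020-03-30/*.csv' DELIMITER ',' DIRECT;

-- Phase two: once every cluster reports success, atomically swap the staged
-- partition into the serving table.
SELECT SWAP_PARTITIONS_BETWEEN_TABLES('trips_staging',
                                      '2020-03-30', '2020-03-30',
                                      'trips');

-- Freshness check before the cluster rejoins the serving pool.
SELECT COUNT(*) FROM trips WHERE trip_date = '2020-03-30';
```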
So in terms of schema changes, VDM automatically applies these consistently across all the clusters. So first, what we do is we stage these changes to make sure that they are correct. So this catches errors where someone is trying to do an incompatible update, like changing a column type or something like that. So we make sure that schema changes are validated. And then we apply them to all clusters atomically, again for consistency, and provide an overall consistent view of our data to all our users. So on the proxy side, we have transparent support for replicated clusters for all our users. So the way we handle that is, as I mentioned, the cluster to table mapping is maintained in the manifest database. And when we have an incoming query, the proxy is able to see which cluster has all the tables in that query, and route the query to the appropriate cluster based on the manifest information. Also, the proxy is aware of the health of individual clusters. So if for some reason a cluster is down for maintenance or upgrades, the proxy is aware of this information. And it does the monitoring based on query response and execution times as well. And it uses this information to route queries to healthy clusters, and do some load balancing to ensure that we avoid hotspots on various clusters. So the key takeaways that I have from this talk are primarily these. We started off with single cluster mode on Vertica, and we ran into a bunch of issues around scaling and availability due to cluster downtime. We then set up a bunch of replicated clusters to handle the scaling and availability issues. Then we ran into issues around schema consistency, data staleness, and data replication. So we built an entire ecosystem around Vertica, with abstraction layers around data management and ingestion, and a proxy. And with this setup, we were able to enforce consistency and improve storage utilization. So hopefully this gives you all a brief idea of how we have been able to scale Vertica usage at Uber, and power some of our most business critical and important use cases. So as I mentioned at the beginning, I have an interesting and simple extra update for you. So an easy way in which you all can take advantage of many of the features that we have built into our ecosystem is to use the Vertica Eon mode. So the Vertica Eon mode allows you to set up multiple clusters with consistent data updates, and set them up at various different sizes to handle different query loads. And it automatically handles many of these issues that I mentioned in our ecosystem. So do check it out. We've also been trying it out on GCP, and initial results look very, very promising. So thank you all for joining me on this talk today. I hope you guys learned something new. And hopefully you took away something that you can also apply to your systems. We have a bit more time for some questions. So I'll pause for now and take any questions.
A Technical Overview of Vertica Architecture
>> Paige: Hello, everybody and thank you for joining us today on the Virtual Vertica BDC 2020. Today's breakout session is entitled A Technical Overview of the Vertica Architecture. I'm Paige Roberts, Open Source Relations Manager at Vertica and I'll be your host for this webinar. Now joining me is Ryan Role-kuh? Did I say that right? (laughs) He's a Vertica Senior Software Engineer. >> Ryan: So it's Roelke. (laughs) >> Paige: Roelke, okay, I got it, all right. Ryan Roelke. And before we begin, I want to be sure and encourage you guys to submit your questions or your comments during the virtual session while Ryan is talking as you think of them as you go along. You don't have to wait to the end, just type in your question or your comment in the question box below the slides and click submit. There'll be a Q and A at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to get back to you offline. Now, alternatively, you can visit the Vertica forums to post your question there after the session as well. Our engineering team is planning to join the forums to keep the conversation going, so you can have a chat afterwards with the engineer, just like any other conference. Now also, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides and before you ask, yes, this virtual session is being recorded and it will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now, let's get started. Over to you, Ryan. >> Ryan: Thanks, Paige. Good afternoon, everybody. My name is Ryan and I'm a Senior Software Engineer on Vertica's Development Team. I primarily work on improving Vertica's query execution engine, so usually in the space of making things faster. Today, I'm here to talk about something that's more general than that, so we're going to go through a technical overview of the Vertica architecture. So the intent of this talk, essentially, is to just explain some of the basic aspects of how Vertica works and what makes it such a great database software and to explain what makes a query execute so fast in Vertica, we'll provide some background to explain why other databases don't keep up. And we'll use that as a starting point to discuss an academic database that paved the way for Vertica. And then we'll explain how Vertica design builds upon that academic database to be the great software that it is today. I want to start by sharing somebody's approximation of an internet minute at some point in 2019. All of the data on this slide is generated by thousands or even millions of users and that's a huge amount of activity. Most of the applications depicted here are backed by one or more databases. Most of this activity will eventually result in changes to those databases. For the most part, we can categorize the way these databases are used into one of two paradigms. First up, we have online transaction processing or OLTP. OLTP workloads usually operate on single entries in a database, so an update to a retail inventory or a change in a bank account balance are both great examples of OLTP operations. Updates to these data sets must be visible immediately and there could be many transactions occurring concurrently from many different users. OLTP queries are usually key value queries. The key uniquely identifies the single entry in a database for reading or writing. 
Early databases and applications were probably designed for OLTP workloads. This example on the slide is typical of an OLTP workload. We have a table, accounts, such as for a bank, which tracks information for each of the bank's clients. An update query, like the one depicted here, might be run whenever a user deposits $10 into their bank account. Our second category is online analytical processing or OLAP which is more about using your data for decision making. If you have a hardware device which periodically records how it's doing, you could analyze trends of all your devices over time to observe what data patterns are likely to lead to failure or if you're Google, you might log user search activity to identify which links helped your users find the answer. Analytical processing has always been around but with the advent of the internet, it happened at scales that were unimaginable, even just 20 years ago. This SQL example is something you might see in an OLAP workload. We have a table, searches, logging user activity. We will eventually see one row in this table for each query submitted by users. If we want to find out what time of day our users are most active, then we could write a query like this one on the slide which counts the number of unique users running searches for each hour of the day. So now let's rewind to 2005. We don't have a picture of an internet minute in 2005, we don't have the data for that. We also don't have the data for a lot of other things. The term Big Data is not quite yet on anyone's radar and The Cloud is also not quite there or it's just starting to be. So if you have a database serving your application, it's probably optimized for OLTP workloads. OLAP workloads just aren't mainstream yet and database engineers probably don't have them in mind. So let's innovate. It's still 2005 and we want to try something new with our database. Let's take a look at what happens when we do run an analytic workload in 2005. Let's use as a motivating example a table of stock prices over time. In our table, the symbol column identifies the stock that was traded, the price column identifies the new price and the timestamp column indicates when the price changed. We have several other columns which, we should know that they're there, but we're not going to use them in any example queries. This table is designed for analytic queries. We're probably not going to make any updates or look at individual rows since we're logging historical data and want to analyze changes in stock price over time. Our database system is built to serve OLTP use cases, so it's probably going to store the table on disk in a single file like this one. Notice that each row contains all of the columns of our data in row major order. There's probably an index somewhere in the memory of the system which will help us to point lookups. Maybe our system expects that we will use the stock symbol and the trade time as lookup keys. So an index will provide quick lookups for those columns to the position of the whole row in the file. If we did have an update to a single row, then this representation would work great. We would seek to the row that we're interested in, finding it would probably be very fast using the in-memory index. And then we would update the file in place with our new value. On the other hand, if we ran an analytic query like we want to, the data access pattern is very different. The index is not helpful because we're looking up a whole range of rows, not just a single row. 
As a result, the only way to find the rows that we actually need for this query is to scan the entire file. We're going to end up scanning a lot of data that we don't need, and that won't just be the rows that we don't need; there are many other columns in this table, with information about who made the transaction and so on, and we'll also be scanning through those columns for every single row in this table. That could be a very serious problem once we consider the scale of this file. Stocks change a lot, we probably have thousands or millions or maybe even billions of rows that are going to be stored in this file and we're going to scan all of these extra columns for every single row. If we tried out our stocks use case behind the desk of a Fortune 500 company, then we're probably going to be pretty disappointed. Our queries will eventually finish, but it might take so long that we don't even care about the answer anymore by the time that they do. Our database is not built for the task we want to use it for. Around the same time, a team of researchers in the North East had become aware of this problem and they decided to dedicate their time and research to it. These researchers weren't just anybody. The fruits of their labor, which we now like to call the C-Store Paper, were published by eventual Turing Award winner, Mike Stonebraker, along with several other researchers from elite universities. This paper presents the design of a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized. That sounds exactly like what we want for our stocks use case. Reasoning about what makes our query executions so slow brought our researchers to the Memory Hierarchy, which essentially is a visualization of the relative speeds of different parts of a computer. At the top of the hierarchy, we have the fastest data units, which are, of course, also the most expensive to produce. As we move down the hierarchy, components get slower but also much cheaper and thus you can have more of them. Our OLTP database's data is stored in a file on the hard disk. We scanned the entirety of this file, even though we didn't need most of the data and now it turns out, that is just about the slowest thing that our query could possibly be doing, by over two orders of magnitude. It should be clear, based on that, that the best thing we can do to optimize our query's execution is to avoid reading unnecessary data from the disk and that's what the C-Store researchers decided to look at. The key innovation of the C-Store paper does exactly that. Instead of storing data in a row major order, in a large file on disk, they transposed the data and stored each column in its own file. Now, if we run the same select query, we read only the relevant columns. The unnamed columns don't factor into the table scan at all since we don't even open the files. Zooming out to an internet scale sized data set, we can appreciate the savings here a lot more. But we still have to read a lot of data that we don't need to answer this particular query. Remember, we had two predicates, one on the symbol column and one on the timestamp column. Our query is only interested in AAPL stock, but we're still reading rows for all of the other stocks. So what can we do to optimize our disk read even more? Let's first partition our data set into different files based on the timestamp date. This means that we will keep separate files for each date. 
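As a concrete reference for the next few steps, the running example query with its two predicates might look roughly like the SQL below. The table and column names are assumptions, since the slide itself is not reproduced in the transcript.

```sql
-- Average AAPL price over a date range: one predicate on symbol,
-- one on the timestamp (names are assumed for illustration).
SELECT AVG(price)
FROM stocks
WHERE symbol = 'AAPL'
  AND trade_ts BETWEEN '2005-01-01' AND '2005-01-31';
```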
When we query the stocks table, the database knows all of the files we have to open. If we have a simple predicate on the timestamp column, as our sample query does, then the database can use it to figure out which files we don't have to look at at all. So now all of the disk reads that we have to do to answer our query will produce rows that pass the timestamp predicate. This eliminates a lot of wasteful disk reads. But not all of them. We do have another predicate on the symbol column, where symbol equals AAPL. We'd like to avoid disk reads of rows that don't satisfy that predicate either. And we can avoid those disk reads by clustering all the rows that match the symbol predicate together. If all of the AAPL rows are adjacent, then as soon as we see something different, we can stop reading the file. We won't see any more rows that can pass the predicate. Then we can use the positions of the rows we did find to identify which pieces of the other columns we need to read. One technique that we can use to cluster the rows is sorting. So we'll use the symbol column as a sort key for all of the columns. And that way we can reconstruct a whole row by seeking to the same row position in each file. It turns out, having sorted all of the rows, we can do a bit more. We don't have any more wasted disk reads but we can still be more efficient with how we're using the disk. We've clustered all of the rows with the same symbol together so we don't really need to bother repeating the symbol so many times in the same file. Let's just write the value once and say how many rows we have. This run length encoding technique can compress large numbers of rows into a small amount of space. In this example, we de-duplicate just a few rows but you can imagine de-duplicating many thousands of rows instead. This encoding is great for reducing the amount of disk we need to read at query time, but it also has the additional benefit of reducing the total size of our stored data. Now our query requires substantially fewer disk reads than it did when we started. Let's recap what the C-Store paper did to achieve that. First, we transposed our data to store each column in its own file. Now, queries only have to read the columns used in the query. Second, we partitioned the data into multiple file sets so that all rows in a file have the same value for the partition column. Now, a predicate on the partition column can skip non-matching file sets entirely. Third, we selected a column of our data to use as a sort key. Now rows with the same value for that column are clustered together, which allows our query to stop reading data once it finds non-matching rows. Finally, sorting the data this way enables high compression ratios, using run length encoding, which minimizes the size of the data stored on the disk. The C-Store system combined each of these innovative ideas to produce an academically significant result. And if you used it behind the desk of a Fortune 500 company in 2005, you probably would've been pretty pleased. 
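To make the run length encoding idea concrete, here is a tiny, self-contained sketch in Python. It is purely illustrative and is not Vertica's encoder.

```python
# Minimal illustration of run length encoding (RLE) on a sorted column.
# Vertica's real encoder is far more sophisticated; this just shows the idea.

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [v for v, count in runs for _ in range(count)]

symbols = ["AAPL"] * 4 + ["GOOG"] * 3 + ["MSFT"] * 2
encoded = rle_encode(symbols)
print(encoded)                      # [('AAPL', 4), ('GOOG', 3), ('MSFT', 2)]
assert rle_decode(encoded) == symbols
```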
But it's not 2005 anymore and the requirements of a modern database system are much stricter. So let's take a look at how C-Store fares in 2020. First of all, we have designed the storage layer of our database to optimize a single query in a single application. Our design optimizes the heck out of that query and probably some similar ones, but if we want to do anything else with our data, we might be in a bit of trouble. What if we just decide we want to ask a different question? For example, in our stock example, what if we want to plot all the trades made by a single trader over a large window of time? How do our optimizations for the previous query measure up here? Well, our data's partitioned on the trade date, that could still be useful, depending on our new query. If we want to look at a trader's activity over a long period of time, we would have to open a lot of files. But if we're still interested in just a day's worth of data, then this optimization is still an optimization. Within each file, our data is ordered on the stock symbol. That's probably not too useful anymore, the rows for a single trader aren't going to be clustered together so we will have to scan all of the rows in order to figure out which ones match. You could imagine a worse design, but as it becomes crucial to optimize this new type of query, then we might have to go as far as reconfiguring the whole database. The next problem is one of scale. One server is probably not good enough to serve a database in 2020. C-Store, as described, runs on a single server and stores lots of files. What if the data overwhelms this small system? We could imagine exhausting the file system's inode limit with lots of small files due to our partitioning scheme. Or we could imagine something simpler, just filling up the disk with huge volumes of data. But there's an even simpler problem than that. What if something goes wrong and C-Store crashes? Then our data is no longer available to us until the single server is brought back up. A third concern, another one of scalability, is that one deployment does not really suit all the possible use cases we could imagine. We haven't really said anything about being flexible. A contemporary database system has to integrate with many other applications, which might themselves have pretty restricted deployment options. Or the demands imposed by our workloads have changed and the setup you had before doesn't suit what you need now. C-Store doesn't do anything to address these concerns. What the C-Store paper did do was lead very quickly to the founding of Vertica. Vertica's architecture and design are essentially all about bringing the C-Store designs into an enterprise software system. The C-Store paper was just an academic exercise so it didn't really need to address any of the hard problems that we just talked about. But Vertica, the first commercial database built upon the ideas of the C-Store paper, would definitely have to. This brings us back to the present to look at how an analytic query runs in 2020 on the Vertica Analytic Database. Vertica takes the key idea from the paper, that we can significantly improve query performance by changing the way our data is stored, and gives its users the tools to customize their storage layer in order to heavily optimize really important or commonly run queries. On top of that, Vertica is a distributed system which allows it to scale up to internet-sized data sets, as well as have better reliability and uptime. We'll now take a brief look at what Vertica does to address the three inadequacies of the C-Store system that we mentioned. To avoid locking into a single database design, Vertica provides tools for the database user to customize the way their data is stored. To address the shortcomings of a single node system, Vertica coordinates processing among multiple nodes. 
To acknowledge the large variety of desirable deployments, Vertica does not require any specialized hardware and has many features which smoothly integrate it with a Cloud computing environment. First, we'll look at the database design problem. We're a SQL database, so our users are writing SQL and describing their data in SQL, with the Create Table statement. Create Table is a logical description of what your data looks like, but it doesn't specify the way that it has to be stored. For a single Create Table, we could imagine a lot of different storage layouts. Vertica adds some extensions to SQL so that users can go even further than Create Table and describe the way that they want the data to be stored. Using terminology from the C-Store paper, we provide the Create Projection statement. Create Projection specifies how table data should be laid out, including column encoding and sort order. A table can have multiple projections, each of which could be ordered on different columns. When you query a table, Vertica will answer the query using the projection which it determines to be the best match. Referring back to our stock example, here's a sample Create Table and Create Projection statement. Let's focus on our heavily optimized example query, which had predicates on the stock symbol and date. We specify that the table data is to be partitioned by date. The Create Projection statement here is excellent for this query. We specify using the order by clause that the data should be ordered according to our predicates. We'll use the timestamp as a secondary sort key. Each projection stores a copy of the table data. If you don't expect to need a particular column in a projection, then you can leave it out. Our average price query didn't care about who did the trading, so maybe our projection design for this query can leave the trader column out entirely. If the question we want to ask ever does change, maybe we already have a suitable projection, but if we don't, then we can create another one. This example shows another projection which would be much better at identifying trends of traders, rather than identifying trends for a particular stock. Next, let's take a look at our second problem: how should you decide what design is best for your queries? Well, you could spend a lot of time figuring it out on your own, or you could use Vertica's Database Designer tool, which will help you by automatically analyzing your queries and spitting out a design which it thinks is going to work really well. If you want to learn more about the Database Designer tool, then you should attend the session Vertica Database Designer: Today and Tomorrow, which will tell you a lot about what the Database Designer does and some recent improvements that we have made. 
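Since the statements on the slide aren't reproduced in the transcript, here is a sketch of what such Create Table and Create Projection statements might look like in Vertica SQL. The table and column names are assumptions, and the SEGMENTED BY clause shown on the second projection is covered a little later in the talk.

```sql
-- A stocks table, partitioned by trade date (names are assumed).
CREATE TABLE stocks (
    symbol   VARCHAR(10),
    price    NUMERIC(12, 2),
    trade_ts TIMESTAMP,
    trader   VARCHAR(64)
)
PARTITION BY trade_ts::DATE;

-- Projection tuned for the average-price query: sorted on the predicate
-- columns, with the unused trader column left out.
CREATE PROJECTION stocks_by_symbol AS
SELECT symbol, price, trade_ts
FROM stocks
ORDER BY symbol, trade_ts;

-- A second projection that would serve per-trader queries better, segmented
-- across nodes by a hash of the trader column with one redundant copy.
CREATE PROJECTION stocks_by_trader AS
SELECT trader, symbol, price, trade_ts
FROM stocks
ORDER BY trader, trade_ts
SEGMENTED BY HASH(trader) ALL NODES KSAFE 1;
```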
Okay, now we'll move to our next problem. (laughs) The challenge that one server does not fit all. In 2020, we have several orders of magnitude more data than we had in 2005. And you need a lot more hardware to crunch it. It's not tractable to keep multiple petabytes of data in a system with a single server. So Vertica doesn't try. Vertica is a distributed system, so we'll deploy multiple servers which work together to maintain such a high data volume. In a traditional Vertica deployment, each node keeps some of the data in its own locally-attached storage. Data is replicated so that there is a redundant copy somewhere else in the system. If any one node goes down, then the data that it served is still available on a different node. We'll also have it so that in the system, there's no special node with extra duties. All nodes are created equal. This ensures that there is no single point of failure. Rather than replicate all of your data, Vertica divvies it up amongst all of the nodes in your system. We call this segmentation. The way data is segmented is another parameter of storage customization and it can definitely have an impact upon query performance. A common way to segment data is by using a hash expression, which essentially randomizes the node that a row of data belongs to, but with a guarantee that the same data will always end up in the same place. Describing the way data is segmented is another part of the Create Projection statement, as seen in this example. Here we segment on the hash of the symbol column, so all rows with the same symbol will end up on the same node. For each row that we load into the system, we'll apply our segmentation expression. The result determines which segment the row belongs to and then we'll send the row to each node which holds a copy of that segment. In this example, our projection is marked KSAFE 1, so we will keep one redundant copy of each segment. When we load a row, we might find that its segment has copies on Node One and Node Three, so we'll send a copy of the row to each of those nodes. If Node One is temporarily disconnected from the network, then Node Three can serve the other copy of the segment so that the whole system remains available. The last challenge we brought up from the C-Store design was that one deployment does not fit all. Vertica's cluster design neatly addresses many of our concerns here. Our use of segmentation to distribute data means that a Vertica system can scale to any size of deployment. And since we lack any special hardware or nodes with special purposes, Vertica servers can run anywhere, on premise or in the Cloud. But let's suppose you need to scale out your cluster to rise to the demands of a higher workload. Suppose you want to add another node. This changes the division of the segmentation space. We'll have to re-segment every row in the database to find its new home and then we'll have to move around any data that belongs to a different segment. This is a very expensive operation, not something you want to be doing all that often. Traditional Vertica doesn't solve that problem especially well, but Vertica Eon Mode definitely does. Vertica's Eon Mode is a large set of features which are designed with a Cloud computing environment in mind. One feature of this design is elastic throughput scaling, which is the idea that you can smoothly change your cluster size without having to pay the expenses of shuffling your entire database. Vertica Eon Mode had an entire session dedicated to it this morning. I won't say any more about it here, but maybe you already attended that session or if you haven't, then I definitely encourage you to listen to the recording. If you'd like to learn more about the Vertica architecture, then you'll find on this slide links to several of the academic conference publications: these four papers here, as well as the Vertica Seven Years Later paper, which describes some of the Vertica designs seven years after the founding, and also a paper about the innovations of Eon Mode. And of course, the Vertica documentation is an excellent resource for learning more about what's going on in a Vertica system. I hope you enjoyed learning about the Vertica architecture. I would be very happy to take all of your questions now. 
Thank you for attending this session.
Migrating Your Vertica Cluster to the Cloud
>> Jeff: Hello everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's break-out session has been titled, "Migrating Your Vertica Cluster to the Cloud." I'm Jeff Healey, and I'm in Vertica marketing. I'll be your host for this break-out session. Joining me here are Sumeet Keswani and Chris Daly, Vertica product technology engineers and key members of our customer success team. Before we begin, I encourage you to submit questions and comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer them offline. And alternatively, you can visit Vertica forums at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Sumeet. >> Sumeet: Thank you, Jeff. Hello everyone, my name is Sumeet Keswani, and I will be talking about planning to deploy or migrate your Vertica cluster to the Cloud. So you may be moving an on-prem cluster or setting up a new cluster in the Cloud. And there are several design and operational considerations that will come into play. You know, some of these are cost, which industry you are in, or which expertise you have in which Cloud platform. And there may be a personal preference too. After that, you know, there will be some operational considerations like VM and cluster sizing, what Vertica mode you want to deploy, Eon or Enterprise. It depends on your use case. What are the DevOps skills available, you know, what elasticity and separation you need, you know, what is your backup and DR strategy, what do you want in terms of high availability. And you will have to think about, you know, how much data you have and where it's going to live. And in order to understand the cost, or the cost and the benefit of a deployment, you will have to understand the access patterns and how you are moving data to and from the Cloud. So things to consider before you move a Vertica deployment to the Cloud: one thing to keep in mind is, virtual CPUs, or CPUs in the Cloud, are not the same as the usual CPUs that you've been familiar with in your data center. A vCPU is half of a CPU because of hyperthreading. There is definitely the noisy neighbor effect. Depending on what other things are hosted in the Cloud environment, you may occasionally see performance issues. There are I/O limitations on the instance that you provision, so what that really means is you can't always scale up. You might have to scale out, basically, you have to add more instances rather than getting bigger or right-sized instances. Finally, there is an important distinction here. Virtualization is not free. There can be significant overhead to virtualization. It could be as much as 30%, so when you size and scale your clusters, you must keep that in mind. 
Now the other important aspect is, you know, where you put your Vertica cluster is important. The choice of the region, how far it is from your various office locations, and where the data will live with respect to the cluster. And remember, popular locations can fill up. So if you want to scale out, additional capacity may or may not be available. So these are things you have to keep in mind when picking or choosing your Cloud platform and your deployment. So at this point, I want to make a plug for Eon mode. Eon mode is the latest mode, a Cloud mode from Vertica. It has been designed with Cloud economics in mind. It uses shared storage, which is durable, available, and very cheap, like S3 storage or Google Cloud storage. It has been designed for quick scaling, like scale out, and highly elastic deployments. It has also been designed for high workload isolation, where each application or user group can be isolated from the other ones, so that they'll be billed and monitored separately, without affecting each other. But there are some disadvantages, or perhaps, you know, there's a cost for using Eon mode. While S3 storage itself is cheap, accessing it is neither cheap nor fast. So there is a high latency of I/O when accessing data from S3, and there is an API and data access cost associated with accessing your data in S3. Vertica in Eon mode has a pay as you go model, which you know, works for some people and does not work for others. And so therefore it is important to keep that in mind. And performance can be a little bit variable here, because it depends on cache, it depends on the local depot, which is a cache, and it is not as predictable as EE mode, so that's another trade-off. So let's spend about a minute and see what a Vertica cluster in Eon mode looks like. A Vertica cluster in Eon mode has S3 as the durability layer where all the data sits. There are subclusters, which are essentially just aggregation groups of separate compute, which will service different workloads. So in this example, you may have two subclusters, one servicing ETL workload and the other one servicing (mic interference obscures speaking). These clusters are isolated, and they do not affect each other's performance. This allows you to scale them independently and isolate workloads. So this is the new Vertica Eon mode, which has been specifically designed by us for use in the Cloud. But beyond this, you can use EE mode or Eon mode in the Cloud, it really depends on what your use case is. But both of these are possible, and we highly recommend Eon mode wherever possible. Okay, let's talk a little bit about what we mean by Vertica support in the Cloud. Now as you know, a Cloud is a shared data center, right. Performance in the Cloud can vary. It can vary between regions, availability zones, time of the day, choice of instance type, what concurrency you use, and of course the noisy neighbor effect. You know, we in Vertica, we performance, load, and stress test our product before every release. We have a bunch of use cases, we go through all of them, make sure that we haven't, you know, regressed any performance, and make sure that it works up to standards and gives you the high performance that you've come to expect. However, your solution or your workload is unique to you, and it is still your responsibility to make sure that it is tuned appropriately. To do this, one of the easiest things you can do is, you know, pick a tested operating system and allocate the virtual machine, you know, with enough resources. 
It's something that we recommend, because we have tested it thoroughly. It goes a long way in giving you predictability. So after this I would like to go into the various Cloud platforms that Vertica has worked on. And I'll start with AWS, and my colleague Chris will speak about Azure and GCP, and our thoughts going forward. So without further ado, let's start with the Amazon Web Services platform. So this is Vertica running on the Amazon Web Services platform. So as you probably are all aware, Amazon Web Services is the market leader in this space, and indeed really our biggest provider by far, and they have been here for a very long time. And Vertica has a deep integration in the Amazon Web Services space. We provide a marketplace offering which has both a pay as you go and a bring your own license model. We have many, you know, knowledge base articles, best practices, scripts, and resources that help you configure and use a Vertica database in the Cloud. We have had several customers in the Cloud for many, many years now, and we have managed and console-based point and click deployments, you know, for ease of use in the Cloud. So Vertica has a deep integration in the Amazon space, and has been there for quite a bit now. So we have accumulated a lot of experience here. So let's talk about sizing on AWS. And sizing on any platform comes down to, you know, these four or five different things. It comes down to picking the right instance type, picking the right disk volume and type, tuning and optimizing your networking, and finally, you know, some operational concerns like security, maintainability, and backup. So let's go into each one of these in the AWS ecosystem. So the choice of instance type is one of the important choices that you will make. In Eon mode, you know, you don't really need persistent disk. You should probably choose ephemeral disk because it gives you extra speed and it comes with the instance type. We highly recommend the i3.4x instance types, which are very economical and have a big, 4 terabyte depot or cache per node. The i3.metal is similar to the i3.4, but has got significantly better performance, for those subclusters that need this extra oomph. The i3.2 is good for scale out of small ad hoc clusters. You know, they have a smaller cache and lower performance but it's cheap enough to use very indiscriminately. If you are in EE mode, well, we don't use S3 as the layer of durability. Your local volumes are where we persist the data. Hence you do need an EBS volume in EE mode. In order to make sure that, you know, the instance or the deployment is manageable, you might have to use some sort of a software RAID array over the EBS volumes. The most common instance types you see in EE mode are the r4.4x, the c4, or the m4 instance types. And then of course for temp space and depot we always recommend instance volumes. They're just much faster. Okay. So let's talk about optimizing or tuning your network. So the best thing you can do about tuning your network, especially in Eon mode but in other modes too, is to get a VPC S3 endpoint. This is essentially a route table entry that makes sure that all traffic between your cluster and S3 goes over an internal fabric. This makes it much faster, you don't pay for egress cost, especially if you're doing external tables or your communal storage, but you do need to create it. Many times people will forget to do it. So you really do have to create it. And best of all, it's free. 
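For reference, creating the gateway endpoint with the AWS CLI might look something like the command below. The VPC ID, route table ID, and region are placeholders you would replace with your own values.

```sh
# Create a gateway-type VPC endpoint so S3 traffic stays on the AWS fabric.
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Gateway \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0123456789abcdef0
```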
It doesn't cost you anything extra. You just have to create it during cluster creation time, and there's a significant performance difference when using it. The next thing about tuning your network is, you know, sizing it correctly. Pick the closest geographical region to where you'll consume the data. Pick the right availability zone. We highly recommend using cluster placement groups. In fact, they are required for the stability of the cluster. A cluster placement group essentially approximates the notion of a rack. Nodes in a cluster placement group are, you know, physically closer to each other than they would otherwise be. And this allows, you know, a 10 Gbps, bidirectional, TCP/IP flow between the nodes. And this makes sure that, you know, you get a high number of gigabits per second between nodes. As you probably are all aware, the Cloud does not support broadcast or UDP broadcast. Hence you must use point-to-point UDP for spread in the Cloud, or in AWS. Beyond that, you know, point-to-point UDP does not scale very well beyond 20 nodes. So you know, as your cluster sizes increase, you must switch over to large cluster mode. And finally, use instances with enhanced networking or SR-IOV support. Again, it's free, it comes with the choice of the instance type and the operating system. We highly recommend it, it makes a big difference in terms of how your workload will perform. So let's talk a little bit about security, configuration, and orchestration. As I said, we provide CloudFormation scripts for ease of deployment. You can use the MC point and click. With regard to security, you know, Vertica does support instance profiles out of the box in Amazon. We recommend you use it. This is highly desirable so that you're not passing access keys and secret keys around. If you use our marketplace image, we have picked the latest operating systems, we have patched them, and Amazon actually validates everything on marketplace and scans it for security vulnerabilities. So you get that for free. We do some basic configuration, like we disable root ssh access, we disallow any password access, we turn on encryption. And we run a basic set of security checks to make sure that the image is secure. Of course, it could be made more secure. But we try to balance out security, performance, and convenience. And finally, let's talk about backups. Especially in Eon mode I get the question, "Do we really need to back up our system, since the data is in S3?" And the answer is yes, you do. Because you know, S3's not going to protect you against an accidental drop table. You know, S3 has a finite amount of reliability, durability, and availability. And you may want to be able to restore data differently. Also, backups are important if you're doing DR, or if you have an additional cluster in a different region. The other cluster can be considered a backup. And finally, you know, why not create a backup or a disaster recovery cluster, you know, storage is cheap in the Cloud. So you know, we highly recommend you use it. 
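As a rough sketch of what a backup setup can look like, here is a minimal vbr configuration. The database name, hosts, and paths are placeholders, and the exact keys should be checked against the Vertica documentation for your version.

```ini
; Hypothetical minimal vbr configuration (backup_daily.ini).
; All names and paths below are placeholders.
[Misc]
snapshotName = daily_backup
restorePointLimit = 3

[Database]
dbName = analytics

[Transmission]

[Mapping0]
dbNode = v_analytics_node0001
backupHost = backup-host-01
backupDir = /backups/vertica
```

The backup itself would then be run with something like vbr --task backup --config-file backup_daily.ini.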
So with this, I would like to hand it over to my colleague Christopher Daly, who will talk about the other two platforms that we support, that is, Google and Azure. Over to you, Chris, thank you. >> Chris: Thanks, Sumeet, and hi everyone. So while there's no argument that we here at Vertica have a long history of running within the Amazon Web Services space, there are other alternative Cloud service providers where we do have a presence, such as Google Cloud Platform, or GCP. For those of you who are unfamiliar with GCP, it's considered the third-largest Cloud service provider in the marketspace, and it's priced very competitively to its peers. It has a lot of similarities to AWS in the products and services that it offers, but it tends to be the go-to place for newer businesses or startups. We officially started supporting GCP a little over a year ago with our first entry into their GCP marketplace, a solution that deployed a fully-functional and ready-to-use Enterprise mode cluster. We followed up on that with the release and the support of Google storage buckets, and now I'm extremely pleased to announce that with the launch of Vertica 10, we're officially supporting Eon mode architecture in GCP as well. But that's not all, as we're adding additional offerings into the GCP marketplace. With the launch of version 10 we'll be introducing a second listing in the marketplace that allows for the deployment of an Eon mode cluster, all driven by our own management console. This will allow customers to quickly spin up Eon-based clusters within the GCP space. And if that wasn't enough, I'm also pleased to tell you that very soon after the launch we're going to be offering Vertica by the hour in GCP as well. And while we've done a lot to automate the solutions coming out of the marketplace, we recognize the simple fact that for a lot of you, building your cluster manually is really the only option. So with that in mind, let's talk about the things you need to understand in GCP to get that done. So forgive me if you think this slide looks familiar. Nope, it's not an erroneous duplicate slide from Sumeet's AWS section, it's merely an acknowledgement of all the things you need to consider for running Vertica in the Cloud. In Vertica, the choice of the operational mode will dictate some of the choices you'll need to make in the infrastructure, particularly around storage. Just like on-prem solutions, you'll need to understand the disk and networking capacities to get the most out of your cluster. And one of the most attractive things in GCP is the pricing, as it tends to run a little less than the others. But it does translate into fewer choices and options within the environment. If nothing else, I want you to take one thing away from this slide, and Sumeet said this about AWS earlier: VMs running in the GCP space run on top of hardware that has hyperthreading enabled, and a vCPU doesn't equate to a core, but rather a processing thread. This becomes particularly important if you're moving from an on-prem environment into the Cloud, because a physical Vertica node with 32 cores is not the same thing as a VM with 32 vCPUs. In fact, with 32 vCPUs, you're only getting about 16 cores worth of performance. GCP does offer a handful of VM types, which they categorize by letter, but for us, most of these don't make great choices for Vertica nodes. The M series, however, does offer a good core to memory ratio, especially when you're looking at the high-mem variants. Also keep in mind, performance in I/O, such as network and disk, is partially dependent on the VM size, so customers in the GCP space should be focusing on 16 vCPU VMs and above for their Vertica nodes. Disk options in GCP can be broken down into two basic types: persistent disks and local disks, which are ephemeral. Persistent disks come in two forms, standard or SSD. 
For Vertica in Eon mode, we recommend that customers use persistent SSD disks for the catalog, and either local SSD disks or persistent SSD disks for the depot and the temp space. A couple of things to think about here, though. Persistent disks are provisioned as a single device with a settable size. Local disks are provisioned as multiple disk devices with a fixed size, requiring you to use some kind of software RAIDing to create a single storage device. So while local SSD disks provide much more throughput, you're using CPU resources to maintain that RAID set, so you're giving something up; it's a little bit of a trade-off. Persistent disks offer redundancy, either within the zone that they exist in or within the region, and if you're selecting regional redundancy, the disks are replicated across multiple zones in the region. This does have an effect on the performance of the VM, so we don't recommend it. What we do recommend is zonal redundancy when you're using persistent disks, as it gives you that redundancy level without actually affecting the performance. Remember also, in the Cloud space, all I/O is network I/O, as disks are basically block storage devices. This means that disk actions can and will slow down network traffic. And finally, the storage bucket access in GCP is based on GCP interoperability mode, which means that it's basically compliant with the AWS S3 API. In interoperability mode, access to the bucket is granted by a key pair that GCP refers to as HMAC keys. HMAC keys can be generated for individual users or for service accounts. We recommend that when you're creating HMAC keys, you choose a service account to ensure that the keys are not tied to a single employee. When thinking about storage for Enterprise mode, things change a little bit. We still recommend persistent SSD disks over standard ones. However, the use of local SSD disks for anything other than temp space is highly discouraged. I said it before, local SSD disks are ephemeral, meaning that the data's lost if the machine is turned off or goes down. So not really a place you want to store your data. In GCP, multiple persistent disks placed into a software RAID set do not create more throughput like you can find in other Clouds. The I/O saturation usually hits the VM limit long before it hits the disk limit. In fact, performance of a persistent disk is determined not just by the size of the disk but also by the size of the VM. So a good rule of thumb in GCP for maximizing your I/O throughput with persistent disks is that performance tends to max out at two terabytes for SSDs and 10 terabytes for standard disks. 
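Going back to the HMAC keys mentioned above, generating a key pair for a service account can be done with gsutil; the project ID and service account e-mail below are placeholders.

```sh
# Create an HMAC key pair tied to a service account rather than a person.
gsutil hmac create -p my-gcp-project \
    vertica-loader@my-gcp-project.iam.gserviceaccount.com
```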
Network performance in GCP can be thought of in two distinct ways. There's node-to-node traffic, and then there's egress traffic. Node-to-node performance in GCP is really good within the zone, with typical traffic between nodes falling in the 10-15 gigabits per second range. This might vary a little from zone to zone and region to region, but usually it's only limited by the existing traffic where the VMs exist, so kind of a noisy neighbor effect. Egress traffic from a VM, however, is subject to throughput caps, and these are based on the size of the VM. So the speed is set by the number of vCPUs in the VM at two gigabits per second per vCPU, and tops out at 32 gigabits per second. So the larger the VM, the more vCPUs you get, the larger the cap. So some things to consider in the networking space for your Vertica cluster: pick a region that's physically close to you, even if you're connecting to the GCP network from a corporate LAN as opposed to the internet. The further the packets have to travel, the longer it's going to take. Also, GCP, like most Clouds, doesn't support UDP broadcast traffic on their virtual network, so you do have to use the point-to-point flag for spread when you're creating your cluster. And since the network cap on VMs is set at 32 gigabits per second per VM, maximize your network egress throughput and don't use VMs that are smaller than 16 vCPUs for your Vertica nodes. And that gets us to the one question I get asked the most often: how do I get my data into and out of the Cloud? Well, GCP offers many different methods to support different speeds and different price points for data ingress and egress. There's the obvious one, right, across the internet, either directly to the VMs or into the storage bucket. Or you can, you know, light up a VPN tunnel to encrypt all that traffic. But additionally, GCP offers direct network interconnects from your corporate network. These get provided either by Google or by a partner, and they vary in speed. They also offer things called direct or carrier peering, which is connecting the edges of the networks between your network and GCP, and you can use a CDN interconnect, which creates, I believe, an on-demand connection from your network to the GCP network, provided by a large host of CDN service providers. So GCP offers a lot of ways to move your data around, in and out of the GCP Cloud. It's really a matter of what price point works for you, and what technology your corporation is looking to use. So we've talked about AWS, we've talked about GCP, and that really only leaves one more Cloud. So last, but by no means least, there's the Microsoft Azure environment. Holding on strong to the number two place among the major Cloud providers, Azure offers a very robust Cloud offering that's attractive to customers that already consume services from Microsoft. But what you need to keep in mind is that the underlying foundation of their Cloud is based on the Microsoft Windows products. And this makes their Cloud offering a little bit different in the services and offerings that they have. The good news here, though, is that Microsoft has done a very good job of getting their virtualization drivers baked into the modern kernels of most Linux operating systems, making running Linux-based VMs in Azure fairly seamless. So here's the slide again, but now you're going to notice some slight differences. First off, in Azure we only support Enterprise mode. This is because the Azure storage product is very different from Google Cloud storage and S3 on AWS. So while we're working on getting this supported, and we're starting to focus on this, we're just not there yet. This means that since we're only supporting Enterprise mode in Azure, getting the local disk performance right is one of the keys to success of running Vertica here, with the other major key being making sure that you're getting the appropriate networking speeds. Overall, Azure's a really good platform for Vertica, and its performance and pricing are very much on par with AWS. But keep in mind that the newer versions of the Linux operating systems like RHEL and CentOS run much better here than the older versions. 
Okay, so first things first again, just like GCP, in Azure VMs are running on top of hardware that has hyperthreading enabled. And because of the way Hyper-V, Azure's virtualization engine works, you can actually see this, right? So if you look down into the CPU information of the VM, you'll actually see how it groups the vCPUs by core and by thread. Azure offers a lot of VM types, and is adding new ones all the time. But for us, we see three VM types that make the most sense for Vertica. For customers that are looking to run production workloads in Azure, the Es_v3 and the Ls_v2 series are the two main recommendations. While they differ slightly in the CPU to memory ratio and the I/O throughput, the Es_v3 series is probably the best recommendation for a generalized Vertica node, with the Ls_v2 series being recommended for workloads with higher I/O requirements. If you're just looking to deploy a sandbox environment, the Ds_v3 series is a very suitable choice that really can reduce your overall Cloud spend. VM storage in Azure is provided by a grouping of four different types of disks, all offering different levels of performance. Introduced at the end of last year, the Ultra Disk option is the highest-performing disk type for VMs in Azure. It was designed for database workloads where high throughput and low latency is very desirable. However, the Ultra Disk option is not available in all regions yet, although that's been changing slowly since their launch. The Premium SSD option, which has been around for a while and is widely available, can also offer really nice performance, especially higher capacities. And just like other Cloud providers, the I/O throughput you get on VMs is dictated not only by the size of the disk, but also by the size of the VM and its type. So a good rule of thumb here, VM types with an S will have a much better throughput rate than ones that don't, meaning, and the larger VMs will have, you know, higher I/O throughput than the smaller ones. You can expand the VM disk throughput by using multiple disks in Azure and using a software RAID. This overcomes limitations of single disk performance, but keep in mind, you're now using CPU cycles to maintain that raid, so it is a bit of a trade-off. The other nice thing in Azure is that all their managed disks are encrypted by default on the server side, so there's really nothing you need to do here to enable that. And of course I mentioned this earlier. There is no native access to Azure storage yet, but it is something we're working on. We have seen folks using third-party applications like MinIO to access Azure's storage as an S3 bucket. So it might be something you want to keep in mind and maybe even test out for yourself. Networking in Azure comes in two different flavors, standard and accelerated. In standard networking, the entire network stack is abstracted and virtualized. So this works really well, however, there are performance limitations. Standard networking tends to top out around four gigabits per second. Accelerated networking in Azure is based on single root I/O virtualization of the Mellanox adapter. This is basically the VM talking directly to the physical network card in the host hardware, and it can produce network speeds up to 20 gigabits per second, so much, much faster. Keep in mind, though, that not all VM types and operating systems actually support accelerated networking, and you know, just like disk throughput, network throughput is based on VM type and size. 
So what do you need to think about for networking in the Azure space? Again, stay close to home. Pick regions that are geographically close to your location. Yes, the backbones between the regions are very, very fast, but the more hops your packets have to make, the longer it takes. Azure offers two types of groupings of their VMs, availability sets and availability zones. Availability zones offer good redundancy across multiple zones, but this actually increases the node-to-node latency, so we recommend you avoid this. Availability sets, on the other hand, keep all your VMs grouped together within a single zone, but makes sure that no two VMs are running on the same host hardware, for redundancy. And just like the other Clouds, UDP broadcast is not supported. So you have to use the point-to-point flag when you're creating your database to ensure that the spread works properly. Spread time out, okay, this is a good one. So recently, Microsoft has started monthly rolling updates of their environment. What this looks like is VMs running on top of hardware that's receiving an update can be paused. And this becomes problematic when the pausing of the VM exceeds eight seconds, as the unpaused members of the cluster now think the paused VM is down. So consider adjusting the spread time out for your clusters in Azure to 30 seconds, and this will help avoid a little of that. If you're deploying a large cluster in Azure, more than 20 nodes, use large closer mode, as point-to-point for spread doesn't really scale well with a lot of Vertica nodes. And finally, you know, pick VM types and operating systems that support accelerated networking. The difference in the node-to-node speeds can be very dramatic. So how do we move data around in Azure, right? So Microsoft views data egress a little differently than other Clouds, as it classifies any data being transmitted by a VM as egress. However, it only bills for data egress that actually leaves the Azure environment. Egress speed limits in Azure are based entirely on the VM type and size, and then they're limited by your connection to them. While not offering as many pathways to access their Cloud as GCP, Azure does offer a direct network-to-network connection called ExpressRoute. Offered by a large group of third-party processors, partners, the ExpressRoute offers multiple tiers of performance that are based on a flat charge for inbound data and a metered charge for outbound data. And of course you can still access these via the internet, and securely through a VPN gateway. So on behalf of Jeff, Sumeet, and myself, I'd like to thank you for listening to our presentation today, and we're now ready for Q&A.
SUMMARY :
Also as a reminder that you can maximize your screen So the best, the best thing you can do and the larger VMs will have, you know,
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Chris | PERSON | 0.99+ |
Sumeet | PERSON | 0.99+ |
Jeff Healey | PERSON | 0.99+ |
Chris Daly | PERSON | 0.99+ |
Jeff | PERSON | 0.99+ |
Christopher Daly | PERSON | 0.99+ |
Sumeet Keswani | PERSON | 0.99+ |
ORGANIZATION | 0.99+ | |
Vertica | ORGANIZATION | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
10 Gbps | QUANTITY | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
forum.vertica.com | OTHER | 0.99+ |
30 seconds | QUANTITY | 0.99+ |
Amazon Web Services | ORGANIZATION | 0.99+ |
RHEL | TITLE | 0.99+ |
Today | DATE | 0.99+ |
32 cores | QUANTITY | 0.99+ |
CentOS | TITLE | 0.99+ |
more than 20 nodes | QUANTITY | 0.99+ |
32 vCPUs | QUANTITY | 0.99+ |
two platforms | QUANTITY | 0.99+ |
eight seconds | QUANTITY | 0.99+ |
Vertica | TITLE | 0.99+ |
10 terabytes | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
both | QUANTITY | 0.99+ |
20 nodes | QUANTITY | 0.99+ |
two terabytes | QUANTITY | 0.99+ |
each application | QUANTITY | 0.99+ |
S3 | TITLE | 0.99+ |
two types | QUANTITY | 0.99+ |
Linux | TITLE | 0.99+ |
two subclusters | QUANTITY | 0.98+ |
first entry | QUANTITY | 0.98+ |
one question | QUANTITY | 0.98+ |
four | QUANTITY | 0.98+ |
Azure | TITLE | 0.98+ |
Vertica 10 | TITLE | 0.98+ |
4/2 | DATE | 0.98+ |
First | QUANTITY | 0.98+ |
16 vCPU | QUANTITY | 0.98+ |
two forms | QUANTITY | 0.97+ |
MinIO | TITLE | 0.97+ |
single employee | QUANTITY | 0.97+ |
first | QUANTITY | 0.97+ |
this week | DATE | 0.96+ |
UNLIST TILL 4/2 - Model Management and Data Preparation
>> Sue: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled Machine Learning with Vertica, Data Preparation and Model Management. My name is Sue LeClaire, Director of Managing at Vertica and I'll be your host for this webinar. Joining me is Waqas Dhillon. He's part of the Vertica Product Management Team at Vertica. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternately, you can visit Vertica Forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides, and yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Waqas, over to you. >> Waqas: Thank you, Sue. Hi, everyone. My name is Waqas Dhillon and I'm a Product Manager here at Vertica. So today, we're going to go through data preparation and model management in Vertica, and the session would essentially be starting with some introduction and going through some of the machine learning configurations and you're doing machine learning at scale. After that, we have two media sections here. The first one is on data preparation, and so we'd go through data preparation is, what are the Vertica functions for data exploration and data preparation, and then share an example with you. Similarly, in the second part of this talk we'll go through different export models using PMML and how that works with Vertica, and we'll share examples from that, as well. So yeah, let's dive right in. So, Vertica essentially is an open architecture with a rich ecosystem. So, you have a lot of options for data transformation and ingesting data from different tools, and then you also have options for connecting through ODBC, JDBC, and some other connectors to BI and visualization tools. There's a lot of them that Vertica connects to, and in the middle sits Vertica, which you can have on external tables or you can have in place analytics on R, on cloud, or on prem, so that choice is yours, but essentially what it does is it offers you a lot of options for performing your data and analytics on scale, and within that, data analytics machine learning is also a core component, and then you have a lot of options and functions for that. Now, machine learning in Vertica is actually built on top of the architecture that distributed data analytics offers, so it offers a lot of those capabilities and builds on top of them, so you eliminate the overhead data transfer when you're working with Vertica machine learning, you keep your data secure, storing and managing the models really easy and much more efficient. 
You can serve a lot of concurrent users all at the same time, and then it's really scalable and avoids maintenance cost of a separate system, so essentially a lot of benefits here, but one important thing to mention here is that all the algorithms that you see, whether they're analytics functions, advanced analytics functions, or machine learning functions, they are distributed not just across the cluster on different nodes. So, each node gets a distributed work load. On each node, too, there might be multiple tracks and multiple processors that are running with each of these functions. So, highly distributed solution and one of its kind in this space. So, when we talk about Vertica machine learning, it essentially covers all machine learning process and we see it as something starting with data ingestion and doing data analysis and understanding, going through the steps of data preparation, modeling, evaluation, and finally deployment, as well. So, when you're using with Vertica, you're using Vertica for machine learning, it takes care of all these steps and you can do all of that inside of the Vertica database, but when we look at the three main pillars that Vertica machine learning aims to build on, the first one is to have Vertica as a platform for high performance machine learning. We have a lot of functions for data exploration and preparation and we'll go through some of them here. We have distributed in-database algorithms for model training and prediction, we have scalable functions for model evaluation, and finally we have distributed scoring functions, as well. Doing all of the stuff in the database, that's a really good thing, but we don't want it isolated in this space. We understand that a lot of our customers, our users, they like to work with other tools and work with Vertica, as well. So, they might use Vertica for data prep, another two for model training, or use Vertica for model training and take those nodes out to other tools and do prediction there. So, integration is really important part of our overall offering. So, it's a pretty flexible system. We have been offering UdX in four languages, a lot of people find there over the past few years, but the new capability of importing PMML models for in-database scoring and exporting Vertica native-models, for external scoring it's something that we have recently added, and another talk would actually go through the TensorFlow integrations, a really exciting and important milestone that we have where you can bring TensorFlow models into Vertica for in-database scoring. For this talk, we'll focus on data exploration and preparation, importing PMML, and exporting PMML models, and finally, since Vertica is not just a cue engine, but also a data store, we have a lot of really good capability for model storage and management, as well. So, yeah. Let's dive into the first part on machine learning at scale. So, when we say machine learning at scale we're actually having a few really important considerations and they have their own implications. The first one is that we want to have speed, but also want it to come at a reasonable cost. So, it's really important for us to pick the right scaling architecture. Secondly, it's not easy to move big data around. 
It might be easy to do that on a smaller data set, on an Excel sheet, or something of the like, but once you're talking about big data and data analytics at really big scale, it's really not easy to move that data around from one tool to another, so what you'd want to do is bring models to the data instead of having to move this data to the tools, and the third thing here is that some sub-sampling it can actually compromise your accuracy, and a lot of tools that are out there they still force you to take smaller samples of your data because they can only handle so much data, but that can impact your accuracy and the need here is that you should be able to work with all of your data. We'll just go through each of these really quickly. So, the first factor here is scalability. Now, if you want to scale your architecture, you have two main options. The first is vertical scaling. Let's say you have a machine, a server, essentially, and you can keep on adding resources, like RAM and CPU and keep increasing the performance as well as the capacity of that system, but there's a limit to what you can do here, and the limit, you can hit that in terms of cost, as well as in terms of technology. Beyond a certain point, you will not be able to scale more. So, the right solution to follow here is actually horizontal scaling in which you can keep on adding more instances to have more computing power and more capacity. So, essentially what you get with this architecture is a super computer, which stitches together several nodes and the workload is distributed on each of those nodes for massive develop processing and really fast speeds, as well. The second aspect of having big data and the difficulty around moving it around is actually can be clarified with this example. So, what usually happens is, and this is a simplified version, you have a lot of applications and tools for which you might be collecting the data, and this data then goes into an analytics database. That database then in turn might be connected to some VI tools, dashboard and applications, and some ad-hoc queries being done on the database. Then, you want to do machine learning in this architecture. What usually happens is that you have your machine learning tools and the data that is coming in to the analytics database is actually being exported out of the machine learning tools. You're training your models there, and afterwards, when you have new incoming data, that data again goes out to the machine learning tools for prediction. With those results that you get from those tools usually ended up back in the distributed database because you want to put it on dashboard or you want to power up some applications with that. So, there's essentially a lot of data overhead that's involved here. There are cons with that, including data governance, data movement, and other complications that you need to resolve here. One of the possible solutions to overcome that difficulty is that you have machine learning as part of the distributed analytical database, as well, so you get the benefits of having it applied on all of the data that's inside of the database and not having to care about all of the data movement there, but if there are some use cases where it still makes sense to at least train the models outside, that's where you can do your data preparation outside of the database, and then take the data out, the prepared data, build your model, and then bring the model back to the analytics database. In this case, we'll talk about Vertica. 
So, the model would be archived, hosted by Vertica, and then you can keep on applying predictions on the new data that's incoming into the database. So, the third consideration here for machine learning on scale is sampling versus full data set. As I mentioned, a lot of tools they cannot handle big data and you are forced to sub-sample, but what happens here, as you can see in the figure on the left most, figure A, is that if you have a single data point, essentially any model can explain that, but if you have more data points, as in figure B, there would be a smaller number of models that could be able to explain that, and in figure C, even more data points, lesser number of models explained, but lesser also means here that these models would probably be more accurate, and the objective for building machine learning models is mostly to have prediction capability and generalization capability, essentially, on unseen data, so if you build a model that's accurate on one data point, it could not have very good generalization capabilities. The conventional wisdom with machine learning is that the more data points that you have for learning the better and more accurate models that you'll get out of your machine learning models. So, you need to pick a tool which can handle all of your data and does not force you to sub-sample that, and doing that, even a simpler model might be much better than a more complex model here. So, yeah. Let's go to data exploration and data preparation part. Vertica's a really powerful tool and it offers a lot of scalability in this space, and as I mentioned, will support the whole process. You can define the problem and you can gather your data and construct your data set inside Vertica, and then consider it a prepared training modeling deployment and managing the model, but this is a really critical step in the overall machine learning process. Some estimate it takes between 60 to 80% of the overall effort of a machine learning process. So, a lot of functions here. You can use part of Vertica, do data exploration, de-duplication, outlier detection, balancing, normalization, and potentially a lot more. You can actually go to our Vertica documentation and find them there. Within Vertica we divide them into two parts. Within data prep, one is exploration functions, the second is transformation functions. Within exploration, you have a rich set functions that you can use in DB, and then if you want to build your own you can use the UDX to do that. Similarly, for transformation there's a lot of functions around time series, pattern matching, outlier detection that you can use to transform that data, and it's just a snapshot of some of those functions that are available in Vertica right now. And again, the good thing about these functions is not just their presence in the database. The good thing is actually their ability to scale on really, really large data set and be able to compute those results for you on that data set in an acceptable amount of time, which makes your machine learning processes really critical. So, let's go to an example and see how we can use some of these functions. As I mentioned, there's a whole lot of them and we'll not be able to go through all of them, but just for our understanding we can go through some of them and see how they work. So, we have here a sample data set of network flows. It's a similar attack from some source nodes, and then there are some victim nodes on which these attacks are happening. 
So yeah, let's just look at the data here real quick. We'll load the data, we'll browse the data, compute some statistics around it, ask some questions, make plots, and then clean the data. The objective here is not to make a prediction, per se, which is what we mostly do in machine learning algorithms, but to just go through the data prep process and see how easy it is to do that with Vertica and what kind of options might be there to help you through that process. So, the first step is loading the data. Since in this case we know the structure of the data, so we create a table and create different column names and data types, but let's say you have a data set for which you do not already know the structure, there's a really cool feature in Vertica called flex tables and you can use that to initially import the data into the database and then go through all of the variables and then assign them variable types. You can also use that if your data is dynamic and it's changing, to board the data first and then create these definitions. So once we've done that, we load the data into the database. It's for one week of data out of the whole data set right now, but once you've done that we'd like to look at the flows just to look at the data, you know how it looks, and once we do select star from flows and just have a limit here, we see that there's already some data duplication, and by duplication I mean rows which have the exact same data for each of the columns. So, as part of the cleaning process, the first thing we'd want to do is probably to remove that duplication. So, we create a table with distinct flows and you can see here we have about a million flows here which are unique. So, moving on. The next step we want to do here, this is essentially time state data and these times are in days of the week, so we want to look at the trends of this data. So, the network traffic that's there, you can call it flows. So, based on hours of the day how does the traffic move and how does it differ from one day to another? So, it's part of an exploration process. There might be a lot of further exploration that you want to do, but we can start with this one and see how it goes, and you can see in the graph here that we have seven days of data, and the weekend traffic, which is in pink and purple here seems a little different from the rest of the days. Pretty close to each other, but yeah, definitely something we can look into and see if there's some real difference and if there's something we want to explore further here, but the thing is that this is just data for one week, as I mentioned. What if we load data for 70 days? You'd have a longer graph probably, but a lot of lines and would not really be able to make sense out of that data. It would be a really crowded plot for that, so we have to come up with a better way to be able to explore that and we'll come back to that in a little bit. So, what are some other things that we can do? We can get some statistics, we can take one sample flow and look at some of the values here. We see that the forward column here and ToS column here, they have zero values, and when we explore further we see that there's a lot of values here or records here for which these columns are essentially zero, so probably not really helpful for our use case. Then, we can look at the flow end. 
So, flow end is the end time when the last packet in a flow was sent and you can do a select min flow and max flow to see the data when it started and when it ended, and you can see it's about one week's of data for the first til eighth. Now, we also want to look at the data whether it's balanced or not because balanced data is really important for a lot of classification use cases that we want to try with this and you can see that source address, destination address, source port, and destination port, and you see it's highly in balanced data and so is versus destination address space, so probably something that we need to do, really powerful Vertica balancing functions that you can use within, and just sampling, over-sampling, or hybrid sampling here and that can be really useful here. Another thing we can look at is there's so many statistics of these columns, so off the unique flows table that we created we just use the summarize num call function in Vertica and it gives us a lot of really cool (mumbling) and percentile information on that. Now, if we look at the duration, which is the last record here, we can see that the mean is about 4.6 seconds, but when we look at the percentile information, we see that the median is about 0.27. So, there's a lot of short flows that have duration less than 0.27 seconds. Yes, there would be more and they'd probably bring the mean to the 4.6 value, but then the number of short flows is probably pretty high. We can ask some other questions from the data about the features. We can look at the protocols here and look at the count. So, we see that most of the traffic that we have is for TCP and UDP, which is sort of expected for a data set like this, and then we want to look at what are the most popular network services here? So again, simply queue here, select destination port count, add in the information here. We get the destination port and count for each. So, we can see that most of the traffic here is web traffic, HTTP and HTTPS, followed by domain name resolution. So, let's explore some more. We can look at the label distributions. We see that the labels that are given with that because this is essentially data for which we already know whether something was an anomaly or not, record was anomaly or not, and creating our algorithm based on it. So, we see that there's this background label, a lot of records there, and then anomaly spam seems to be really high. There are anomaly UDB scans and SSS scams, as well. So, another question we can ask is among the SMTP flows, how labels are distributed, and we can say that anomaly spam is highest, and then comes the background spam. So, can we say out of this that SMTP flows, they are spams, and maybe we can build a model that actually answers that question for us? That can be one machine learning model that you can build out of this data set. Again, we can also verify the destination port of flows that were labeled as spam. So, you can expect port 25 for SMTP service here, and we can see that SMTP with destination port 25, you have a lot of counts here, but there are some other destination ports for which the count is really low, and essentially, when we're doing and analysis at this scale, these data points might not really be needed. So, as part of the data prep slash data cleaning we might want to get rid of these records here. So now, what we can do is going back to the graph that I showed earlier, we can try and plot the daily trends by aggregating them. 
Again, we take the unique flow and convert into a flow count and to a manageable number that we can then feed into one of the algorithms. Now, PCA principle component analysis, it's a really powerful algorithm in Vertica, and what it essentially does is a lot of times when you have a high number of columns, which might be highly (mumbling) with each other, you can feed them into the PCA algorithm and it will get for you a list of principle components which would be linearly independent from each other. Now, each of these components would explain a certain extent of the variants of the overall data set that you have. So, you can see here component one explains about 73.9% of the variance, and component two explains about 16% of the variance. So, if you combine those two components alone, that would get you for around 90% of the variance. Now, you can use PCA for a lot of different purposes, but in this specific example, we want to see if we combine all the data points that we have together and we do that by day of the week, what sort of information can we get out of it? Is there any insight that this provides? Because once you have two data points, it's really easy to plot them. So, we just apply the PCA, we first (mumbling) it, and then reapply on our data set, and this is the graph we get as a result. Now, you can see component one is on the X axis here, component two on the y axis, and each of these points represents a day of the week. Now, with just two points it's easy to plot that and compare this to the graph that we saw earlier, which had a lot of lines and the more weeks that we added or the more days that we added, the more lines that we'd have versus this graph in which you can clearly tell that five days traffic starting from Monday til Friday, that's closely clustered together, so probably pretty similar to each other, and then Saturday traffic is pretty much apart from all of these days and it's also further away from Sunday. So, these two days of traffic is different from other days of traffic and we can always dive deeper into this and look at exactly what's happening here and see how this traffic is actually different, but with just a few functions and some pretty simple SQL queries, we were already able to get a pretty good insight from the data set that we had. Now, let's move on to our next part of this talk on importing and exporting PMML models to and from Vertica. So, current common practice is when you're putting your machine learning models into production, you'd have a dev or test environment, and in that you might be using a lot of different tools, Scikit and Spark, R, and once you want to deploy these models into production, you'd put them into containers and there would be a pool of containers in the production environment which would be talking to your database that could be your analytical database, and all of the new data that's incoming would be coming into the database itself. So, as I mentioned in one of the slides earlier, there is a lot of data transfer that's happening between that pool of containers hosting your machine learning training models versus the database which you'd be getting data for scoring and then sending the scores back to the database. So, why would you really need to transfer your models? The thing is that no machine learning platform provides everything. 
There might be some really cool algorithms that might compromise, but then Spark might have its own benefits in terms of some additional algorithms or some other stuff that you're looking at and that's the reason why a lot of these tools might be used in the same company at the same time, and then there might be some functional considerations, as well. You might want to isolate your data between data science team and your production environment, and you might want to score your pre-trained models on some S nodes here. You cannot host probably a big solution, so there is a whole lot of use cases where model movement or model transfer from one tool to another makes sense. Now, one of the common methods for transferring models from one tool to another is the PMML standard. It's an XML-based model exchange format, sort of a standard way to define statistical and data mining models, and helps you share models between the different applications that are PMML compliant. Really popular tool, and that's the tool of choice that we have for moving models to and from Vertica. Now, with this model management, this model movement capability, there's a lot of model management capabilities that Vertica offers. So, models are essentially first class citizens of Vertica. What that means is that each model is associated with a DB schema, so the user that initially creates a model, that's the owner of it, but he can transfer the ownership to other users, he can work with the ownership rights in any way that you would work with any other relation in a database would be. So, the same commands that you use for granting access to a model, changing its owner, changing its name, or dropping it, you can use similar commands for more of this one. There are a lot of functions for exploring the contents of models and that really helps in putting these models into production. The metadata of these models is also available for model management and governance, and finally, the import/export part enables you to apply all of these operations to the model that you have imported or you might want to export while they're in the database, and I think it would be nice to actually go through and example to showcase some of these capabilities in our model management, including the PMML model import and export. So, the workflow for export would be that we trained some data, we'll train a logistic regression model, and we'll save it as an in-DB Vertica model. Then, we'll explore the summary and attributes of the model, look at what's inside the model, what the training parameters are, concoctions and stuff, and then we can export the model as PMML and an external tool can import that model from PMML. And similarly, we'll go through and example for export. We'll have an external PMML model trained outside of Vertica, we'll import that PMML model and from there on, essentially, we'll treat it as an in-DB PMML model. We'll explore the summary and attribute of the model in much the same way as in in-DB model. We'll apply the model for in-DB scoring and get the prediction results, and finally, we'll bring some test data. We'll use that on test data for which the scoring needs to be done. So first, we want to create a connection with the database. In this case, we are using a Python Jupyter Notebook. 
We have the Vertica Python connector here that you can use, really powerful connector, allows you to do a lot of cool stuff to the database using the Jupyter front end, but essentially, you can use any other SQL front end tool or for that matter, any other Python ID which lets you connect to the database. So, exporting model. First, we'll create an logistic regression model here. Select logistic regression, we'll give it a model name, then put relation, which might be a table, time table, or review. There's response column and the predictor columns. So, we get a logistic regression model that we built. Now, we look at the models table and see that the model has been created. This is a table in Vertica that contains a list of all the models that are there in the database. So, we can see here that my model that we just created, it's created with Vertica models as a category, model type is logistic regression, and we have some other metadata around this model, as well. So now, we can look at some of the summary statistics of the model. We can look at the details. So, it gives us the predictor, coefficients, standard error, Z value, and P value. We can look at the regularization parameters. We didn't use any, so that would be a value of one, but if you had used, it would show it up here, the call string and also additional information regarding iteration count, rejected row count, and accepted row count. Now, we can also look at the list of attributes of the model. So, select get model attribute using parameter, model name is myModel. So, for this particular model that we just created, it would give us the name of all the attributes that are there. Similarly, you can look at the coefficients of the model in a column format. So, using parameter name myModel, and in this case we add attribute name equals details because we want all the details for that particular model and we get the predictor name, coefficient, standard error, Z value, and P value here. So now, what we can do is we can export this model. So, we used the select export models and we give it a path to where we want the model to be exported to. We give it the name of the model that needs to be exported because essentially might have a lot of models that you have created, and you give it the category here, which in our example is PMML, and you get a status message here that export model has been successful. So now, let's move onto the importing models example. In much the same way that we created a model in Vertica and exported it out, you might want to create a model outside of Vertica in another tool and then bring that to Vertica for scoring because Vertica contains all of the hard data and it might make sense to host that model in Vertica because scoring happens a lot more quickly than model training. So, in this particular case we do a select import models and we are importing a logistic regression model that was created in Spark. The category here again is PMML. So, we get the status message that the import was successful. Now, let's look at the attributes, look at the models table, and see that the model is really present there. Now previously when we ran this query because we had only myModel there, so that was the only entry you saw, but now once this model is imported you can see that as line item number two here, Spark logistic regression, it's a public schema. 
The category here however is different because it's not an individuated model, rather an imported model, so you get PMML here and then other metadata regarding the model, as well. Now, let's do some of the same operations that we did with the in-DB model so we can look at the summary of the imported PMML model. So, you can see the function name, data fields, predictors, and some additional information here. Moving on. Let's look at the attributes of the PMML model. Select your model attribute. Essentially the same query that we applied earlier, but the difference here is only the model name. So, you get the attribute names, attribute field, and number of rows. We can also look at the coefficient of the PMML model, name, exponent, and coefficient here. So yeah, pretty much similar to what you can do with an in-DB model. You can also perform all operations on an important model and one additional thing we'd want to do here is to use this important model for our prediction. So in this case, we'll data do a select predict PMML and give it some values using parameters model name, and logistic regression, and match by position, it's a really cool feature. This is true in this case. Sector, true. So, if you have model being imported from another platform in which, let's say you have 50 columns, now the names of the columns in that environment in which you're training the model might be slightly different than the names of the column that you have set up for Vertica, but as long as the order is the same, Vertica can actually match those columns by position and you don't need to have the exact same names for those columns. So in this case, we have set that to true and we see that predict PMML gives us a status of one. Now, using the important model, in this case we had a certain value that we had given it, but you can also use it on a table, as well. So in that case, you also get the prediction here and you can look at the (mumbling) metrics, see how well you did. Now, just sort of wrapping this up, it's really important to know the important distinction between using your models in any tool, any single node solution tool that you might already be using, like Python or R versus Vertica. What happens is, let's say you build a model in Python. It might be a single node solution. Now, after building that model, let's say you want to do prediction on really large amounts of data and you don't want to go through the overhead of keeping to move that data out of the database to do prediction every time you want to do it. So, what you can do is you can import that model into Vertica, but what Vertica does differently than Python is that the PMML model would actually be distributed across each mode in the cluster, so it would be applying on the data segments in each of those nodes and they might be different threads running for that prediction. So, the speed that you get here from all prediction would be much, much faster. Similarly, once you build a model for machine learning in Vertica, the objective mostly is that you want to use up all of your data and build a model that's accurate and is not just using a sample of the data, but using all the data that's available to it, essentially. So, you can build that model. The model building process would again go through the same technique. It would actually be distributed across all nodes in a cluster, and it would be using up all the threads and processes available to it within those nodes. 
So, really fast model training, but let's say you wanted to deploy it on an edge node and maybe do prediction closer to where the data was being generated, so you can export that model in a PMML format and all deploy it on the edge node. So, it's really helpful for a lot of use cases. And just some rising takeaways from our discussion today. So, Vertica's a really powerful tool for machine learning, for data preparation, model training, prediction, and deployment. You might want to use Vertica for all of these steps or some of these steps. Either way, Vertica supports both approaches. In the upcoming releases, we are planning to have more import and export capability through PMML models. Initially, we're supporting kmeans, linear, and logistic regression, but we keep on adding more algorithms and the plan is to actually move to supporting custom models. If you want to do that with the upcoming release, our TensorFlow indication is always there which you can use, but with PMML, this is the starting point for us and we keep on improving that. Vertica model can be exported in PMML format for scoring on other platforms, and similarly, models that get build in other tools can be imported for in-DB machine learning and in-DB scoring within Vertica. There are a lot of critical model management tools that are provided in Vertica and there are a lot of them on the roadmap, as well, which would keep on developing. Many ML functions and algorithms, they're already part of the in-DB library and we keep on adding to that, as well. So, thank you so much for joining the discussion today and if you have any questions we'd love to take them now. Back to you, Sue.
SUMMARY :
and thank you for joining us today and the limit, you can hit that in terms of cost,
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Vertica | ORGANIZATION | 0.99+ |
Waqas Dhillon | PERSON | 0.99+ |
70 days | QUANTITY | 0.99+ |
Sue LeClaire | PERSON | 0.99+ |
two points | QUANTITY | 0.99+ |
two days | QUANTITY | 0.99+ |
Sue | PERSON | 0.99+ |
seven days | QUANTITY | 0.99+ |
one week | QUANTITY | 0.99+ |
five days | QUANTITY | 0.99+ |
Sunday | DATE | 0.99+ |
two parts | QUANTITY | 0.99+ |
second part | QUANTITY | 0.99+ |
Saturday | DATE | 0.99+ |
Excel | TITLE | 0.99+ |
50 columns | QUANTITY | 0.99+ |
4/2 | DATE | 0.99+ |
First | QUANTITY | 0.99+ |
Python | TITLE | 0.99+ |
each | QUANTITY | 0.99+ |
each node | QUANTITY | 0.99+ |
Today | DATE | 0.99+ |
first factor | QUANTITY | 0.99+ |
less than 0.27 seconds | QUANTITY | 0.99+ |
Vertica | TITLE | 0.99+ |
first | QUANTITY | 0.99+ |
Friday | DATE | 0.99+ |
Monday | DATE | 0.99+ |
second aspect | QUANTITY | 0.99+ |
eighth | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
one day | QUANTITY | 0.99+ |
two data points | QUANTITY | 0.99+ |
third consideration | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
first step | QUANTITY | 0.98+ |
first part | QUANTITY | 0.98+ |
first one | QUANTITY | 0.98+ |
zero values | QUANTITY | 0.98+ |
second | QUANTITY | 0.98+ |
both approaches | QUANTITY | 0.98+ |
about 4.6 seconds | QUANTITY | 0.98+ |
third thing | QUANTITY | 0.98+ |
Secondly | QUANTITY | 0.98+ |
one tool | QUANTITY | 0.98+ |
zero | QUANTITY | 0.98+ |
each mode | QUANTITY | 0.98+ |
One | QUANTITY | 0.97+ |
figure B | OTHER | 0.97+ |
figure C | OTHER | 0.97+ |
4.6 value | QUANTITY | 0.97+ |
R | TITLE | 0.97+ |
Machine Learning with Vertica, Data Preparation and Model Management | TITLE | 0.97+ |
Waqas | PERSON | 0.97+ |
each model | QUANTITY | 0.97+ |
two main options | QUANTITY | 0.97+ |
80% | QUANTITY | 0.97+ |
two components | QUANTITY | 0.96+ |
around 90% | QUANTITY | 0.96+ |
two | QUANTITY | 0.96+ |
later this week | DATE | 0.95+ |
UNLIST TILL 4/2 The Data-Driven Prognosis
>> Narrator: Hi, everyone, thanks for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled toward Zero Unplanned Downtime of Medical Imaging Systems using Big Data. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Mauro Barbieri, lead architect of analytics at Philips. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation. And we'll answer as many questions as we're able to during that time. Any questions that we don't get to we'll do our best to answer them offline. Alternatively, you can also visit the vertical forums to post your question there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And yes, this virtual session is being recorded, and we'll be available to view on demand this week. We'll send you a notification as soon as it's ready. So let's get started. Mauro, over to you. >> Thank you, good day everyone. So medical imaging systems such as MRI scanners, interventional guided therapy machines, CT scanners, the XR system, they need to provide hospitals, optimal clinical performance but also predictable cost of ownership. So clinicians understand the need for maintenance of these devices, but they just want to be non intrusive and scheduled. And whenever there is a problem with the system, the hospital suspects Philips services to resolve it fast and and the first interaction with them. In this presentation you will see how we are using big data to increase the uptime of our medical imaging systems. I'm sure you have heard of the company Phillips. Phillips is a company that was founded in 129 years ago in actually 1891 in Eindhoven in Netherlands, and they started by manufacturing, light bulbs, and other electrical products. The two brothers Gerard and Anton, they took an investment from their father Frederik, and they set up to manufacture and sale light bulbs. And as you may know, a key technology for making light bulbs is, was glass and vacuum. So when you're good at making glass products and vacuum and light bulbs, then there is an easy step to start making radicals like they did but also X ray tubes. So Philips actually entered very early in the market of medical imaging and healthcare technology. And this is what our is our core as a company, and it's also our future. So, healthcare, I mean, we are in a situation now in which everybody recognize the importance of it. And and we see incredible trends in a transition from what we call Volume Based Healthcare to Value Base, where, where the clinical outcomes are driving improvements in the healthcare domain. Where it's not enough to respond to healthcare challenges, but we need to be involved in preventing and maintaining the population wellness and from a situation in which we episodically are in touch with healthcare we need to continuously monitor and continuously take care of populations. And from healthcare facilities and technology available to a few elected and reach countries we want to make health care accessible to everybody throughout the world. And this of course, has poses incredible challenges. 
And this is why we are transforming the Philips to become a healthcare technology leader. So from Philips has been a concern realizing and active in many sectors in many sectors and realizing what kind of technologies we've been focusing on healthcare. And we have been transitioning from creating and selling products to making solutions to addresses ethical challenges. And from selling boxes, to creating long term relationships with our customers. And so, if you have known the Philips brand from from Shavers from, from televisions to light bulbs, you probably now also recognize the involvement of Philips in the healthcare domain, in diagnostic imaging, in ultrasound, in image guided therapy and systems, in digital pathology, non invasive ventilation, as well as patient monitoring intensive care, telemedicine, but also radiology, cardiology and oncology informatics. Philips has become a powerhouse of healthcare technology. To give you an idea of this, these are the numbers for, from 2019 about almost 20 billion sales, 4% comparable sales growth with respect to the previous year and about 10% of the sales are reinvested in R&D. This is also shown in the number of patents rights, last year we filed more than 1000 patents in, in the healthcare domain. And the company is about 80,000 employees active globally in over 100 countries. So, let me focus now on the type of products that are in the scope of this presentation. This is a Philips Magnetic Resonance Imaging Scanner, also called Ingenia 3.0 Tesla is an incredible machine. Apart from being very beautiful as you can see, it's a it's a very powerful technology. It can make high resolution images of the human body without harmful radiation. And it's a, it's a, it's a complex machine. First of all, it's massive, it weights 4.6 thousand kilograms. And it has superconducting magnets cooled with liquid helium at -269 degrees Celsius. And it's actually full of software millions and millions of lines of code. And it's occupied three rooms. What you see in this picture, the examination room, but there is also a technical room which is full of of of equipment of custom hardware, and machinery that is needed to operate this complex device. This is another system, it's an interventional, guided therapy system where the X ray is used during interventions with the patient on the table. You see on the left, what we call C-arm, a robotic arm that moves and can take images of the patient while it's been operated, it's used for cardiology intervention, neurological intervention, cardiovascular intervention. There's a table that moves in very complex ways and it again it occupies two rooms, this room that we see here and but also a room full of cabinets and hardwood and computers. This is another another characteristic of this machine is that it has to operate it as it is used during medical interventions, and so it has to interact with all kind of other equipment. This is another system it's a, it's a, it's a Computer Tomography Scanner Icon which is a unique, it is unique due to its special detection technology. It has an image resolution up to 0.5 millimeters and making thousand by thousand pixel images. And it is also a complex machine. This is a picture of the inside of a compatible device not really an icon, but it has, again three rotating, which waits two and a half turn. So, it's a combination of X ray tube on top, high voltage generators to power the extra tube and in a ray of detectors to create the images. 
And this rotates at 220 right per minutes, making 50 frames per second to make 3D reconstruction of the of the body. So a lot of technology, complex technology and this technology is made for this situation. We make it for clinicians, who are busy saving people lives. And of course, they want optimal clinical performance. They want the best technology to treat the patients. But they also want predictable cost of ownership. They want predictable system operations. They want their clinical schedules not interrupted. So, they understand these machines are complex full of technology. And these machines may have, may require maintenance, may require software update, sometimes may even say they require some parts, horrible parts to be replaced, but they don't want to have it unplanned. They don't want to have unplanned downtime. They would hate send, having to send patients home and to have to reschedule visits. So they understand maintenance. They just want to have a schedule predictable and non intrusive. So already a number of years ago, we started a transition from what we call Reactive Maintenance services of these devices to proactive. So, let me show you what we mean with this. Normally, if a system has an issue system on the field, and traditional reactive workflow would be that, this the customer calls a call center, reports the problem. The company servicing the device would dispatch a field service engineer, the field service engineer would go on site, do troubleshooting, literally smell, listen to noise, watch for lights, for, for blinking LEDs or other unusual issues and would troubleshoot the issue, find the root cause and perhaps decide that the spare part needs to be replaced. He would order a spare part. The part would have to be delivered at the site. Either immediately or the engineer would would need to come back another day when the part is available, perform the repair. That means replacing the parts, do all the needed tests and validations. And finally release the system for clinical use. So as you can see, there is a lot of, there are a lot of steps, and also handover of information from one to between different people, between different organizations even. Would it be better to actually keep monitoring the installed base, keep observing the machine and actually based on the information collected, detect or predict even when an issue is is going to happen? And then instead of reacting to a customer calling, proactively approach the customer scheduling, preventive service, and therefore avoid the problem. So this is actually what we call Corrective Service. And this is what we're being transitioning to using Big Data and Big Data is just one ingredient. In fact, there are more things that are needed. The devices themselves need to be designed for reliability and predictability. If the device is a black box does not communicate to the outside world the status, if it does not transmit data, then of course, it is not possible to observe and therefore, predict issues. This of course requires a remote service infrastructure or an IoT infrastructure as it is called nowadays. The passivity to connect the medical device with a data center in enterprise infrastructure, collect the data and perform the remote troubleshooting and the predictions. 
Also the right processes and the right organization is to be in place, because an organization that is, you know, waiting for the customer to call and then has a number of few service engineers available and a certain amount of spare parts and stock is a different organization from an organization that actually is continuously observing the installed base and is scheduling actions to prevent issues. And in other pillar is knowledge management. So in order to realize predictive models and to have predictive service action, it's important to manage knowledge about failure modes, about maintenance procedures very well to have it standardized and digitalized and available. And last but not least, of course, the predictive models themselves. So we talked about transmitting data from the installed base on the medical device, to an enterprise infrastructure that would analyze the data and generate predictions that's predictive models are exactly the last ingredient that is needed. So this is not something that I'm, you know, I'm telling you for the first time is actually a strategic intent of Philips, where we aim for zero unplanned downtime. And we market it that way. We also is not a secret that we do it by using big data. And, of course, there could be other methods to to achieving the same goal. But we started using big data already now well, quite quite many years ago. And one of the reasons is that our medical devices already are wired to collect lots of data about the functioning. So they collect events, error logs that are sensor connecting sensor data. And to give you an idea, for example, just as an order of magnitudes of size of the data, the one MRI scanner can log more than 1 million events per day, hundreds of thousands of sensor readings and tens of thousands of many other data elements. And so this is truly big data. On the other hand, this data was was actually not designed for predictive maintenance, you have to think a medical device of this type of is, stays in the field for about 10 years. Some a little bit longer, some of it's shorter. So these devices have been designed 10 years ago, and not necessarily during the design, and not all components were designed, were designed with predictive maintenance in mind with IoT, and with the latest technology at that time, you know, progress, will not so forward looking at the time. So the actual the key challenge is taking the data which is already available, which is already logged by the medical devices, integrating it and creating predictive models. And if we dive a little bit more into the research challenges, this is one of the Challenges. How to integrate diverse data sources, especially how to automate the costly process of data provisioning and cleaning? But also, once you have the data, let's say, how to create these models that can predict failures and the degradation of performance of a single medical device? Once you have these models and alerts, another challenge is how to automatically recommend service actions based on the probabilistic information on these possible failures? And once you have the insights even if you can recommend action still recommending an action should be done with the goal of planning, maintenance, for generating value. That means balancing costs and benefits, preventing unplanned downtimes without of course scheduling and unnecessary interventions because every intervention, of course, is a disruption for the clinical schedule. 
And there are many more applications that can be built on top of this, such as the optimal management of spare parts supplies. So how do we approach this problem? Our approach was to collect into one database, Vertica, a large amount of historical data. First of all, historical data coming from the medical devices: event logs, parameter values, system configurations, sensor readings, all the data that we have at our disposal. And in the same database, together with that, records of failures: maintenance records, service work orders, part replacements, contracts, basically the evidence of failures. Once you have data from the medical devices and data about the failures in the same database, it becomes possible to correlate event logs, errors, and sensor readings with records of failures, part replacements, and maintenance operations. We did this with a specific approach. We created integrated teams, and every integrated team had three roles, not necessarily three people; there were actually multiple people. There was at least one business owner from a service organization. The business owner is the person who knows what is relevant, which use cases are worth solving for a particular type of product or a particular market, what generates value and is worthwhile tackling as an organization. We had data scientists; data scientists are the ones who actually manipulate the data. They write the queries, they build the models and the robust statistics, they create visualizations. And last but not least, very important, subject matter experts. Subject matter experts are the people who know the failure modes and the functioning of the medical devices; perhaps they come from the design side, or from service innovation, or even from the field, people who have been servicing the machines in real life for many, many years. They are familiar with the failure modes, but also with the type of data that is logged, the processes, and how the systems actually behave, if you allow me, in the wild, in the field. The combination of these three roles was key. Data scientists alone, statisticians basically, people who can do machine learning, are not very effective on their own, because the data is too complicated and too complex; they would spend a huge amount of time just trying to figure out the data, or perhaps spend their time tackling things that are useless, whereas a subject matter expert knows much more quickly which data points are useful and which phenomena can or cannot be found in the data. So the combination of subject matter experts and data scientists is very powerful, and together, guided by a business owner, we could tackle the most useful use cases first. These teams set to work and developed three things, mainly. First of all, they developed insights into the failure modes: by looking at the data and analyzing information about what happened in the field, they found out exactly how things fail, in a very pragmatic and quantitative way. They also, of course, set out to develop the predictive models with associated alerts and service actions. And a predictive model is not just an alert, not just a flag that turns on like a traffic light; there is much more to it than that.
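The data-integration step described above, putting device logs and failure records side by side in Vertica, can be sketched with a small, hypothetical query. The table names and the seven-day window are assumptions, not the actual Philips schema; the point is that once both data sets live in the same database, labeling historical events with subsequent part replacements becomes a join:

```sql
-- Hypothetical sketch: label device events with any part replacement that
-- followed within seven days on the same device, to build training data.
SELECT e.device_id,
       e.event_time,
       e.error_code,
       r.part_number,
       r.replacement_time
FROM   device_events e
LEFT JOIN part_replacements r
       ON r.device_id = e.device_id
      AND r.replacement_time BETWEEN e.event_time
                                 AND e.event_time + INTERVAL '7 days'
WHERE  e.event_time >= '2019-01-01';
```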
Such an alert has to be interpreted and used by a highly skilled, trained engineer, for example in a call center, who needs to evaluate it and plan a service action. A service action may involve ordering the replacement of an expensive part; it may involve calling up the customer hospital and scheduling a period of downtime to replace a part. So it has, or could have, an impact on the clinical practice. It is therefore important that the alert is coupled with sufficient evidence and information for such a highly skilled, trained engineer to plan the service action efficiently. It is a lot of work in terms of preparing data and visualizations, and making sure that all the information is represented correctly and in a compact form. Additionally, these teams gained insight into the failure modes, and so they can provide input to the R&D organization to improve the products. To summarize this graphically: we took a lot of historical data coming from the medical devices, but also data from relational databases, where the service work orders, the part replacements, and the contract information live. We integrated it and set up the data analytics. At that point we do not have value yet; value only starts appearing when we use the insights of the data analytics, the model, on live data. When we process live data with the model, we can generate alerts; the alerts can be used to plan maintenance, and the planned maintenance, replacing unplanned downtime, is what creates value. To give an idea of the coverage, I cannot show you the details of these predictive models, but this is a picture of some of the components of our medical devices for which we have models, for which we cover the failure modes: hard disks, clinical-grade monitors, X-ray tubes, and so forth, and for MRI machines a lot of custom hardware and other types of amplifiers and electronics. The alerts are then displayed in a dashboard, what we call a remote monitoring dashboard. We have a team of remote monitoring engineers that basically surveys the installed base, looks at this dashboard, and picks up these alerts. An alert, as I said before, is not just one flag; it contains a lot of information about the failure and about the medical device. The remote monitoring engineers pick up these alerts, review them, and create cases for the market organizations to handle. They see an alert coming in and create a case, so that the call center in a particular country can call the customer and make an appointment to schedule a service action, or add a preventive action to the schedule of a field service engineer who is already due to visit that customer, for example. This is a high-level picture of the overall data architecture. At the bottom we have the installed base, formed by all our medical devices that are connected to our Philips remote service network. Data is transmitted in a secure way to our enterprise infrastructure, where we have a so-called data lake, which is basically an archive where we store the data as it comes in from the customers; it is scrubbed and protected.
From there, we have ETL processes, Extract, Transform, and Load, that in parallel analyze this information, parse all these files and all this data, and extract the relevant parameters. The reason is that the data coming from the medical devices is very verbose and in legacy formats, sometimes binary formats and strange legacy structures. So we parse it, we structure it, and we make it usable by the data science teams. The results are stored in a Vertica cluster, in a data warehouse. In the same data warehouse we also store information from other enterprise systems, from all kinds of databases: Microsoft SQL Server, Teradata, SAP, Salesforce applications. So the enterprise IT systems are also connected to Vertica and their data is inserted into Vertica. From Vertica, the data is pulled by our predictive models, which are Python and R scripts that run in our proprietary analytics environment, and from that environment we generate the alerts, which are then used by the remote monitoring application. That is not the only application; that is the case of remote monitoring. We also have applications for reactive remote service: whenever we cannot predict an issue, or we cannot prevent an issue from happening, and we need to react to a customer call, we can still use the data to very quickly troubleshoot the system, find the root cause, and advise the best service action. Additionally, there are reliability dashboards, because all this data can also be used to perform reliability studies and improve the design of the medical devices, and it is used by R&D. And the data can be accessed with all kinds of tools: Vertica gives the flexibility to connect with JDBC, to build dashboards using Power BI or QlikView, or simply to use R and Python directly to perform analytics. A little summary of the size of the data: for the moment we have integrated about 500 terabytes worth of data tables, about 30 trillion data points, and more than eighty different data sources, covering our complete connected installed base, including our customer relationship management system and SAP. We have also integrated data from the factory and from repair shops; this is very useful, because having information from the factory allows us to characterize components and devices when they are new, when they are still unused, so we can model degradation and predict failures much better. We also have many years of historical data and, of course, 24/7 live feeds. To get all this going, we have chosen very simple designs from the very beginning; the first system was developed back in 2015. At that time we went from scratch to production in eight months, and it is also a very stable system. To achieve that, we apply what we call exhaustive error handling. Most people attending this conference probably know that when you are dealing with big data, you face all kinds of corner cases you would think could never happen; just because of the sheer volume of the data, you find all kinds of strange things. And that is what you need to take care of if you want to have a stable platform, a stable data pipeline.
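As one illustration of error handling at the loading stage (not necessarily how the Philips pipeline does it, since their parsing happens upstream in their own ETL), Vertica's COPY statement can quarantine bad rows instead of failing the whole load. The paths and target table here are hypothetical, and the exact option set should be checked against the documentation for your version:

```sql
-- Illustrative only: isolate unparseable rows into a rejects file instead of
-- aborting the whole load, one simple form of "exhaustive error handling".
COPY parsed_device_events
FROM '/data/incoming/device_events_2020_03_31.csv'
DELIMITER ','
REJECTED DATA '/data/rejects/device_events_2020_03_31.rej'
EXCEPTIONS   '/data/rejects/device_events_2020_03_31.exc';
```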
Another characteristic is that we need to handle live data, but we also need to be able to reprocess large historical datasets, because insights into the data are generated over time by the teams using the data. Very often they find not only defects, but also raise change requests for new data to be extracted, to be extracted in a different way, or to be aggregated in a different way. So basically, the platform is continuously crunching data. Also, the components have built-in monitoring capabilities. Transparency builds trust: by showing how the platform behaves, people can trust that they have all the data which is available, and if they don't see the data, or if something is not functioning, they can see why and where the processing has stopped. A very important point is the documentation of data sources: every data point has so-called data provenance fields. That is not only the medical device where it comes from, with all its identifiers, but also from which file, from which moment in time, from which row, and from which byte offset that data point comes. And not only that, but also when the data point was created and by whom, by whom meaning which version of the platform and of the ETL created it. This allows us to identify issues, and when an issue is identified and fixed, it is possible to fix only the subset of the data that is impacted by that issue. Again, this builds trust in the data, which is essential for this type of application. We actually have different environments in our analytics solution. One, which we call the data science environment, is more or less what I have shown so far; it is deployed in our Philips private cloud, but it can also be deployed in a public cloud such as Amazon. It contains the years of historical data and allows interactive data exploration and human queries, so it is a highly variable load. It is used for the training of machine learning algorithms, and its design allows rapid prototyping on large data volumes. The other environment is the so-called production environment, where we actually score the models with live data for the generation of the alerts. This environment does not require years of data, just months, because a model does not necessarily need years of data to make a prediction; some models need just a couple of weeks, or a few months, three months, six months, depending on the type of data and the failure being predicted. It has highly optimized queries, because the applications are stable; they only change when we deploy new models or new versions of the models. It is designed and optimized for low latency, high throughput, and reliability; there is no human intervention and there are no human queries. And of course, there are development and staging environments. Another characteristic of all this work is what we call data-driven service innovation: in all this work, we use data in every step of the process. The first step is business case creation. Some people ask how we managed to unlock the investment to create such a platform and work on it for years; how did we start? Basically, we started with a business case, and for that business case, again, we used data.
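Stepping back to the data provenance fields described above, a minimal sketch of what such columns could look like on a Vertica table is shown below. The talk does not give the exact field names, so these are assumptions:

```sql
-- Hypothetical sketch of a fact table carrying data-provenance columns, so
-- every data point can be traced back to its source and to the ETL version
-- that produced it.
CREATE TABLE sensor_readings (
    device_id        VARCHAR(64),
    reading_time     TIMESTAMP,
    sensor_name      VARCHAR(128),
    sensor_value     FLOAT,
    -- provenance fields
    source_file      VARCHAR(512),   -- which log file the value came from
    source_row       INTEGER,        -- row within that file
    source_byte_off  INTEGER,        -- byte offset within that file
    etl_version      VARCHAR(32),    -- which ETL release created this row
    loaded_at        TIMESTAMP       -- when it was loaded
);
```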
Coming back to the business case: of course, you need to start somewhere, you need to have some data, but basically you can use data to make a quantitative analysis of the current situation, and also to make an estimate, as accurate as possible, of the value creation. With that, you can justify the investment and you can start building. Next to that, data is used to decide where to focus your efforts. In our case, we decided to focus on the use cases with the maximum estimated business impact, where business impact means customer value as well as value for the company. We want to reduce unplanned downtime and give value to our customers, but it would not be sustainable if, in order to create that value, we started replacing parts without any consideration of the cost; it needs to be sustainable. Then we use data to analyze the failure modes, to actually dig into the data and understand how things fail, for visualization, and to do reliability analysis. And of course, data is key to the feature engineering for the development of the predictive models, for training the models, and for validating them with historical data. So data is all over the place. And last but not least, this architecture itself generates new data: about the alerts, about how good the alerts are, how well they can predict failures, how much downtime is being saved, how many issues have been prevented. This is also data that needs to be analyzed; it provides insight into the performance of these models and can be used to improve the models further. And once you have the performance of the models, you can use data to quantify, as much as possible, the value that is created. This is where you go back to the first step, where you created the first business case with estimates: can you actually show that you are creating value? The more you can close this feedback loop and quantify it, the better it is for having more and more impact. Among the key elements needed to realize this, I want to mention data documentation, a practice we started already six years ago that has proven to be very valuable. We always document how data is extracted and how it is stored, in data model documents. A data model document specifies how data goes from one place to the other, in this case from device logs, for example, to a table in Vertica. It includes things such as the definition of duplicates, queries to check for duplicates, and of course the logical design of the tables, the physical design of the tables, and the rationale. Next to it, there is a data dictionary that explains, for each column in the data model, from a subject matter expert perspective, what it means: its definition and meaning, and if it is a measurement, the unit of measure and the range, or if it is some sort of label, the expected values, or whether the value is raw or calculated. This is essential for maximizing the value of the data and for allowing people to use it. Last but not least, there is an ETL design document that explains how the transformation happens from the source to the destination, including, very importantly, the failure handling strategy. For example, when you cannot parse part of a file, should you load only what you can parse, or drop the entire file completely?
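The duplicate checks that such a data model document might include can be as simple as the sketch below. It reuses the hypothetical sensor_readings table from earlier, and the choice of key columns is an assumption:

```sql
-- Hypothetical duplicate check from a data-model document: the same device,
-- timestamp, and sensor should appear at most once.
SELECT device_id, reading_time, sensor_name, COUNT(*) AS n
FROM   sensor_readings
GROUP  BY device_id, reading_time, sensor_name
HAVING COUNT(*) > 1;
```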
As for the loading strategy: import best effort, or all or nothing? How do you populate records for which there is no value, what are the default values, how is the data normalized or transformed, and how do you avoid duplicates? Again, this is very important to give the users of the data a full picture of the data itself. And this is a formal process: the documents are reviewed and approved by all the stakeholders, including the subject matter experts, the data scientists, and a function that we have started, called the data architect. And of course, the documents are available to the end users of the data. We even have links to the documents inside the data warehouse. So if you get access to the database and you are doing your research and you see a table or a view and you think, well, that could be interesting, it looks like something I could use for my research, the data itself has a link to the document. From the database, while you are exploring the data, you can retrieve a link to the place where the document is available. This is just a quick summary of some of the results that I am allowed to share at this moment. This is about image-guided therapy: using our remote service infrastructure, for remotely connected systems with the right contracts, we have reduced downtime by 14%. More than one out of three cases are resolved remotely, without an engineer having to go on site. 82% is the first-time-right fix rate, which means that the issue is fixed either remotely or, if a visit to the site is needed, with only one visit; the engineer arrives with the right part and fixes it straight away. This results, on average, in 135 hours more operational availability per year, and therefore the ability to treat more patients for the same cost. I would like to conclude by citing some nice testimonials from some of our customers, showing that the value we have created is really high impact, and this concludes my presentation. Thanks for your attention so far. >> Thank you Mauro, very interesting. And we've got a number of questions that have come in, so let's get to them. The first one: how many devices has Philips connected worldwide? And how do you determine which related sensor data gets analyzed with Vertica? >> Okay, so these are two questions. The first question: how many devices are connected worldwide? Well, I'm not allowed to tell you the precise number of connected devices worldwide, but what I can tell you is that we are in the order of tens of thousands of devices, and of all types, actually. And then, how do we determine which sensor data gets analyzed with Vertica? As I said in the presentation, it is a combination of two approaches: a data-driven approach and a knowledge-driven approach. A knowledge-driven approach because we make maximum use of our knowledge of the failure modes and of the behavior of the medical devices and their components to select what we think are promising data points and promising features. However, from that moment on, data science kicks in, and data science is used to look at the actual data and come up with quantitative information about what is really happening. So it could be that an expert is convinced that a particular range of values of a sensor is indicative of a particular failure.
And it may turn out that he was too optimistic, or the other way around, that in practice there are many other situations he was not aware of. That can happen. So thanks to the data we get a better understanding of the phenomenon and we get better models. I hope that answers it; any other questions? >> Yeah, we have another question. Do you have plans to perform any analytics at the edge? >> Now that's a good question. I can't disclose our plans on this right now, but edge devices are certainly one of the options we look at to help our customers towards zero unplanned downtime. Not only that, but also to facilitate the integration of our solution with existing and future hospital IT infrastructure. We're talking about advanced security and privacy, guaranteeing that patient data and clinical data remain safe and do not go outside the perimeter of the hospital, while we enhance our functionality and provide more value with our services. So yes, edge is definitely a very interesting area of innovation. >> Another question: what are the most helpful Vertica features that you rely on? >> The first that comes to mind at this moment is ease of integration. With Vertica we are able to load any data source in a very easy way, and it can be interfaced very easily with all kinds of applications. That by itself is not unique to Vertica; the added value is that it is coupled with incredible speed, incredible speed for loading and for querying. So it is a very versatile tool for innovating fast in data science. Another thing is multiple projections, with advanced encoding and compression. This allows us to perform optimizations only when we need them, and without having to touch applications or queries: if we want to achieve higher performance, we spend a little effort on improving the projections, and we can very often achieve dramatic increases in performance. Another feature is Eon Mode, which is great for cloud deployment. >> Okay, another question. What is the number one lesson learned that you can share? >> My advice would be: document and control your entire data pipeline, end to end, and create positive feedback loops. What I hear often is that enterprises that are not digitally native struggle to innovate in big data and to do data-driven innovation, and Philips is one of them; Philips is 129 years old as a company, so you can imagine the legacy we have. We were not born with the web, like web companies, with everything online and everything digital. In such enterprises the data is not always available, or it sits in silos; data is controlled by different parts of the organization with different processes, and there is no super strong enterprise IT system providing all the data to everybody through APIs. So my advice is, from the very beginning, to create as soon as possible an end-to-end solution, from data creation to consumption, that creates value for all the stakeholders of the data pipeline. It is important that everyone in the pipeline, from the producers of the data to the consumers, gets a piece of the value, a piece of the cake.
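The projection tuning mentioned in the earlier answer can be illustrated with a small, hypothetical example: an additional projection, sorted and encoded for a frequent query pattern, can be added without touching the application or the query. The table, columns, and encodings are assumptions, not the actual Philips schema:

```sql
-- Hypothetical sketch: an extra projection sorted for per-device time-series
-- queries, with run-length encoding on the low-cardinality leading columns.
-- Vertica chooses the best available projection automatically at query time.
CREATE PROJECTION sensor_readings_by_device (
    device_id    ENCODING RLE,
    sensor_name  ENCODING RLE,
    reading_time,
    sensor_value
) AS
SELECT device_id, sensor_name, reading_time, sensor_value
FROM   sensor_readings
ORDER  BY device_id, sensor_name, reading_time
SEGMENTED BY HASH(device_id) ALL NODES;
```

Note that a newly added projection has to be refreshed before it is used; the refresh procedure depends on your Vertica version, so treat this as a sketch rather than a recipe.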
Coming back to the lesson learned: when the value is proven to all stakeholders, everyone will naturally contribute to keeping the data pipeline running and to keeping the quality of the data high. That is the lesson there. >> Yeah, thank you. And in the area of machine learning, what types of innovations do you plan to adopt to help with your data pipeline? >> In the area of machine learning, we're looking at things like automatically detecting the deterioration of models to trigger improvement actions, as well as active learning, again focused on improving the accuracy of our predictive models. Active learning is when additional human intervention, the labeling of difficult cases, is triggered. The machine learning classifier may not be able to classify correctly all the time, and instead of just randomly picking some cases for a human to review, you want the costly humans to review only the most valuable cases from a machine learning point of view, the ones that would contribute the most to improving the classifier. Another area is deep learning, and also applications of more generic anomaly detection algorithms. The challenge with anomaly detection is that we are not only interested in finding anomalies but also in recommending the proper service actions, because without a proper service action, an alert generated because of an anomaly loses most of its value. So this is where I think we, you know... >> Go ahead. >> No, that's it, thanks. >> Okay, all right. So that's all the time that we have today for questions. I want to thank the audience for attending Mauro's presentation and also for your questions. If we weren't able to answer your question today, we'll let you know that we'll respond via email. And again, our engineers will be on the Vertica forums awaiting your other questions. It would help us greatly if you could give us some feedback and rate the session before you sign off; your rating will help guide us when we're looking at content to provide for the next Vertica BDC. Also, note that a replay of today's event and a PDF copy of the slides will be available on demand; we'll let you know when that will be by email, hopefully later this week. And of course, we invite you to share the content with your colleagues. Again, thank you for your participation today. This concludes this breakout session, and we hope you have a wonderful day. Thank you. >> Thank you.
Vertica in Eon Mode: Past, Present, and Future
>> Paige: Hello everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled Vertica in Eon Mode: Past, Present, and Future. I'm Paige Roberts, open source relations manager at Vertica, and I'll be your host for this session. Joining me are Vertica engineer Yuanzhe Bei and Vertica product manager David Sprogis. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait till the end; just type your question or comment as you think of it in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation. We'll answer as many of your questions as we're able to during that time, and any questions that we don't address, we'll do our best to answer offline. If you wish, after the presentation you can visit the Vertica forums to post your questions there, and our engineering team is planning to join the forums to keep the conversation going, just like a Dev Lounge at a normal, in-person BDC. As a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides, if you want to see them bigger. And yes, before you ask, this virtual session is being recorded and will be available to view on demand this week. We will send you a notification as soon as it's ready. All right, let's get started. Over to you, Dave. >> David: Thanks, Paige. Hey, everybody. Let's start with a timeline of the life of Eon Mode. About two years ago, a little less than two years ago, we introduced Eon Mode on AWS, pretty specifically for the purpose of rapid scaling to meet the cloud economics promise. It wasn't long after that we realized that workload isolation, a byproduct of the architecture, was very important to our users, and going to the third tick you can see that the importance of that workload isolation was manifest in Eon Mode being made available on-premise using Pure Storage FlashBlade. Moving to the fourth tick mark, we took steps to improve workload isolation with a new type of subcluster, which Yuanzhe will go through, and at the fifth tick mark, the introduction of secondary subclusters for faster scaling and other improvements which we will cover in the slides to come. Let's get started with why we created Eon Mode in the first place. Imagine that your database is this pie, a pecan pie, and we're loading pecan data in through the ETL cutting board in the upper left-hand corner. We have a couple of free-floating pecans, which we might imagine to be data supporting external tables; as you know, Vertica has a query engine capability as well, which we call external tables. So if we imagine this pie, we want to serve it with a number of servers. Let's say we wanted to serve it with three servers, three nodes. We would need to slice that pie into three segments, and we would serve each one of those segments from one of our nodes. Now, because the data is important to us and we don't want to lose it, we're going to be saving that data on some kind of RAID storage, or redundant storage: in case one of the drives goes bad, the data remains available because of the durability of RAID. Imagine also that we care about the availability of the overall database. Imagine that a node goes down, perhaps the second node goes down; we still want to be able to query our data, and through nodes one and three we still have all three segments covered, and we can do this because of buddy projections.
Each node's neighbor contains a copy of the data from the node next to it. So in this case, node one shares its segment with node two; node two can cover node one, node three can cover node two, and node one covers node three. Adding a little more complexity, we might store the data in different copies, each copy sorted for a different kind of query. We call these projections in Vertica, and for each projection we have another copy of the data, sorted differently. Now it gets complex: what happens when we want to add a node? If we wanted to add a fourth node here, we would have to figure out how to re-slice all of the data in all of the copies that we have. In effect, we want to take our three slices and slice them into four, which means taking a portion of each of our existing thirds and re-segmenting into quarters. That looks simple in the graphic here, but when it comes to moving data around it becomes quite complex, because for each copy of each segment we need to replace it and move that data onto the new node. What's more, the fourth node can't hold its own buddy copy; that would be problematic in case it went down. Instead, we need that buddy to be sitting on a neighboring node, so we need to re-orient the buddies as well. All of this takes a lot of time; it can take 12, 24, or even 36 hours, during a period when you do not want your database under high demand. In fact, you may want to stop loading data altogether in order to speed it up. This is a planned event, and your applications should probably be down during this period, which makes it difficult. With the advent of cloud computing, we saw that services were coming up and down faster, and we determined to re-architect Vertica in a way that accommodates that rapid scaling. Let's see how we did it. Let's start with four nodes now; we've got our four-node database. Let's add communal storage and move each of the segments of data into communal storage. That's the separation we're talking about. What happens if we run queries against it? Well, it turns out that the communal storage is not necessarily performant, so the IO would be slow, which would make the overall queries slow. To compensate for the lower performance of communal storage, we need to add back local storage. It doesn't have to be RAID, because this is just an ephemeral copy, but with the data files local to the node, the queries will run much faster. In AWS, communal storage really does mean an S3 bucket, and here's a simplified version of the diagram. Now, do we need to store all of the data from the segment in the depot? The answer is no, and the graphic inside the bucket has changed to reflect that. It looks more like a bullseye, showing just a portion of the data being copied to the cache, or the depot as we call it, on each one of the nodes. How much data do you store on the node? It would be the active data set: the last 30 days, the last 30 minutes, or whatever period of time you're working with. The active working set is the hot data, and that's how large you want to size your depot. By architecting this way, when you scale up you're not re-segmenting the database. What you're doing is adding more compute and more subscriptions to the existing shards of the existing database. So in this case, we've added a complete set of four nodes.
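One way to see this subscription layout on a running Eon database is to query the catalog. The system table and column names below are given from memory and should be treated as assumptions to verify against the documentation for your Vertica version:

```sql
-- Sketch (names assumed): which node subscribes to which shard, and the
-- state of each subscription.
SELECT node_name,
       shard_name,
       subscription_state
FROM   v_catalog.node_subscriptions
ORDER  BY shard_name, node_name;
```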
So we've doubled our capacity and we've doubled our subscriptions, which means that now two nodes can serve the yellow shard, two nodes can serve the red shard, and so on. In this way, we're able to run twice as many queries in the same amount of time, so you're doubling the concurrency. How high can you scale? Can you scale to 3X, 5X? We tested this in the graphic on the right, which shows concurrent users on the X axis and the number of queries executed in a minute on the Y axis. We grouped execution in runs of 10 users, 30 users, 50, 70, up to 150 users. Focusing on any one of these groups, particularly up around 150, you can see three bars. Starting with the bright purple bar, three nodes and three segments: as you add nodes, at the middle purple bar, six nodes and three segments, you've almost doubled your throughput, up to the dark purple bar, which is nine nodes and three segments, and our tests show that you can go to 5X with a pretty linear performance increase. Beyond that, you continue to get an increase in performance, but your incremental performance begins to fall off. The Eon architecture does something else for us: it provides high availability, because each of the nodes can be thought of as ephemeral and, in fact, each node has a buddy subscription, in a way similar to the prior architecture. So if we lose node four, we're losing the node responsible for the red shard, and now node one has to pick up responsibility for the red shard while that node is down. When a query comes in, let's say it comes into node one and node one is the initiator, then node one will look for participants. It will find a blue shard and a green shard, but when it's looking for the red, it finds itself, and so node one will be doing double duty. This means that your performance will be cut roughly in half for the query. This is acceptable until you are able to restore the node; once you restore it and the depot becomes rehydrated, your performance goes back to normal. So this is a much simpler way to recover nodes in the event of node failure. By comparison, in Enterprise Mode, the older architecture, when we lose the fourth node, node one takes over responsibility for the first segment, the yellow segment, and the red segment, but it is also responsible for rehydrating the entire data segment of the red shard back to node four. This can be very time consuming and imposes even more stress on the first node, so performance will go down even further. Eon Mode has another feature: you can scale down completely to zero. We call this hibernation. You shut down your database, and your database will maintain full consistency in a rest state in your S3 bucket; when you need access to your database again, you simply recreate your cluster, revive your database, and you can access it once again. That concludes the rapid scaling portion of why we created Eon Mode. To take us through workload isolation, here is Yuanzhe Bei. Yuanzhe? >> Yuanzhe: Thanks Dave, for presenting how Eon works in general. In the next section, I will show you another important capability of Vertica Eon Mode: workload isolation. Dave used a pecan pie as an example of a database. Now let's say it's time for the main course. Does anyone still have a problem with food touching on their plate? Parents know that it's a common problem for kids. Well, we have a similar problem in databases as well.
There can be multiple different workloads accessing your database at the same time. Say you have ETL jobs running regularly, while at the same time there are dashboards running short queries against your data. You may also have the end-of-month report running, and there can be ad hoc data scientists connecting to the database to do whatever data analysis they want, and so on. How to make these mixed workloads not interfere with each other is a real challenge for many DBAs. Vertica Eon Mode provides you the solution. I'm very excited to introduce an important concept in Eon Mode called subclusters. In Eon Mode, nodes belong to predefined subclusters rather than to the whole cluster. DBAs can define different subclusters for different kinds of workloads and redirect those workloads to specific subclusters. For example, you can have an ETL subcluster, a dashboard subcluster, a report subcluster, and an analytics and machine learning subcluster. Vertica Eon subclusters are designed to achieve three main goals. First of all, strong workload isolation: any operation in one subcluster should not affect, or be affected by, other subclusters. For example, say the subcluster running the reports is quite overloaded, and the data scientists are running heavy analytic and machine learning jobs on the analytics subcluster, making it very slow, even stuck or crashed. In such a scenario, your ETL and dashboard subclusters should not be impacted, or at most minimally impacted, by this crisis, which means your ETL jobs should not lag behind and your dashboards should respond in a timely way. We have made a lot of improvements here as of the 10.0 release and will continue to deliver improvements in this category. Secondly, fully customized subcluster settings: any subcluster can be set up and tuned for very different workloads without affecting other subclusters. Users should be able to tune certain parameters up or down based on the actual needs of the individual subcluster's workload. As of today, Vertica already supports a few settings at the subcluster level, for example the depot pinning policy, and we will continue extending this to more settings, such as resource pools, in the near future. Lastly, Vertica subclusters should be easy to operate and cost efficient: a subcluster should be able to be turned on, turned off, added, or removed, and be available for use according to rapidly changing workloads. Let's say you want to spin up more dashboard subclusters because you need more dashboard capacity; you can do that. You might need to run several report subclusters because you want to run multiple reports at the same time, while on the other hand you can shut down your analytics machine learning subcluster because no data scientists need it at this moment. We have made a lot of improvements in this category, which I'll explain in detail later, and one of the ultimate goals is to support auto-scaling. To sum up, what we really want to deliver for subclusters is very simple: you just need to remember that accessing subclusters should be just like accessing individual clusters, except that these subclusters share the same catalog, so you don't have to deal with stale data and don't need to worry about data synchronization.
That is a nice goal, and Vertica's upcoming 10.0 release is certainly a milestone towards it: it will deliver a large part of the capability in this direction, and we will continue to improve it after the 10.0 release. In the next couple of slides, I will highlight some issues with workload isolation in the initial Eon release and show you how we resolved them. The first issue: when we initially released our first, so-called subcluster mode, it was implemented using fault groups. Fault groups and subclusters have something in common: they are both defined as a set of nodes. However, they are very different in all other ways, so this was confusing from the start. As of version 9.3.0, we decided to detach the subcluster definition from fault groups, which enabled us to further extend the capability of subclusters. Fault groups in pre-9.3.0 versions are converted into subclusters during the upgrade, and this was a very important step that enabled us to provide all the following improvements to subclusters. The second issue in the past was that it was hard to control the execution groups for different types of workloads. There are two types of problems here, and I will use some examples to explain. The first is about controlling group size. Say you allocate six nodes for your dashboard subcluster, and what you really want is what is shown on the left: three pairs of nodes acting as three execution groups, where each pair of nodes subscribes to all four shards. However, that's not really what you get. What you actually get is shown on the right: the first four nodes subscribe to one shard each and the remaining two nodes subscribe to the two dangling shards. So you don't really get three execution groups; you only get one, and the two extra nodes add no value at all. The solution is to use subclusters: instead of having one subcluster with six nodes, you can split it up into three smaller ones. Each subcluster is guaranteed to subscribe to all the shards, and you can then spread work across these three subclusters using a load balancer. In this way you achieve three real execution groups. The second issue is that session participation is non-deterministic. Any session will just pick four random nodes from the subcluster, as long as they cover one shard each. In other words, you don't really know which set of nodes will make up your execution group. What's the problem? In this case, the fourth node will be double-booked by two concurrent sessions, and you can imagine that the resource usage will be imbalanced and the performance of both queries will suffer. What is even worse is when the queries of the two concurrent sessions target different tables: that reduces depot efficiency, because both sessions will try to fetch the files of two different tables into the same depot, and if your depot is not large enough, they will evict each other, which is very bad. You can solve this the same way, by declaring subclusters, in this case two subclusters, and a load balancer group across them. The reason this solves the problem is that session participation does not cross the subcluster boundary.
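As a hedged sketch of what such a load balancer group could look like: Vertica's connection load balancing can group a subcluster and route client connections to it. The statement syntax and the names below are assumptions based on my understanding of the feature, so verify them against the documentation for your release before use:

```sql
-- Sketch (syntax and names to verify): route dashboard clients to a
-- dedicated subcluster through a load balance group and a routing rule.
CREATE LOAD BALANCE GROUP dashboard_group
    WITH SUBCLUSTER dashboard_sc
    FILTER '0.0.0.0/0';

CREATE ROUTING RULE dashboard_clients
    ROUTE '10.20.0.0/16' TO dashboard_group;
```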
So there won't be a case where any node is double-booked. And in terms of the depot: if you use subclusters and avoid using a load balancer group across them, carefully sending the first workload to the first subcluster and the second to the second subcluster, the result is that depot isolation is achieved. The first subcluster will maintain the data files for the first query, and you don't need to worry about those files being evicted by the second kind of session. Here comes the next issue: scaling down. In the old way of defining subclusters, you might have several execution groups in the subcluster and you want to shut one or two of them down to save cost. Here comes the pain: because you don't know which nodes may be used by which session at any point, it is hard to find the right time to hit the shutdown button on any of the instances. If you do, and get unlucky, say you pull the first four nodes, one of the sessions will fail because it is participating on node two and node four at that point. The user of that session will notice because their query fails, and we know that for many businesses this is a critical problem and not acceptable. Again, with subclusters this problem is resolved, for the same reason: sessions cannot cross the subcluster boundary. So all you need to do is first stop sending queries to the first subcluster, and then you can shut down the instances in that subcluster; you are guaranteed not to break any running sessions. Now you're happy and you want to shut down more subclusters, and then you hit issue four: the whole cluster goes down. Why? Because the cluster loses quorum. As a distributed system, you need more than half of the nodes to be up in order to commit and keep the cluster up. This is to prevent catalog divergence, which is important. But you still want to shut down those nodes: what's the point of keeping them up if you are not using them, letting them cost you money, right? So Vertica has a solution: you can define a subcluster as secondary to allow it to shut down without worrying about quorum. In this case, you can define the first three subclusters as secondary and the fourth one as primary. By doing so, the secondary subclusters will not be counted towards the quorum, because we changed the rule: instead of requiring more than half of the nodes to be up, it only requires more than half of the primary nodes to be up. Now you can shut down your second subcluster, and even shut down your third subcluster as well, and the remaining primary subcluster keeps running healthily. There are actually more benefits to defining secondary subclusters beyond the quorum concern: because secondary subclusters no longer have voting power, they don't need to persist the catalog anymore. This means those nodes are faster to deploy and can be dropped and re-added without worrying about catalog persistence. For subclusters that only need to serve read-only queries, it is best practice to define them as secondary. Commits are also faster on secondary subclusters, so queries running there will see fewer spikes. The primary subcluster, as usual, handles everything: it is responsible for consistency, and the background tasks run there. So DBAs should make sure that the primary subcluster is stable and running all the time. Of course, you need at least one primary subcluster in your database.
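Vertica exposes meta-functions for switching a subcluster between primary and secondary. The calls below reflect my understanding of those functions, and the subcluster name is hypothetical; check the documentation for your version:

```sql
-- Sketch: demote a read-only subcluster so it no longer counts toward
-- quorum, and promote it back later if needed (subcluster name hypothetical).
SELECT DEMOTE_SUBCLUSTER_TO_SECONDARY('dashboard_sc');
SELECT PROMOTE_SUBCLUSTER_TO_PRIMARY('dashboard_sc');
```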
Now, with secondary subclusters, users can start and stop them as they need, which is very convenient, and this brings up another issue: what if there is an ETL transaction running and, in the middle of it, a subcluster starts up and comes online? In older versions, there was no catalog resync mechanism to bring the new subcluster up to date, so Vertica rolled back the ETL session to keep the data consistent. This is quite disruptive, because real-world ETL workloads can sometimes take hours, and rolling back at the end means a large waste of resources. We resolved this issue in version 9.3.1 by introducing a catalog resync mechanism for this situation: ETL transactions no longer roll back but instead take some time to resync the catalog and then commit, and the problem is resolved. The last issue I would like to talk about is subscriptions. Especially for a large subcluster, the startup time used to be quite long, because subscription commits used to be serialized. In one of our internal tests with a large catalog, committing a single subscription took about five minutes. A secondary subcluster is better, because it doesn't need to persist the catalog during the commit, but it still takes about two seconds per subscription to commit. So what's the problem here? Let's do the math and look at this chart. The X axis is the time in minutes and the Y axis is the number of nodes to be subscribed. The dark blue represents your primary subcluster and the light blue represents the secondary subcluster. Let's say the subcluster has 16 nodes in total. If you start a secondary subcluster, it will spend about 30 seconds in total, because 2 seconds times 16 is 32. That is not actually very long, but if you expect starting a secondary subcluster to be super fast, so that you can react to fast-changing workloads, 30 seconds is no longer trivial. What is even worse is on the primary subcluster side: because each commit takes, let's assume, five minutes, by the point you are committing the sixth node's subscription, the other nodes have already waited 30 minutes for the GCLX, the global catalog lock, and Vertica will crash a node if it cannot get the GCLX for 30 minutes. The end result is that your whole database crashes. That's a serious problem, we know it, and that's why we are already planning the fix for 10.0: all the subscriptions will be batched up and all the nodes will commit at the same time, concurrently. By doing that, the primary subcluster can finish committing in five minutes instead of crashing, and the secondary subcluster can finish even in seconds. That summarizes the highlights of the improvements we have made as of 10.0, and I hope you are already excited about the emerging Eon deployment pattern shown here. A primary subcluster that handles data loading, ETL jobs, and Tuple Mover jobs is the backbone of the database, and you keep it running all the time. At the same time, you define different secondary subclusters for different workloads, provision them when the workload requirement arrives, and de-provision them when the workload is done, to save operational cost. If you can't wait to play with subclusters, here are some Admin Tools commands you can start using, and for more details, check out our Eon subcluster documentation. Thanks everyone for listening, and I'll hand back to Dave to talk about Eon on-prem.
>> David: Thanks Yuanzhe. At the same time that Yuanzhe and the rest of the dev team were working on the improvements that Yuanzhe described, and on other improvements, this guy, John Yovanovich, stood on stage and told us about his deployment at AT&T, where he was running Eon Mode on-prem. Now, this was only six months after we had launched Eon Mode on AWS, so when he told us that he was putting it into production on-prem, we nearly fell out of our chairs. How is this possible? We took a look back at Eon and determined that the workload isolation and the improvements to operations for restoring nodes, among other things, had sufficient value that John wanted to run it on-prem. And he was running it on the Pure Storage FlashBlade. Taking a second look at the FlashBlade, we thought, all right, does it have the performance? Yes, it does. The FlashBlade is a collection of individual blades, each of them with NVMe storage on it, which is not only performant but scalable. So we then asked: is it durable? The answer is yes. The data safety is implemented with N+2 redundancy, which means that up to two blades can fail and the data remains available. With this we realized DBAs can sleep well at night knowing that their data is safe; after all, Eon Mode outsources durability to the communal storage data store. Does FlashBlade have the capacity for growth? Yes, it does. You can start as low as 120 terabytes and grow as high as about eight petabytes, so it certainly covers the range for most enterprise usages. And operationally, it couldn't be easier to use. When you want to grow your database, you can simply pop new blades into the FlashBlade unit, and you can do that hot. If one goes bad, you can pull it out and replace it hot, so you don't have to take your data store down, and therefore you don't have to take Vertica down. Knowing all of these things, we got behind Pure Storage and partnered with them to implement the first version of Eon on-premise. That changed our roadmap a little bit. We were imagining it would start with Amazon and then go to Google, then to Azure, and at some point to Alibaba Cloud, but as you can see from the left column, we started with Amazon and went to Pure Storage. From Pure Storage we went to MinIO, and we launched Eon Mode on MinIO at the end of last year. MinIO is a little different from Pure Storage: it's software only, so you can run it on pretty much any x86 servers and cluster them with storage to serve up an S3 bucket. It's a great solution for up to about 120 terabytes; beyond that, we're not sure about the performance implications because we haven't tested it, but for your dev environments or small production environments, we think it's great. With Vertica 10, we're introducing Eon Mode on Google Cloud. This means not only running Eon Mode in the cloud, but also being able to launch it from the marketplace. We're also offering Eon Mode on HDFS with version 10: if you have a Hadoop environment and you want to breathe fresh life into it with the high performance of Vertica, you can do that starting with version 10. Looking forward, we'll be moving Eon Mode to Microsoft Azure. We expect to have something breathing in the fall and to offer it to select customers for beta testing, and then we expect to release it sometime in 2021. Following that, further on the horizon, is Alibaba Cloud.
Now, to be clear, we will be putting Vertica in Enterprise Mode on Alibaba Cloud in 2020, but Eon Mode is going to trail behind; whether it lands in 2021 or not, we're not quite sure at this point. Our goal is to deliver Eon Mode anywhere you want to run it, on-prem or in the cloud, or both, because one of the great value propositions of Vertica is its hybrid capability, the ability to run both in your on-prem environment and in the cloud. What's next? I've got three priority and roadmap slides; this is the first of the three. We're going to start with improvements to the core of Vertica, beginning with query crunching, which allows you to run long-running queries faster by getting nodes to collaborate; you'll see that coming very soon. We'll be making improvements to large clusters, and specifically large cluster mode. The management of large clusters, over 60 nodes, can be tedious, and we intend to improve that, in part by creating a third network channel to offload some of the communication that we're now loading onto our spread, or agreement, protocol. We'll be improving depot efficiency, we'll be pushing more controls down to the subcluster level, allowing you to manage your resource pools at the subcluster level, and we'll be pairing tuple moving with data loading. From an operational flexibility perspective, we want to make it very easy to shut down and revive primaries and secondaries, on-prem and in the cloud. Right now it's a little bit tedious, very doable, but we want to make it as easy as a walk in the park. We also want to allow you to revive into a different-size subcluster, and last but not least, and probably most important, the ability to change shard count. This has been a sticking point for a lot of people, and it puts a lot of pressure on the early decision of how many shards your database should have. Whether it lands in 2020 or 2021, we know it's important to you, so it's important to us. Ease of use is also important to us, and we're making big investments in the management console to improve managing subclusters, as well as to help you manage your load balancer groups. We also intend to grow and extend Eon Mode to new environments. Now we'll take questions and answers.
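As a point of reference for the "tedious but doable" shut-down-and-revive flow David mentions above, today it looks roughly like the sketch below; the database name, hosts, bucket, and auth file are placeholders, and the options should be verified against the documentation for your release.

  # Shut down the Eon database when it is not needed
  admintools -t stop_db -d verticadb -p 'dbadmin_password'
  # Later, if the original nodes are gone, revive the database from communal storage onto a fresh set of hosts
  admintools -t revive_db -d verticadb -s host1,host2,host3 \
    --communal-storage-location=s3://vertica-communal/verticadb -x auth_params.conf
  # Start the revived database
  admintools -t start_db -d verticadb -p 'dbadmin_password'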