Day 4 Keynote Analysis | AWS re:Invent 2022


 

(upbeat music) >> Good morning everybody. Welcome back to Las Vegas. This is day four of theCUBE's wall-to-wall coverage of our Super Bowl, aka AWS re:Invent 2022. I'm here with my co-host, Paul Gillin. My name is Dave Vellante. Sanjay Poonen is in the house, CEO and president of Cohesity. He's sitting in as our guest market watcher, market analyst, you know, deep expertise, new to the job at Cohesity. He was kind enough to sit in and help us break down what's happening at re:Invent. But Paul, first thing, this morning we heard from Werner Vogels. He was basically giving a masterclass on system design. It reminded me of mainframes years ago, when we used to, you know, burrow through those IBM blue books and red books. You remember those, Sanjay? That's how we learned back then. >> Oh God, I remember those. Yeah. >> But it made me think, wow, IBM was the one doing systems design back then, and nobody talks about IBM anymore. Everybody talks about Amazon. So you wonder, 20 years from now, you know, what it's going to be. But- >> Well- >> Werner's amazing. >> He pulled out a 24 year old document. >> Yup. >> That he had written early in Amazon's evolution about asynchronous design, about essentially distributed architectures, that turned out to be prophetic. >> His big thing was nature is asynchronous. So systems are asynchronous. Synchronous is an illusion. It's an abstraction. It's kind of interesting. But, you know- >> Yeah, I mean, I've heard synonyms for things like that. Timeless architecture. Werner's an absolute legend. I mean, when you think about folks who've had, you know, impact on technology, you think of people like Jony Ive in design. >> Dave: Yeah. >> You've got to think about people like Werner in architecture. And just the fact that Andy and the team have been able to keep him engaged that long... I pay attention to his keynote. Peter DeSantis has obviously been very, very influential. And then of course, you know, Adam did a good job, you know, watching from, you know... I've watched since the first AWS re:Invent conference. At the time I was president at SAP, and there were only a thousand people at this event, okay? Andy had me on stage. I think I was one of the first guests from any tech company, in 2011. And to see this now become like a mecca, the mother of all IT events, and to watch even the transition from Andy to Adam, is very special. I got to catch some of Ruba's keynote. So while there are some new people in the mix here, this has become a force of nature. And the last time I was here was 2019, before Covid; I watched the last two ones online. But it feels like, I don't know about what you guys think, it feels like it's back to 2019 levels. >> I was here in 2019. I feel like this was bigger than 2019, but some people have said that it's about the same. >> I think it was 60,000 versus 50,000. >> Yes. So close. >> It was a little bigger in 2019. But it feels like it's more active. >> And then last year, Sanjay, you weren't here, but it was 25,000, which was amazing 'cause it was right in that little window before Omicron hit. But you know, let me ask you a question, and this is really more of a question about Amazon's maturity, and I know you've been following them since the early days. The number one question I get from people is: how is AWS going to be different under Adam than it was under Andy? What do you think? >> I mean, Adam's not new, because he was here before.
In some senses he knows the Amazon culture from before, when he was running sales and marketing. But then he took time off and came back. I mean, this will always be, I think, somewhat Andy's baby, right? Because he was the... I, you know, sent him a text: "You should be really proud of what you accomplished." But you know, I also asked him when I saw him a few weeks ago, "Are you going to come to re:Invent?" And he says, "No, I want to leave this to be Adam's show." And Adam's going to have a slightly different view. His keynotes are probably half the length. It's a little bit more vision. There were a lot more customer stories at the beginning of it, taking you back to the inspirational pieces of it. I think you're going to see them probably moving up the stack and not just focused on infrastructure. Many of their platform services have evolved; many of their application services, even. I'm surprised when I talk to customers. Like Amazon Connect, their sort of call-center-type technology, an app layer — it's getting a lot of traction. I mean, I've talked to a couple of Fortune 500 companies that are moving off Avaya to Connect. I mean, it's happening, and I did not know that. So, you know, I think as they move up the stack, the platform's gotten more... the data-centric stack has gotten stronger. And you know, in the area we're working in at Cohesity — security, data protection — they're an investor in our company. So this is an important, you know, both tech player and partner for many companies like us. >> I wonder about the, you know, the marketplace... there's been a big push on the marketplace by all the cloud companies the last couple of years. Do you see that disrupting the way software, enterprise software, is sold? >> Oh, for sure. I mean, you have to be an ostrich with your head in the sand to not see this wave happening. I mean, what is it, $150 billion worth of revenue? Even though the growth rates dipped a little bit the last quarter or so, in aggregate, between Amazon and Azure and Google, it's still, you know, 30% growth. And I think we're still in the second or third inning of a grand 1 trillion or 2 trillion of IT shifting — not all of it to the cloud, but significantly faster. So if you add up all of the big players of the on-premise world, they got to a certain size, and their growth is stable but stalling. These guys are growing significantly faster. And then you add on top of them the platform companies, the data companies — Snowflake, MongoDB, Databricks, you know, Datadog — and then apps companies on top of that. I think the move to the cloud is inevitable. And SaaS companies — I don't know why you would ever implement a CRM solution on-prem. It's all gone to the cloud. >> Oh, it is. >> That happened 15 years ago. I mean, that began within three to five years of the advent of Salesforce. And the same thing in HR. Why would you deploy an HR solution now? You've got Workday, you've got, you know, others. So some of those apps markets are just never coming back to an on-prem capability. >> Sanjay, I want to ask you — you built a reputation for being able to, you know, forecast accurately, hit your plan, you know, hit your numbers. You're an awesome operator, even though you have a, you know, technology degree — a two-tool star, a multi-tool star. But I call it the slingshot economy. I mean, I've seen probably more downturns than anybody in here, you know, given... Well maybe, maybe- >> Maybe me. >> You and I both.
I've never seen anything like this, where visibility is so unpredictable. The economy is sling-shotting. It's like: oh, hurry up, Covid's here, go, go, go, build, build, build supply — then pull back. And now going forward, now pulling back. Slootman said, you know, on the call, "Hey, the guide is the guide." He said, "We put it out there, we do our best to hit it." But CrowdStrike had issues, you know, in the mid-market. ServiceNow — I saw McDermott the other day on TV. I just want to, you know, buy from the guy. He's so (indistinct). >> But mixed, mixed results. Salesforce, you know; Okta now pre-announcing — or announcing, you know, better visibility, a better forward guide. Elastic kind of got hit really hard. HPE and Dell actually doing really well in the enterprise. >> Yep. >> 'Course Dell getting killed in the client business. But so what are you seeing out there? How, as an executive, do you deal with such poor visibility? >> I think, listen, what the last two or three years have taught us — you know, with the supply chain crisis, with the surge in spending people thought they might need during the pandemic — is you have to start off with your tech platform being 10x better than everybody else. And differentiate, differentiate, differentiate. In a crowded market, and even in a market that's getting tougher, if you're not differentiating constantly through technology innovation, you're going to get left behind. You named a few companies — they're all technology innovators, even if some of them are having challenges. And then I think you're constantly asking yourself: how do you move from being a point product to a platform with more and more services, with, you know, many of them moving really fast? In the case of Nikesh Arora, I like him a lot. He's probably one of the most savvy operators, also, that I respect. He calls these speedboats, and you know, his core platform started off with the firewall, network security. But he's built now a very credible cloud security, cloud AI security business. And I think that's how you need to be thinking as a tech executive. I mean, have you got your core beachhead 10x better than everybody else? And as you move to adjacencies in these new platforms, have you got speedboats that are getting to a point where they are a competitive advantage? Then as you think of it from a go-to-market perspective, it really depends on where you are as a company. For a company of our size, we need partners a lot more. Because if we're going to, you know, stand on the shoulders of giants — like Isaac Newton said, "I see clearly because I stand on the shoulders of giants" — I need to really go and cultivate Amazon so they become our lead partner in cloud, and then appropriately Microsoft and Google where I need to. And security: part of what we announced last week — last month — yeah, a couple of weeks ago, was the data security alliance with the biggest security players. What was I trying to do with that? First time ever done in my industry: get Palo Alto, CrowdStrike, Qualys, Tenable, CyberArk, Splunk, all to build an alliance with me, so I could stand on their shoulders with them helping me. If you're a bigger company, you're constantly asking yourself, how do you make sure that, like Amazon, your top hundred customers are spending more with you? So I think the playbook evolves, and I'm watching how some of these best companies navigate through this time.
And I think leadership is going to be tested in enormously interesting ways. >> I'll say. I mean, Snowflake is really interesting, because they... 67% growth, which is, I mean, best in class for a company that's at $2 billion. But their guide was still, you know, pretty aggressive. So it's like, you know, when it's good times you go, "Hey, we can guide conservatively and know we can beat it." But when you're not certain, you can't dial down too far, 'cause your investors start to bail on you. It's a really tricky- >> But Dave, I think, listen, at the end of the day, every CEO should not be worried about the short-term up and down in the stock price. You're building a long-term multi-billion-dollar company. In the case of Frank, he has, I think, a shot at a $10 billion, you know, analytics, data warehousing, data management company on the back of that platform, because he's eyeing the market that not just Teradata occupies today, but that Oracle or other databases now occupy, right? So as his TAM grows bigger, you're going to have some of these things, but that market's big. I think same with Palo Alto. I mean, Datadog's another company: 75% growth. >> Yeah. >> At 20% margins — like, almost rule of 95 (75% growth plus 20% margins). >> Amazing. >> When they're going after not just the observability market — they're eating up the SIEM market, security analytics, the APM market. So I think, you know, you look at these case studies of companies who are going from point product to platforms and are steadily able to grow into new TAMs. You know, to me that's very inspiring. >> I get it. >> Sanjay: That's what I seek to do at our company. >> I get that it's a marathon, but you know, when you were at VMware, weren't you looking at the stock price every day, just out of curiosity? I mean, listen, you weren't micromanaging it. >> You do, and you certainly look on the days of earnings and so on and so forth. >> Yeah. >> Because you want to create shareholder value. >> Yeah. >> I'm not saying that you should not, but I think an obsession with that, you know, in the short term- >> Going to kill ya. >> Makes you, you know, sort of myopically focused on what may not be the right thing in the long term. Now, in the long arc of time, if you're not creating shareholder value... Look at what happened to Steve Ballmer. You needed Satya to come in to change things, and he's created a lot of value. >> Dave: Yeah, big time. >> But I think in the short term, my comments were really on the quarter to quarter. Over four or 12 quarters, if companies are growing and creating profitable growth, they're going to get the valuation they deserve. >> Dave: Yeah. >> Do you... I want to ask you about something Arvind Krishna said in the previous IBM earnings call: that IT is deflationary, and therefore it is resistant to the macroeconomic headwinds — so IT spending should actually thrive in an adverse economic climate. Do you think that's true? >> Not all forms of IT. I pay very close attention to surveys, whether it's from the industry analysts or the Morgan Stanleys or Goldman Sachses — the financial analysts. And I think there's a glut in certain sectors that will get pulled back. The traditional view is, when economies are growing, people spend on the top line, front-office stuff: sales, marketing.
If you go and look at just the Cloud 100 companies, which are the hottest private companies, and maybe at the public market companies, there are way too many companies focused on sales and marketing. Way too many. I think during a downsizing and recession, that's going to probably shrink some, because they were all built for the 2009-to-2021 era, where it was all about the top line. Okay, now maybe there's a proposition for companies who are focused on cost optimization, supply chain visibility. Security has been untouchable, and that, I think, is going to continue to see investment. So I tell people: listen, if you are a tech investor, or if you're an operator, pay attention to CIO priorities. And right now, in our business at Cohesity, part of the reason we've embraced things like ransomware protection is there is a big focus on security. And you know, by intelligently being a management and a security company around data, I do believe we'll continue to be extremely relevant to CIO budgets. There are 20 ransomware attempts every second. So things of that kind make you relevant to a bank. You have to stay relevant to a buying pattern, or else you lose momentum. >> But I think what's happening now is actually IT spending's pretty good. I mean, I track this stuff pretty closely. It's just that expectations were so high, and now you're seeing earnings estimates come down, and, okay, you've got the, you know, the inflationary factors in your discounted cash flows, but the market's actually pretty good. >> Yeah. >> You know, relative to other downturns. We're actually not in a downturn. >> Yeah. >> Not yet, anyway. It may be. >> It's a valuation downturn. >> You have to prepare. >> Not sales. >> Yeah, that's right. >> When I was on CNBC, I said, "Listen, it's a little bit like that story of Joseph: seven years of feast, seven years of famine." You have to prepare for potentially your worst. And if it's not the worst, you're in good shape. So will it be a recession in 2023? Maybe. You know, high interest rates, inflation, war in Russia-Ukraine — maybe things do get bad. But if you do the belt-tightening, if you're focused on operational excellence, then if it's not a recession, you're pleasantly surprised; if it is one, you're prepared for it. >> All right. I'm going to put you on the spot and ask you for predictions. Expert analysis on the World Cup. What do you think? Give us the breakdown. (group laughs) >> As my... I wish India were in the World Cup, but you can't get enough Indians to play soccer well enough, so we're not there. >> You play cricket, though. >> I'm a US man first. I would love to see one of Brazil or Argentina. And as a Messi person — I don't know if you'll get that — it would be really special for Messi to end his career like Maradona, winning a World Cup. I don't know if that'll happen. I'm probably going to go with one of the Latin American countries, if the US doesn't make it far enough. But first loyalty to the US team, and then, after, one of the Latin American countries. >> And you think one of the Latin American countries is the best bet to win, or? >> I don't know. It's hard to tell. They're all... What happens now at this stage >> So close, right? >> is anybody could win. >> Yeah. You just have lots of shots on goal. I'm a big soccer fan. I mean, I don't know if the US is favored to win, but if they get far enough — you get to the finals, anybody could win. >> I think they get the Netherlands next, right? >> That's tough.
>> Really tough. >> The European teams are good too, but I would like to see the US go far enough, and then I'd like to see Latin America win it — one of Argentina or Brazil. That's my prediction. >> I know you're a big cricket fan. Are you able to follow cricket the way you like? >> At ungodly times of the night, because they're in Australia, right? >> Oh yeah. >> Yeah. I watched the T20 World Cup — select games of it. Yeah, you know, I'm not avidly following every single game, but the World Cup games I catch. >> Yeah, it's good. >> It's good. I mean, I love every sport. American football, soccer. >> That's great. >> You get into basketball now — I mean, I hope the Warriors come back strong. Hey, how about the Warriors-Celtics? What do we think? Do we do it again? >> Well- >> This year. >> I'll tell you what- >> As a Boston Celtics- >> I would love that. I actually still have to pay off some folks from the Palo Alto office on some bets. We are seeing unprecedented NBA performance this year. >> Yeah. >> It's amazing. You look at the stats — it's like nothing we've ever seen before. I know it's early. >> Well, always a pleasure talking to you guys. >> Great to have you on. >> Thanks for having me. >> Thank you. Love the expert analysis. >> Sanjay Poonen, Dave Vellante. Keep it right there. re:Invent 2022, day four. We're winding up in Las Vegas. We'll be right back. You're watching theCUBE, the leader in enterprise and emerging tech coverage. (lighthearted soft music)

Published Date: Dec 1, 2022



Blueprint for Trusted Infrastructure Episode 2: Full Episode


 

>>The cybersecurity landscape continues to be one characterized by a series of point tools designed to do a very specific job, often pretty well, but the mosaic of tooling has grown over the years, causing complexity, driving up costs, and increasing exposures. So the game of whack-a-mole continues. Moreover, the way organizations approach security is changing quite dramatically. The cloud, while offering so many advantages, has also created new complexities. The shared responsibility model redefines what the cloud provider secures — for example, the S3 bucket — and what the customer is responsible for, e.g., properly configuring the bucket. You know, this is all well and good, but because virtually no organization of any size can go all in on a single cloud, that shared responsibility model now spans multiple clouds, each with different protocols. That of course includes on-prem and edge deployments, making things even more complex. Moreover, the DevOps team is being asked to be the point of execution to implement many aspects of an organization's security strategy. >>This extends to securing the runtime, the platform, and even now containers, which can end up anywhere. There's a real need for consolidation in the security industry, and that's part of the answer. We've seen this both in terms of mergers and acquisitions as well as platform plays that cover more and more ground. But the diversity of alternatives and infrastructure implementations continues to boggle the mind, with more and more entry points for the attackers. This includes sophisticated supply chain attacks that make it even more difficult to understand how to secure components of a system, and how secure those components actually are. The number one challenge CISOs face in today's complex world is lack of talent to address these challenges. And I'm not saying that SecOps pros are not talented — they are. There just aren't enough of them to go around, and the adversary is also talented and very creative, and there are more and more of them every day. >>Now, one of the very important roles that a technology vendor can play is to take mundane infrastructure security tasks off the plates of SecOps teams. Specifically, we're talking about shifting much of the heavy lifting around securing servers, storage, networking, and other infrastructure and their components onto the technology vendor, via R&D and other best practices like supply chain management. And that's what we're here to talk about. Welcome to the second part in our series, A Blueprint for Trusted Infrastructure, made possible by Dell Technologies and produced by theCUBE. My name is Dave Vellante and I'm your host. Previously we looked at what trusted infrastructure means and the role that storage and data protection play in the equation. In this part two of the series, we explore the changing nature of technology infrastructure, how the industry generally, and Dell specifically, are adapting to these changes, and what is being done to proactively address threats that are increasingly stressing security teams. >>Now today we continue the discussion, and look more deeply into servers, networking, and hyper-converged infrastructure, to better understand the critical aspects of how one company, Dell, is securing these elements so that DevSecOps teams can focus on the myriad new attack vectors and challenges that they face. First up is Deepak Rangaraj, PowerEdge security product manager at Dell Technologies.
And after that, we're gonna bring on Mahesh Nagarathnam, who is a consultant in the networking product management area at Dell. And finally, we'll close with Jerome West, who is the product management security lead for HCI — hyperconverged infrastructure — and converged infrastructure at Dell. Thanks for joining us today. We're thrilled to have you here and hope you enjoy the program. Deepak Rangaraj is the PowerEdge security product manager at Dell Technologies. Deepak, great to have you on the program. Thank you. >> Thank you for having me. >> So we're going through the infrastructure stack, and in part one of this series we looked at the landscape overall — how cyber has changed, and specifically how Dell thinks about data protection and security in a manner that both secures infrastructure and minimizes organizational friction. We also hit on the storage part of the portfolio. So now we want to dig into servers. So my first question is: what are the critical aspects of securing server infrastructure that our audience should be aware of? >> Sure. So if you look at compute in general, right, it has rapidly evolved over the past couple of years, especially with trends toward software-defined data centers, and with organizations having to deal with hybrid environments where they have private clouds, public cloud locations, remote offices, and also remote workers. So on top of this, there's also an increase in the complexity of the supply chain itself, right? There are companies who are dealing with hundreds of suppliers as part of their supply chain. So all of this complexity provides a lot of opportunity for attackers, because it's expanding the threat surface of what can be attacked, and attacks are becoming more frequent, more severe and more sophisticated. And this has also triggered a rise in the regulatory mandates around security needs. >>And these regulations are not just in the government sector, right? They extend to critical infrastructure, and eventually they also get into the private sector. In addition to this, organizations are also looking at their own internal compliance mandates. And these could be based on the industry in which they're operating, or on their own security postures. And this is the landscape in which servers are operating today. And given that servers are the foundational blocks of the data center, it becomes extremely important to protect them. And given how complex modern server platforms are, it's also extremely difficult, and it takes a lot of effort. And this means protecting everything from the supply chain to the manufacturing, and then eventually assuring the hardware and software integrity of the platforms, and also the operations. And there are very few companies that go to the lengths that Dell does in order to secure the server. We truly believe in the notion and the security mentality that, you know, security should enable our customers to go focus on their business and proactively innovate on their business, and it should not be a burden to them. And we heavily invest to make that possible for our customers. >> So this is really important, because the premise that I set up at the beginning of this was really that, as a security pro — I'm not a security pro, but if I were — I wouldn't want to be doing all this infrastructure stuff, because I now have all these new things I gotta deal with. I want a company like Dell, who has the resources, to build that security in, to deal with the supply chain, to ensure the provenance, et cetera.
So I'm glad you hit on that. But given what you just said, what does cybersecurity resilience mean from a server perspective? For example, are there specific principles that Dell adheres to that are non-negotiable? Let's say, how does Dell ensure that its customers can trust your server infrastructure? >> Yeah. When it comes to security at Dell, right, it's ingrained in our products — that's the best way to put it. And security is non-negotiable, right? It's never an afterthought, where we come up with a design and then later on figure out how to go make it secure. In our security development lifecycle, the products are being designed to counter these threats right from the beginning. And in addition to that, we are also testing and evaluating these products continuously to identify vulnerabilities. We also have external third-party audits which supplement this process. And in addition to this, Dell makes the commitment that we will rapidly respond to any vulnerabilities and exposures found out in the field, and provide mitigations and patches in a timely manner. So this security principle is built into every phase of our server lifecycle, right? >>So we want our products to provide cutting-edge capabilities when it comes to security. As part of that, we are constantly evaluating where our security model stands; we are building on it and continuously improving it. Until a few years ago, our model was primarily based on the NIST framework of protect, detect and recover, and it still aligns really well to that framework. But over the past couple of years, we have seen how compute has evolved, how the threats have evolved, and we have also seen the regulatory trends, and we recognize the fact that the best security strategy for the modern world is a zero trust approach. And so now, when we are building our infrastructure and tools and offerings for customers, first and foremost they're cyber resilient, right? What we mean by that is they're capable of anticipating threats, withstanding attacks, rapidly recovering from attacks, and also adapting to the adverse conditions in which they're deployed. The process of identifying and designing these capabilities, however, is done through the zero trust framework. And that's very important, because now we are also anticipating how our customers will end up using these capabilities to enable their own zero trust IT environments and zero trust deployments. We have completely adapted our security approach to make it easier for customers to work with us, no matter where they are in their journey towards zero trust adoption. >> So thank you for that. You mentioned this framework, and you talked about zero trust. When I think about NIST, I think as well about layered approaches. And when I think about zero trust, I think about: if you don't have access to it, you're not getting access — you've gotta earn that access — and you've got layers, and then you still assume that bad guys are gonna get in. So you've gotta detect that, and you've gotta respond. So server infrastructure security is so fundamental. So my question is, what is Dell providing, specifically, to, for example, detect anomalies and breaches from unauthorized activity? How do you enable fast and easy — or facile — recovery from malicious incidents? >> Right. That is exactly right.
Breaches are bound to happen. And given how complex our current environment is — it's extremely distributed and extremely connected, right? — data and users are no longer contained within offices, where we can set up a perimeter firewall and say, yeah, everything within that is good, we can trust everything within it. That's no longer true. The best approach to protect data and infrastructure in the current world is to use a zero trust approach, which uses the principle that nothing is ever trusted, right? Nothing is trusted implicitly. You're constantly verifying every single user, every single device, and every single access in your system, at every single level of your IT environment. And these are the principles that we use on PowerEdge, right — but with an increased focus on providing granular controls and checks based on the principle of least privileged access. >>So the idea is that servers, first and foremost, need to make sure that threats never enter — they're rejected at the point of entry. But we recognize breaches are going to occur, and if they do, they need to be minimized, such that the sphere of damage caused by an attacker is contained, so they're not able to move laterally from one part of the network to something else, or escalate their privileges and cause more damage, right? So the blast radius, for instance, has to be reduced. And this is done through features like automated detection capabilities and automated remediation capabilities. Some examples: as part of our end-to-end boot resilience process, we have what we call a system lockdown, right? We can lock down the configuration of the system, lock down the firmware versions and all changes to the system. And we have capabilities which automatically detect any drift from that lockdown configuration, and we can figure out if the drift was caused by authorized changes or unauthorized changes. >>And if it is an unauthorized change, we can log it, generate security alerts, and we even have capabilities to automatically roll the firmware, and OS versions, back to a known good version — and also the configurations, right? And this becomes extremely important, because as part of zero trust we need to respond to these things at machine speed; we cannot do it at human speed. And having these automated capabilities is a big deal in achieving that zero trust strategy. And in addition to this, we also have chassis intrusion detection: if the chassis — the server box — is opened up, it logs alerts, and you can figure out even later, if there's been an AC power cycle, by going to look at the logs, that the box was opened up, and figure out whether it was a known, authorized access, or some malicious actor opening and changing something in your system. >> Great, thank you for that. Lots of detail, and I appreciate that. I want to go somewhere else now, 'cause Dell has a renowned supply chain reputation. So what about securing the supply chain and the server bill of materials? What does Dell specifically do to track the provenance of components it uses in its systems, so that when the systems arrive, a customer can be a hundred percent certain that that system hasn't been compromised? >> Right. And we've talked about how complex the modern supply chain is, right? And that's no different for servers. We have hundreds of components on the server, and a lot of these require firmware in order to be configured and run, and this firmware could be coming from third-party suppliers.
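To make the lockdown-and-drift idea Deepak describes concrete, here is a minimal Python sketch of the pattern: fingerprint a golden configuration, detect drift, and classify it as authorized or not. The configuration fields and the rollback step are hypothetical illustrations, not Dell's actual iDRAC interface.

```python
import hashlib
import json

def fingerprint(config: dict) -> str:
    # Hash a canonical form of the configuration so any change is detectable.
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Hypothetical "golden" configuration captured when the system was locked down.
GOLDEN = {"firmware": "2.17.0", "boot_order": ["disk", "pxe"], "ipmi_over_lan": False}
GOLDEN_HASH = fingerprint(GOLDEN)

def check_drift(current: dict, authorized_keys: set) -> None:
    if fingerprint(current) == GOLDEN_HASH:
        return  # configuration still matches the lockdown baseline
    drifted = {k for k in set(GOLDEN) | set(current) if current.get(k) != GOLDEN.get(k)}
    if drifted <= authorized_keys:
        print(f"Authorized drift in {sorted(drifted)}; logging only.")
    else:
        print(f"UNAUTHORIZED drift in {sorted(drifted - authorized_keys)}; alerting.")
        # A real platform would roll firmware and configuration back to the
        # known-good version at machine speed, as described above.

check_drift({"firmware": "2.18.1", "boot_order": ["disk", "pxe"], "ipmi_over_lan": True},
            authorized_keys={"firmware"})
```

The point of the hash is that the baseline is immutable: any delta at all surfaces, and only then is it triaged into authorized versus unauthorized change.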
So now, with the complexity that we are dealing with, what matters is the end-to-end approach, and that's where Dell pays a lot of attention to assuring security. It starts all the way from sourcing components, right? And then through the design, and then even the manufacturing process, where we are vetting the personnel at the factories and vetting the factories themselves. And the factories also have physical security controls built into them — and even shipping, right? We have GPS tagging of packages. So all of this is built to ensure supply chain security. >>But a critical aspect of this is also making sure that the systems which are built in the factories are delivered to the customers without any changes or any tamper. And we have a feature called secure component verification which is capable of doing this. What the feature does is, when the system gets built in a factory, it generates an inventory of all the components in the system, and it creates a cryptographic certificate based on the signatures presented by those components. And this certificate is stored separately and sent to the customers separately from the system itself. So once the customers receive the system at their end, they can run a tool that generates an inventory of the components on the system at their end, and then compare it to the golden certificate to make sure nothing was changed. And if any changes are detected, we can figure out if it's an authorized change or an unauthorized change. >>Again, authorized changes could be, you know, upgrades to the drives or memory, and unauthorized changes could be any sort of tamper. So that's the supply chain aspect of it. And the bill of materials is also an important aspect of ensuring security, right? We provide a software bill of materials, which is basically a list of ingredients — all the software pieces in the platform. What it allows our customers to do is quickly take a look at all the different pieces, compare them against a vulnerability database, and see if any of the vulnerabilities which have been discovered out in the wild affect the platform. So that's a quick way of figuring out if the platform has any known vulnerabilities that have not been patched. >> Excellent. That's really good. My last question is, I wonder if you could give us the summary from your perspective: what are the key strengths of the Dell server portfolio from a security standpoint? I'm really interested in, you know, the uniqueness and the strong suits that Dell brings to the table. >> Right. Yeah. We have talked enough about the complexity of the environment and how zero trust is necessary for the modern IT environment, right? And this is integral to Dell PowerEdge servers. And as part of that, you know, security starts with the supply chain. We already talked about secure component verification, which is a unique feature that Dell platforms have. And on top of it, we also have a silicon-based platform root of trust. This is a key which is programmed into the silicon on the servers during manufacturing and can never be changed after. And this immutable key is what forms the anchor for creating the chain of trust that is used to verify everything in the platform, from the hardware and software integrity to the boot — all pieces of it, right? In addition to that, we also have a host of data protection features.
>>Whether it is protecting data at rest, in use, or in flight: we have self-encrypting drives, which provide scalable and flexible encryption options, and this, coupled with external key management, provides really good protection for your data at rest. External key management is important because, you know, somebody could physically steal the server and walk away, but then the keys are not stored on the server — they're stored separately. So that provides an extra layer of security. And we also have dual-layer encryption, where you can complement the hardware encryption on the self-encrypting drives with software-level encryption. In addition to this, we have identity and access management features like multifactor authentication, single sign-on, and role-, scope- and time-based access controls, all of which are critical to enable the granular controls and checks a zero trust approach requires. So I would say, you know, if you look at the Dell feature set, it's pretty comprehensive, and we also have the flexibility built in to meet the needs of all customers, no matter where they fall in the spectrum of, you know, risk tolerance and security sensitivity. And we also have the capabilities to meet all the regulatory and compliance requirements. So in a nutshell, I would say that, you know, Dell PowerEdge cyber-resilient infrastructure helps accelerate zero trust adoption for customers. >> Got it. So you've really thought this through: all the various things that you would do to make sure that your server infrastructure is secure and not compromised, and that your supply chain is secure, so that your customers can focus on some of the other things that they have to worry about, which are numerous. Thanks, Deepak. Appreciate you coming on theCUBE and participating in the program. >> Thank you for having me. >> You're welcome. In a moment I'll be back to dig into the networking portion of the infrastructure. Stay with us for more coverage of A Blueprint for Trusted Infrastructure, in collaboration with Dell Technologies, on theCUBE, your leader in enterprise and emerging tech coverage. >>We're back with A Blueprint for Trusted Infrastructure, in partnership with Dell Technologies, on theCUBE. And we're here with Mahesh Nagarathnam, who is a consultant in the area of networking product management at Dell Technologies. Mahesh, welcome. Good to see you. >> Hey, good morning, Dave. Nice to meet you as well. >> Hey, so we've been digging into all the parts of the infrastructure stack, and now we're gonna look at the all-important networking components. Mahesh, when we think about networking in today's environment, we think about the core data center, and we're connecting out to various locations, including the cloud and both the near and the far edge. So the question is, from Dell's perspective, what's unique and challenging about securing network infrastructure that we should know about? >> Yeah. So a few years ago, IT security in an enterprise was primarily about putting a wrapper around the data center, because it was constrained to an infrastructure owned and operated by the enterprise for the most part.
So putting a wrapper around it, like a perimeter or a firewall, was a sufficient response, because you could basically control the environment, and it was small enough to control. Today, with distributed data, intelligent software, different systems, multi-cloud environments and as-a-service delivery, you know, the infrastructure for the modern era changes the way you secure the network infrastructure. In today's, you know, data-driven world, IT operates everywhere, and data is created and accessed everywhere — far from, you know, the centralized, monolithic data centers of the past. The biggest challenge is how do we build network infrastructure for the modern era that is intelligent, with automation, enabling maximum flexibility and business agility, without any compromise on security. We believe that in this data era, the security transformation must accompany digital transformation. >> Yeah, that's very good. You talked about a couple of things there. Data by its very nature is distributed. There is no perimeter anymore, so you can't just, as you say, put a wrapper around it. I like the way you phrased that. So when you think about cybersecurity resilience from a networking perspective, how do you define that? In other words, what are the basic principles that you adhere to when thinking about securing network infrastructure for your customers? >> So our belief is that cybersecurity and cybersecurity resilience need to be holistic. They need to be integrated and scalable, spanning the entire enterprise, with a common objective and policy implementation. So cybersecurity needs to span across all the devices, and run across any application, whether the application resides in the cloud or anywhere else in the infrastructure. From a networking standpoint, what does it mean? It's again the same principles, right? You know, in order to prevent threat actors from accessing, changing, destroying, or stealing sensitive data — this definition holds good for networking as well. So if you look at it from a networking perspective, it's the ability to protect from and withstand attacks on the networking systems as we continue to evolve. This will also include the ability to adapt and recover from these attacks, which is what the cyber resilience aspect is all about. And cybersecurity best practices, as you know, are a continuously changing landscape, primarily because the cyber threats also continue to evolve. >> Yeah, got it. So I like that. So it's gotta be integrated, it's gotta be scalable, it's gotta be comprehensive — comprehensive and adaptable. You're saying it can't be static. >> Right, right. So I think, you know, you had a second part of the question, you know, that says: what are the basic principles when you think about securing network infrastructure? When you're looking at securing the network infrastructure, it revolves around the core security capabilities of the devices that form the network. And what are these security capabilities? These are access control, software integrity, and vulnerability response. When you look at access control, it's to ensure that only authenticated users are able to access the platform, and they're able to access only the assets that they're authorized to, based on their user level. Now, access to a network platform like a switch or a router, for example, is typically used for, say, configuration and management of the networking switch.
So user access is based on, say, roles — role-based access control, whether you are a security admin or a network admin or a storage admin. And it's imperative that logging is enabled, because any change to the configuration is logged and monitored. Talking about software integrity, it's the ability to ensure that the software that's running on the system has not been compromised. And, you know, this is important, because otherwise an attacker could actually, you know, get hold of the system, and you could get undesired results. In terms of validation of the images, it needs to be done through, say, a digital signature. So it's important that when you're talking about software integrity: (a) you are ensuring that the platform is not compromised, and (b) that any upgrades, you know, that happen to the platform are happening through, say, a validated signature. >> Okay. So there's access control, software integrity, and I think you've got a third element, which is, I think, response — but please continue. >> Yeah, so you know, the third one is about vulnerability response. We follow the same process that's followed by the rest of the products within the Dell product family: to report or identify, you know, any kind of vulnerability that's then addressed by the Dell product security incident response team. So the networking portfolio is no different. You know, it follows the same process for identification, for triage, and for resolution of these vulnerabilities. And these are addressed either through patches or through new releases of the networking software. >> Yeah, got it. Okay. So I mean, you didn't say zero trust, but when you were talking about access control, you're really talking about access to only those assets that people are authorized to access. I know zero trust sometimes is a buzzword, but you, I think, gave it, you know, some clarity there. Software integrity — it's about assurance, validation, your digital signature you mentioned, and that there's been no compromise. And then how you respond to incidents in a standard way that can fit into a security framework. So, outstanding description; thank you for that. But then the next question is, how does Dell networking fit into the construct of what we've been talking about — Dell trusted infrastructure? >> Okay, so networking is the key element in the Dell trusted infrastructure. It provides the interconnect between the server and the storage world, and, you know, it's part of any data center configuration. For a trusted infrastructure, the network needs to have access control in place, where only authorized personnel are able to make changes to the network configuration, and logging of any of those changes is also done through the logging capabilities. Additionally, we should also ensure that the configuration provides network isolation between, say, the management network and the data traffic network, because they need to be separate and distinct from each other. And furthermore, even if you look at the data traffic network, you now have things like segmentation — isolated segments via VRFs, or some micro-segmentation via partners — and this allows various levels of security for each of those segments. So it's important, you know, that the network infrastructure has the ability to provide all these services, from a Dell networking security perspective, right?
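As a rough sketch of the role-based access control and change-logging Mahesh describes, consider the following Python example. The roles and the permission map are invented for illustration; they are not Dell's actual switch-management model.

```python
from enum import Enum

class Role(Enum):
    SECURITY_ADMIN = "security_admin"
    NETWORK_ADMIN = "network_admin"
    STORAGE_ADMIN = "storage_admin"

# Hypothetical mapping of roles to the asset classes they may configure.
PERMISSIONS = {
    Role.SECURITY_ADMIN: {"acl", "aaa", "logging"},
    Role.NETWORK_ADMIN: {"interface", "vlan", "routing"},
    Role.STORAGE_ADMIN: {"san_zone"},
}

def authorize(role: Role, asset: str) -> bool:
    allowed = asset in PERMISSIONS.get(role, set())
    # Every attempt is logged, allowed or not, so configuration changes stay auditable.
    print(f"AUDIT role={role.value} asset={asset} allowed={allowed}")
    return allowed

authorize(Role.NETWORK_ADMIN, "vlan")     # permitted
authorize(Role.STORAGE_ADMIN, "routing")  # denied, but still logged
```

The design point is that authorization and audit logging happen in the same code path, so a denied attempt leaves the same trail as a permitted change.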
>>You know, there are multiple layers of defense, you know, both at the edge and in the network, in the hardware and in the software — essentially, you know, a set of rules and configuration designed to protect the integrity, confidentiality, and accessibility of the network assets. So each network security layer implements policies and controls, as I said, you know, including network segmentation. We do have capabilities for centralized management, automation, and scalability, for that matter. Now, you add all of these things, you know, with the open networking standards and software-defined principles, and you essentially, you know, reach the point where, you know, you're looking at zero trust network access, which is essentially sort of a building block for increased cloud adoption. If you look at, say, you know, the different pillars of a zero trust architecture: if you look at the device aspect, you know, we do have support for security — for example, we do have, say, Trusted Platform Modules, TPMs, on certain of our products — and, you know, physical security, plain and simple. From a user trust perspective, you know, it's all done via access control, say via role-based access control, and, say, capabilities to provide, say, remote authentication, or things like, say, sticky MAC or MAC learning limits, and so on. >>If you look at, say, the transport and session trust layer, this is essentially, you know, how do you access, you know, this switch — you know, is it by plain old Telnet, or is it by secure SSH, right? And, you know, when a host communicates, you know, to the switch, we do have things like self-signed or certificate-authority-based certificates. And one important aspect is, you know, in terms of, you know, the routing protocol — say, for example, BGP — we do have the capability to support MD5 authentication between the BGP peers, so that there is no, you know, man-in-the-middle attack, you know, on the network where the routing table is compromised. And the other aspect is about, say, control plane protection. You know, it's typical that if you don't have protection on the control plane, you know, it could be flooded, and, you know, the switch could be compromised by, say, denial-of-service attacks. >>From an application trust perspective, as I mentioned, you know, we do have, you know, application-specific security rules, where you could actually define, you know, the specific security rules based on the specific applications, you know, that are running within the system. And I did talk about, say, the digital signature and the cryptographic checks that we do — for authentication, I mean, rather, for the authenticity and the validation, you know, of the image and the BIOS, and so on and so forth. Finally, you know, for data trust, we are looking at, you know, network separation. You know, the network separation could happen via VRFs or plain old VLANs, you know, which can bring about, say, the multi-tenancy aspects. We talked about some micro-segmentation as it applies to NSX, for example. The other aspect is, you know, we do have, with our own SmartFabric Services enabled in a fabric, a concept of cluster security. So all of this, you know — the different pillars — they sort of make up the zero trust infrastructure for the networking assets of an infrastructure. >> Yeah. So thank you for that.
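One way to picture the transport-trust controls just listed — Telnet off, SSH on, authenticated BGP peers — is a small configuration audit, sketched below in Python. The config syntax here is made up for the example; real network operating systems each have their own.

```python
# Illustrative running-config lines; actual NOS syntax varies by vendor.
RUNNING_CONFIG = """
ip ssh server enable
no ip telnet server enable
router bgp 65001
 neighbor 10.0.0.2 password <md5-secret>
""".strip().splitlines()

# Each check scans the config for evidence that a zero-trust control is in place.
CHECKS = {
    "SSH management access enabled": lambda cfg: any("ip ssh server enable" in line for line in cfg),
    "Plain Telnet disabled": lambda cfg: any("no ip telnet server" in line for line in cfg),
    "BGP peers authenticated": lambda cfg: any("neighbor" in line and "password" in line for line in cfg),
}

for name, check in CHECKS.items():
    status = "PASS" if check(RUNNING_CONFIG) else "FAIL"
    print(f"{status}: {name}")
```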
There's a lot to unpack there. You know, the premise really of this segment that we're setting up in this series is that everything you just mentioned — or a lot of the things you just mentioned — used to be the responsibility of the security team. And the premise that we're putting forth is that, because security teams are so stretched thin, you've gotta shift that to the vendor community. Dell specifically is shifting a lot of those tasks into its own R&D and taking care of a lot of that, 'cause SecOps teams have got a lot of other stuff to worry about. So my question relates to things like automation, which can help, and scalability. What about those topics as they relate to networking infrastructure? >> Okay. Our portfolio enables state-of-the-art automation software, you know, that enables simplifying of the design. So for example, we do have, you know, the fabric design center, a tool that automates the design of the fabric. And, you know, for deployment and, you know, management of the network infrastructure, there are simplicities, you know, using tools like Ansible, you know, with SONiC, for example. We do have SmartFabric Services that can automate the entire fabric, you know, for a storage solution, or, you know, for one of the workloads, for example. Now, we do help reduce the complexity by closely integrating the management of the physical and the virtual networking infrastructure. And again, you know, we have those capabilities using SONiC or SmartFabric Services. If you look at SONiC, for example, right, it delivers automated, intent-based, secure, containerized networking, and it has the ability to provide network visibility and awareness. And all of these things are actually valid, you know, for a modern networking infrastructure. So now, if you look at SONiC, you know, the usage of the tools, you know, that are available, you know, within the SONiC NOS is not restricted, you know, just to the data center infrastructure — it's a unified NOS, you know, that's applicable well beyond the data center, you know, right up to the edge. Now, if you look at it from a SmartFabric OS10 perspective, you know, as I mentioned, we do have SmartFabric Services, which essentially, you know, simplify the day-zero — I mean, rather day-one, day-two — deployment, expansion plans, and the lifecycle management of our converged infrastructure and hyper-converged infrastructure solutions. And finally, in order to enable, say, zero-touch deployment, we do have, you know, a VEP solution with our SD-WAN capability. So these are, you know, ways by which we bring down the complexity by, you know, enhancing the automation capability using, you know, a singular OS that can expand from the data center, now, right to the edge. >> Great, thank you for that. Last question, real quick: just pitch me. What can you summarize, from your point of view — what's the strength of the Dell networking portfolio? >> Okay, so from a Dell networking portfolio perspective, we support capabilities at multiple layers. As I mentioned, there's the physical security — for example, say, disabling of unused interfaces; sticky MAC and Trusted Platform Modules are the things to go after there. And when you're talking about, say, secure boot, for example, it delivers the authenticity and the integrity of the OS10 images at startup.
And secure boot also protects the startup configuration, so that, you know, the startup configuration file is not compromised. And secure boot also enables workload protection, for example. Another aspect is software image integrity validation, you know, wherein the image is validated against the digital signature, you know, prior to any upgrade process. And if you are looking at secure access control, we do have things like role-based access control, SSH to the switches, control plane access control that protects the CPU, and, say, access control through multifactor authentication. >>We do have various gates for entry control to the network, and, you know, certification support from a federal perspective. We do have, say, logging, wherein, you know, any event, any auditing capability is possible by, say, looking at the syslog servers, you know, which receive the messages transmitted from the devices, for example. And lastly, we talked about, say, network segmentation — you know, say, network separation. And, you know, this separation, you know, ensures that there is, you know, a contained, say, segment, you know, for a specific purpose or for a specific zone. And, you know, this can be implemented by, say, micro-segmentation, you know, or just plain old VLANs, or using a virtual routing framework — VRF, for example. >>A lot there. I mean, I think, frankly, you know, my takeaway is you guys do the heavy lifting in a very complicated topic. So thank you so much for coming on theCUBE and explaining that in quite some depth. Really appreciate it. >> Thank you indeed. >> Oh, you're very welcome. Okay, in a moment I'll be back to dig into the hyper-converged infrastructure part of the portfolio, and look at how, when you enter the world of software-defined — where you're controlling servers and storage and networks via a software-led system — you can be sure that your infrastructure is trusted and secure. You're watching A Blueprint for Trusted Infrastructure, made possible by Dell Technologies, in collaboration with theCUBE, your leader in enterprise and emerging tech coverage. >>Jerome West, product management security lead at Dell Technologies for HCI — hyper-converged infrastructure. Jerome, welcome. >> Thank you, Dave. >> Hey Jerome, in this series, A Blueprint for Trusted Infrastructure, we've been digging into the different parts of the infrastructure stack, including storage, servers and networking, and now we want to cover hyperconverged infrastructure. So my first question is, what's unique about HCI that presents specific security challenges? What do we need to know? >> So what's unique about hyper-converged infrastructure is the breadth of the security challenge. We can't simply focus on a single type of IT system — like a server or a storage system or a piece of virtualization software. I mean, HCI is all of those things. So luckily we have excellent partners like VMware, Microsoft, and internal partners like the Dell PowerEdge team, the Dell storage team, the Dell networking team, and on and on. These partnerships and these collaborations are what make us successful from a security standpoint. So let me give you an example to illustrate. In the recent past, we're seeing growing scope and sophistication in supply chain attacks.
This means an attacker is going to attack your software supply chain upstream so that, hopefully for them, a piece of malicious code that wasn't identified early in the software supply chain is distributed by a large player, like a VMware or Microsoft or a Dell. So to confront this kind of sophisticated, hard-to-defeat problem, we need short-term solutions, and we need long-term solutions as well.
>> So for the short-term solution, the obvious thing to do is to patch the vulnerability. The complexity for our HCI portfolio is that we build our software on VMware, so we would have to consume a patch that VMware would produce and provide it to our customers in a timely manner. Luckily, VxRail's engineering team has co-engineered a release process with VMware that significantly shortens our development lifecycle, so that VMware would produce a patch, and within 14 days we will integrate our own code with the VMware release, we will have tested and validated the update, and we will give an update to our customers within 14 days of that VMware release. As a result of this kind of rapid development process, VxRail had over 40 releases of software updates last year. For a longer-term solution, we're partnering with VMware and others to develop a software bill of materials. We work with VMware to consume their software manifest, including their upstream vendors and their open source providers, to have a comprehensive list of software components. Then we aren't caught off guard by an unforeseen vulnerability, and we're more easily able to detect where the software problem lies so that we can quickly address it. So these are the kinds of relationships and solutions that we can co-engineer with effective collaborations with our partners.
>> Great, thank you for that description. So if I had to define what cybersecurity resilience means to HCI or converged infrastructure, my takeaway was you've got to have a short-term instant patch solution, and then you've got to do an integration in a very short time, you know, two weeks, to have that integration done. And then longer term you have to have a software bill of materials so that you can ensure the provenance of all the components. Help us out: is that the right way to think about cybersecurity resilience? Do you have any additions to that definition?
>> I do. I really think that sizes up cybersecurity resilience for HCI, because like I said, it has sort of unprecedented breadth across our portfolio. It's not a single thing, it's a bit of everything. So really the strength, or the secret sauce, is to combine all the solutions that our partners develop while integrating them with our own layer. So let me give you an example. HCI is basically taking a software abstraction of hardware functionality and implementing it in something called the virtualization layer; it's basically virtualizing hardware functionality. Take, say, a storage controller: you could implement it in hardware, but for HCI, for example in our VxRail portfolio, we integrated it into a product called vSAN, which is provided by our partner VMware. So that portfolio strength comes through our partnerships.
>> So what we do is integrate these security functions and features into our product. So our partnership extends to our ecosystem through VMware products like NSX, Horizon, Carbon Black and vSphere.
All of them integrate seamlessly with VMware, and we also leverage VMware's software partnerships on top of that. So for example, VxRail supports multifactor authentication through vSphere integration with something called Active Directory Federation Services, or ADFS. There are a lot of providers that support ADFS, including Microsoft Azure. So now we can support a wide array of identity providers, such as Auth0, or, as I mentioned, Azure or Active Directory, through that partnership. So we can leverage all of our partners' partnerships as well; there's sort of a second layer. Being able to secure all of that provides a lot of options and flexibility for our customers. So basically, to summarize my answer: we consume all of the security advantages of our partners, but we also expand on them to make a product that is comprehensively secured at multiple layers, from the hardware layer that's provided by Dell through PowerEdge, to the hyper-converged software that we build ourselves, to the virtualization layer that we get through our partnerships with Microsoft and VMware.
>> Great, I mean that's super helpful. You've mentioned NSX, Horizon, Carbon Black, all the VMware components, Auth0, which the developers are going to love, you've got Azure identity, so it's really an ecosystem. So you may have actually answered my next question, but I'm going to ask it anyway, because you've got this software-defined environment and you're managing servers and networking and storage with this software-led approach: how do you ensure that the entire system is secure end to end?
>> That's a really great question. The answer is we do testing and validation as part of the engineering process; it's not just bolted on at the end. For example, VxRail is the market's only co-engineered solution with VMware. Other vendors sell VMware as a hyper-converged solution, but we actually include security as part of the co-engineering process with VMware. So it's considered when VMware builds their code, and their process dovetails with ours, because we have a secure development lifecycle, which others may have talked about in their discussions with you, that we integrate into our engineering lifecycle. Because we follow the same framework, all of the code should interoperate from a security standpoint. And so when we do our final validation testing, when we do a software release, we're already halfway there in ensuring that all these features will give the customers what we promised.
>> That's great. All right, let's close. Pitch me: what would you say is the strong suit? Summarize the strengths of the Dell hyper-converged infrastructure and converged infrastructure portfolio, specifically from a security perspective, Jerome.
>> So I talked about how hyper-converged infrastructure simplifies security management, because basically you're going to take all of these features that were abstracted in hardware, and they're now abstracted in the virtualization layer. Now you can manage them from a single point of view; for VxRail, say, that would be vCenter, for example. So by abstracting all this, you make it very easy to manage security, and highly flexible, because now you don't have limitations around a single vendor. You have a wide array of choices and partnerships to select from. So I would say that is the key to HCI.
Now, what makes Dell the market leader in HCI is not only do we have that functionality, but we also make it exceptionally useful to you because it's co-engineered, it's not bolted on. So I gave the example of the SBOM, and I gave the example of how we modified our software release process with VMware to make it very responsive.
>> A couple of other features that we have specific to HCI are digitally signed LCM updates. This is an example of a feature that's exclusive to Dell, that's not done through a partnership. We digitally sign our software updates so the user can be sure that the update they're installing into their system is an authentic and unmodified product. So we give it a Dell signature that's validated prior to installation. So not only do we consume the features that others develop in a seamless and fully validated way, but we also bolt on our own specific HCI security features that work with all the other partnerships and give the user an exceptional security experience. For example, the benefit to the customer is you don't have to create a complicated security framework that's hard for your users to use and hard for your system administrators to manage; it all comes in a package. It can all be managed through vCenter, for example, and then the specific hyper-converged functions can be managed through VxRail Manager or through SDDC Manager. So there are very few panes of glass that the administrator or user ever has to worry about. It's all self-contained and manageable.
>> That makes a lot of sense. So you've got your own infrastructure, you're applying your best practices to that, like the digital signatures, you've got your ecosystem, you're doing co-engineering with the ecosystems, delivering security in a package, minimizing the complexity at the infrastructure level. The reason, Jerome, this is so important is because SecOps teams, you know, they've got to deal with cloud security, they've got to deal with multiple clouds, now they have their shared responsibility model going across multiple clouds, and they've got all this other stuff that they have to worry about: they've got to secure the containers and the runtime and the platform and so forth. So they're being asked to do other things. If they have to worry about all the things that you just mentioned as well, they'll never get there, and security is going to get worse. So my takeaway is, you're removing that infrastructure piece and saying, okay guys, you now can focus on those other things that are not necessarily Dell's domain, and you can work with other partners and your own teams to really nail that. Is that a fair summary?
>> I think that is a fair summary, because absolutely the worst thing you can do from a security perspective is provide a feature that's so unusable that the administrator disables it, or other key security features. So when I work with my partners to define and develop a new security feature, the thing I keep foremost in mind is: will this be something our users want to use and our administrators want to administer? Because if it's not, if it's something that's too difficult or onerous or complex, then I try to find ways to make it more user-friendly and practical.
And this is a challenge sometimes, because our products operate in highly regulated environments, and sometimes they have to have certain rules and certain configurations that aren't the most user-friendly or management-friendly. So I put a lot of effort into thinking about how we can make a feature useful while still complying with all the regulations that we have to comply with. And by the way, we're very successful in the highly regulated space. We sell a lot of VxRail, for example, into the Department of Defense and banks and other highly regulated environments, and we're very successful there.
>> Excellent. Okay, Jerome, thanks. We're going to leave it there for now. I'd love to have you back to talk about the progress that you're making down the road. Things always advance in the tech industry, so we would appreciate that. >> I would look forward to it. Thank you very much, Dave.
>> You're really welcome. In a moment I'll be back to summarize the program and offer some resources that can help you on your journey to secure your enterprise infrastructure. I want to thank our guests for their contributions in helping us understand how investments by a company like Dell can both reduce the need for DevSecOps teams to worry about some of the more fundamental security issues around infrastructure, and provide greater confidence in the quality, provenance and data protection designed into core infrastructure like servers, storage, networking, and hyper-converged systems. You know, at the end of the day, whether your workloads are in the cloud, on-prem or at the edge, you are responsible for your own security. But vendor R&D and vendor process must play an important role in easing the burden faced by security, dev and operations teams. And on behalf of theCUBE production, content and social teams, as well as Dell Technologies, we want to thank you for watching A Blueprint for Trusted Infrastructure. Remember, part one of this series, as well as all the videos associated with this program, and of course today's program, are available on demand at thecube.net, with additional coverage at siliconangle.com. And you can go to dell.com/securitysolutions to learn more about Dell's approach to securing infrastructure, and there are tons of additional resources that can help you on your journey. This is Dave Vellante for theCUBE, your leader in enterprise and emerging tech coverage. We'll see you next time.

Published Date : Oct 4 2022


Breaking Analysis: Big 4 Cloud Revenue Poised to Surpass $100B in 2021


 

>> From theCUBE studios in Palo Alto and in Boston, bringing you data-driven insights from theCUBE and ETR, this is Breaking Analysis with Dave Vellante. >> There are four A players in the IaaS slash PaaS hyperscale cloud services space: AWS, Azure, Alibaba, and Alphabet. Pretty clever, huh? In our view, these four have the resources, the momentum, and the stamina to outperform all others virtually indefinitely. Combined, we believe these companies will generate more than $115 billion in 2021 IaaS and PaaS revenue. That is a substantial chunk of market opportunity that is growing as a whole in the mid-30% range in 2021. Welcome to this week's Wikibon Cube Insights, powered by ETR. In this Breaking Analysis, we are initiating coverage of Alibaba for our IaaS and PaaS market segments, and we'll update you on the latest hyperscale cloud market data and survey data from ETR. Big week in hyperscale cloud land: Amazon and Alphabet reported earnings, and AWS CEO Andy Jassy was promoted to lead Amazon overall. I interviewed John Furrier on theCUBE this week. John has a close relationship with Jassy and a unique perspective on these developments. We simulcast the interview on Clubhouse, and then hosted a two-hour Clubhouse room that brought together all kinds of great perspectives on the topic. And then we took the conversation to Twitter. Now, in that discussion, we were just riffing on our updated cloud estimates and our numbers, and here's the tweet that inspired the addition of Alibaba. Now, this gentleman is a tech journalist out of New Delhi, and he pointed out that we were kind of overlooking Alibaba. I responded that no, we're not discounting them, we just needed to do more homework on the company's cloud business. He also said we're ignoring IBM, but really they're not in this conversation as a hyperscale IaaS competitor to the big four, in our view. And we'll just leave it at that for now on IBM. But back to Alibaba and the big four: we actually did some homework, so thank you for that suggestion. This chart shows our updated IaaS figures and includes the full year 2020, which was pretty close to our Q4 projections. You know, the big change is we've added Alibaba to the mix. Now, these four companies last year accounted for $86 billion in revenue, and they grew at a 41% rate combined relative to 2019. Notably, Azure revenue for the first time is more than half that of AWS's revenue, which of course topped $45 billion last year, which is just astounding. Alibaba, you'll note, is larger than Google Cloud, the Google Cloud Platform, I should say GCP, at just over eight billion for Alibaba. Now, the reason Baba is such a formidable competitor is because the vast majority of its revenue comes from inside China. And the company does have plans to continue its international expansion, so we see Alibaba as a real force here. Their cloud business showed positive EBITDA for the first time in the history of the company last quarter, so that has people excited. Now, Google, as we've often reported, is far behind AWS and Azure. Despite its higher growth rates, Google's overall cloud business lost 5.6 billion in 2020, which has some people concerned. We, on the other hand, are thrilled, because as we've reported, in our view Google needs to get its head out of its ads; cloud is its future. And we're very excited about the company pouring investments into its cloud business.
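As a quick sanity check on those aggregate figures, here's a toy sketch in Python. The inputs are just the rounded numbers cited above, and the 34% figure is our assumption standing in for "mid-30% range"; none of this is a model, only arithmetic:

    # Toy sanity check of the aggregate big-four figures cited above (rounded).
    rev_2020 = 86.0          # combined 2020 IaaS/PaaS revenue, in $B
    growth_2020 = 0.41       # combined growth vs. 2019
    rev_2019 = rev_2020 / (1 + growth_2020)
    growth_2021 = 0.34       # assumed mid-30% growth for 2021
    rev_2021 = rev_2020 * (1 + growth_2021)
    print(round(rev_2019, 1))  # ~61.0, the implied 2019 base
    print(round(rev_2021, 1))  # ~115.2, consistent with "more than $115 billion"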
Look, with $120 billion essentially on Google's balance sheet, we can't think of a better use of its cash. Now, I want to stress that these figures are our best efforts to create an apples-to-apples comparison across all four clouds. Many people have asked how much of these figures represent, for example, Microsoft Office 365 or Google G Suite, which by the way is now called Workspace. And the answer is, our intention is $0. These are our estimates of worldwide IaaS and PaaS revenue. You know, some have said we're too low, some have said we're too high. Hey, if you have better numbers, please share them, happy to have a look. Now, you may be asking, what are the drivers of these figures and the growth that we're showing here? Well, all four of these companies, of course, are benefiting from an accelerated shift to digital as a result of COVID, but each one has other tailwinds. You know, for example, AWS is capitalizing on its large head start. It's created tremendous brand value. As well, while we estimate that more than 75% of AWS revenue comes from compute and storage, AWS's feature and functional differentiation, combined with its large ecosystem, is very much a driving force of its growth. In the case of Azure, in addition to its captive software application estate, the company on its earnings calls cited strong growth in its consumption-based business across all of its industries and customer segments. As we've said many times, Microsoft makes it really easy for customers to tap into Azure, with a true consumption pricing model, with no minimums and cancel anytime. Those kinds of terms make it extremely attractive to experiment and get hooked. We certainly saw this with AWS over the years. Now, for Google, its growth is being powered by its outstanding technology, and in particular its prowess in AI and analytics. As well, we suspect that much of the losses in Google Cloud are coming from large go-to-market investments for Google Cloud Platform, and they're paying growth dividends. Now, as Tim Crawford said on Twitter, 6 billion, you know, that's not too shabby. Also, Google cited wins at Wayfair and Etsy, which Google is putting forth, in our view, to signal that many retailers might be reluctant to do business with Amazon, which is of course a big retail competitor. These are two high-profile names; we'd like to see more in future quarters, and likely will. Now let's give you another view of this data and paint a picture of how the pie is being carved out in the market. Actually, we'll use bars, because my millennial sounding boards hate pie charts, and I like to pay attention to these emerging voices. At any rate, amongst these four, AWS has more than half of the market. AWS and Azure are well ahead of the rest, and we think they'll continue to hold serve for quite some time. Now, while we're impressed with Alibaba, they're currently constrained to doing business mostly in China, and we think it'll take many years for Baba and GCP to close that gap on the two leaders, if they ever even get there. Now let's take a look at what the customers are saying within the ETR survey data. The chart that we're showing here is the X-Y chart that we show all the time. It's got net score, or spending momentum, on the vertical axis, and market share, or pervasiveness in the survey dataset, on the horizontal axis. Now, on the upper right, you can see the net scores and the number of mentions for each company, and the detail behind this data.
And what we've done here is cut the January survey data of 1,262 respondents, you can see that filtered in there on the left, and we've filtered the data by cloud, meaning the respondents are answering about the companies' cloud computing offerings only. So we're filtering out any of the non-cloud spend. That's a nice little capability of the ETR platform. Azure is really quite amazing to us. It's got a net score of 72.6%, and that's across 572 responses out of the 1,262. AWS is the next most pervasive in the dataset, with 492 shared accounts and a net score of 57.1%. Now, you may be wondering, well, why is Azure bigger in the dataset than AWS, when we just told you that the opposite is the case in the market in the previous slide? And the answer is, this is a survey, and there's a lot of Microsoft out there; they're everywhere. And I have no doubt that the respondents' notion of cloud doesn't directly map into IaaS and PaaS views of the world, but the trends are clear and consistent: Amazon and Azure dominate in this market space. Now, for context, we've included functions, in the form of AWS Lambda, Azure Functions and Google Cloud Functions, because, as you can see, there's a lot of spending momentum in these capabilities, in these services. You'll also note that we've added Alibaba to this chart, and it's got a respectable 63.6% net score, but there are only 11 shared responses in the data. So don't go to the bank on these numbers, but look, 11 data points, we'll take it; it's better than zero data points. We've also added VMware Cloud on AWS to this chart, and you can see that capability, that service, has momentum. And you can see the ones that we've highlighted above the 40% red dotted line; that's where the real action in the market is. So all of those offerings have very strong or strong spending velocity in the ETR dataset. Now, for context, we've put Oracle and IBM in the chart. And you can see they've both got a decent presence in the dataset; they have 132 mentions and 81 responses respectively. So Oracle, they've got a positive net score of 16.7%, and IBM is at a negative 6.2%. Now, remember, this is for their cloud offerings, as the respondents in the dataset see them. So what does this mean? It says that among the 132 survey respondents answering that they use Oracle cloud, 16.7% more customers are spending more on Oracle's cloud than are spending less. In the case of IBM, it says more customers are spending less than spending more. Both companies are in the red zone, and show far less momentum than the leaders. Look, I've said many times that the good news is that Oracle and IBM at least have clouds. But they're not direct competitors of the big four, in our view; they're just not. They have a large software business, and they can migrate their customers to their respective clouds and market hybrid cloud services. Their definition of cloud is most certainly different than that of AWS, which is fine, but both companies use what I call a kitchen sink method of reporting their cloud business. Oracle includes cloud and license support, often with revenue recognition at the time of contract with a term that's renewable, and it also includes on-prem fees for things like database and middleware, and if you want to call that cloud, fine. IBM is just as bad, maybe worse, and includes so much legacy stuff in its cloud number to hide the ball.
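By the way, for anyone new to the methodology, here's a minimal sketch of that net score arithmetic in Python. The response counts are made-up toy numbers, not actual ETR survey data, and the category names simply mirror the wheel chart described a bit later (adding, increasing, flat, decreasing, replacing):

    # Toy sketch of ETR-style net score arithmetic (illustrative numbers only).
    from collections import Counter

    def net_score(responses):
        """Percent of customers spending more (adding or increasing)
        minus percent spending less (decreasing or replacing)."""
        counts = Counter(responses)
        more = counts["add"] + counts["increase"]
        less = counts["decrease"] + counts["replace"]
        return 100.0 * (more - less) / len(responses)

    # 132 toy responses, echoing the Oracle example above in structure only.
    responses = ["increase"] * 50 + ["flat"] * 60 + ["decrease"] * 22
    print(round(net_score(responses), 1))  # 100 * (50 - 22) / 132 -> 21.2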
IBM's cloud number is just not even worth trying to unpack for this episode. I have previously, and frankly, it's just not a good use of time. Now, as I've said before, both companies are in the game and can make good money provisioning infrastructure to support their respective software businesses. I just don't consider them hyperscale-class clouds, which are defined by the big four, and really only those four. And I'm sure I'll get hate mail about that statement, and I'm happy to defend that position, so please reach out. Okay, but one other important thing that we want to discuss is something that came up this week in our Twitter conversation. Here's a tweet from Matt Baker, who heads strategic planning for Dell. He was responding to someone who commented on our cloud data, basically asking, with all that cloud revenue, who took the hit, which pockets did it come out of? And Matt was saying, look, it's coming out of customer pockets, but can we please end this zero-sum-game narrative. In other words, a dollar spent on cloud doesn't translate into a dollar lost from on-prem for the legacy companies. So let's take a look at that. First, I would agree with Matt Baker, it's not a one-for-one swap of spend, but there's definitely been an impact, and here's some data from ETR that can maybe give us some insight. What this chart shows is a cut of 915 hyperscale cloud accounts, so within those big four, and within those accounts we show the spending velocity, or net score, cut within further sectors representative of these on-prem players: servers, storage and networking. So we cut the data on those three segments, and we're looking here at VMware, Cisco, Dell, HPE, and IBM, for 2020 and into 2021. It's kind of an interesting picture. It shows the net scores for the January, April, July and October 2020 surveys and the January 2021 survey. Now, all the on-prem players were of course impacted by COVID. IBM seems to be the counter trend line; not that they weren't impacted, but they have this notable mainframe cycle thing going on, and you know, they're in a down cycle now, so it's kind of the opposite of the other guys in terms of the survey momentum. And you can see pretty much all the others are showing upticks headed into 2021. Cisco, you know, kind of flattish, but stable, and held up a bit. So to Matt Baker's point, despite the 35% or so growth expected for the big four in 2021, the on-prem leaders are showing some signs of positive spending momentum. So let's dig into this a little bit further, 'cause we're not saying cloud hasn't hurt on-prem spending. You know, of course it has. Here's that same picture over a 10-year view. So you're seeing this long, slow decline occur, and it's no surprise, if you think about the prevailing model for servers, storage, and networking, on-prem in particular. Servers have been perpetually underutilized, even with virtualization. You know, with the exception of things like backup jobs, there aren't many workloads that can max out server utilization. So we kept buying more servers to give us performance headroom, and ran at 20, 30% utilization, you know, on a good day. Yes, I know some folks can get up over 50%, but generally speaking, servers are well underutilized. And in storage, my gosh, it's kind of the same story, maybe even worse, because for years it was powered by a mechanical system, so more spindles were required to gain performance, lots of copying going on, lots of, you know, pre-flash waste.
And in networking, it was a story of you've got to buy more ports. You've got to buy more ports. In the case of these segments, customers were essentially forced into this endless cycle: first planning, where they've got to secure the CapEx, then they procure, then they over-provision, and then they manage, you know, ongoing. So then along comes AWS and says, try this on for size, and you can see from that chart the impact of cloud on those bellwether on-prem infrastructure players. Now, just to give you a little bit more insight on this topic, here's a picture of the wheel charts from the ETR dataset, for AWS, Microsoft and Google, and we brought in VMware to compare them. A wheel chart shows the percent of customers saying they'll either add a platform new, that's the lime green; increase spending by more than 5%, that's the forest green; spend flat relative to last year, that's the gray; spend less by more than 5%, that's the pinkish; or leave the platform, that's the bright red. You subtract the red from the green and you get a percentage that represents net score. AWS, with a net score of 60%, is off-the-charts good. Microsoft, remember, this includes the entire Microsoft business portfolio, not just Azure, so it's still really strong. Google, frankly, we'd like to see higher net scores. And VMware, you know, is sort of the gold standard for on-prem, so we included them for reference. They're strong, but notice they've got a much, much bigger flat spending bucket, which is what you would expect from some of these more mature players. Now let's compare these scores to the other on-prem kings. So this is not surprising to see, but the greens go down, and the flats, that gray area, go up compared to the cloud guys. And the red, which is virtually non-existent for AWS, goes into the high teens, with the exception of Cisco, which despite its exposure to virtually all industries, including those hard hit by COVID, shows pretty low red scores. So that's good. And I've got to share one other: look at this wheel chart for Pure Storage. We're really not sure what's happening here, but this is impressive. We're seeing a huge rebound, and you can see we've superimposed a candlestick over it comparing previous quarters' surveys. Look at the huge uptick in the January survey for Pure, that blue line, highlighted in that red dotted ellipse: it jumps to a 63% net score from below 20% last quarter. You know, we'll see. I've never seen that kind of uptick before for an established company, and, you know, maybe it's pent-up demand or some other anomaly in the data. We'll find out when Pure reports in 2021, because remember, these are forward-looking surveys. But the point is, you still see action going on in hybrid and on-prem, despite the freight train that is cloud coming at the legacy players. You know, not that Pure is legacy, but, you know, it's no longer a lanky teenager. And I think the bottom line, coming back to Matt Baker's point, is there are opportunities that the on-prem players can pursue in hybrid and multi-cloud, and we've talked about this a lot: you build an abstraction layer on top of the hyperscale clouds and let them build out their data center presence worldwide and spend on CapEx, because they're going to outspend everybody. And these guys, these on-prem and hybrid and multi-cloud folks, are going to have to add value on top of that.
Now, if they move fast, no doubt they'll be acquiring startups to make that happen. They're going to have to put forth the value proposition and execute on it, in a way that adds clear value above and beyond what the hyperscalers are going to do. Now, the challenge is picking the right spots, moving fast enough, and balancing Wall Street promises with innovation. There's that same old dilemma. Let's face it, Amazon for years could lose tons of money and not get killed in the Street. Google, they've got so much cash they can't spend it fast enough, and Microsoft, after years of going sideways, has finally figured it out, and then some. Alibaba, they're new to our analysis, but it's looking like, you know, the Amazon of China, plus Ant, despite its regulatory challenges with the Chinese government. So all four of these players are in the driver's seat, in our view. And they're leading in not only cloud, but AI, and of course the data keeps flowing into their clouds. So they really are in a strong position. Bottom line is, we're still early into the cloud platform era, and it's morphing from a collection of remote cloud services into this ubiquitous, sensing, thinking, anticipatory system that's increasingly automated and working towards full automation. It's intelligent, and it's hyper-decentralizing toward the edge. One thing's for sure: the next 10 years are not going to be the same as the past 10. Okay, that's it for now. Remember, I publish each week on wikibon.com and siliconangle.com. These episodes are all available as podcasts; just search for Breaking Analysis podcast. You can always connect on Twitter, I'm @dvellante, or email me at david.vellante@siliconangle.com. I love the comments on LinkedIn, and of course in Clubhouse, the new social app. So please follow me so that you can get notified when we start a room and riff on these topics. And don't forget to check out etr.plus for all the survey action. This is Dave Vellante for theCUBE Insights powered by ETR. Be well, and we'll see you next time. (upbeat music)

Published Date : Feb 5 2021

Breaking Analysis: Cloud Momentum & CIO Optimism Point to a 4% Rise in 2020 Tech Spending


 

>> From theCUBE studios in Palo Alto and in Boston, bringing you data-driven insights from theCUBE and ETR, this is Breaking Analysis with Dave Vellante. >> New data suggests that tech spending will be higher than we previously thought for 2021. COVID learnings, a faster-than-expected vaccine rollout, productivity gains in the last 10 months, and broad-based cloud leverage lead us to raise our outlook for next year. We now expect a three to 5% increase in 2021 technology spending, roughly double our previously forecasted growth rate of 2%. Hello everyone, and welcome to this week's Wikibon Cube Insights, powered by ETR. In this Breaking Analysis, we're going to share new spending data from our partners at ETR and take a preliminary look at which sectors and which companies are showing momentum heading into next year. Let's get right into it. The data is pointing to a strong 2021 rebound. The latest survey from ETR and the information from theCUBE community suggest that the accelerated pace of the vaccine rollout, pent-up demand for normalcy, and learnings from COVID will boost 2021 tech spending higher than previously anticipated. Now, a key factor we've cited is that the forced march to digital transformation due to the pandemic created a massive proof of concept for what works and what doesn't in a digital business. CIOs are planning to bet on those sure things to drive continued productivity improvements and new business opportunities. Now, speaking of productivity, nearly 80% of respondents in the latest ETR survey indicate that productivity either stayed the same or improved over the past three months, and of those, the vast majority, more than 80%, cited improvements in productivity. This has been a common theme throughout the year. As well, the expectation among CIOs is that many workers will return to the office in the second half of the year, which we expect will drive new spending on the infrastructure needs of company HQs, which have been neglected over the past 10 months. Now, despite the expectation that many workers will return to the office, 2020 has shown us that working remotely, hey, it's here to stay, and a much larger number of employees are going to be permanently working remotely than pre-pandemic. ETR survey data shows that that number is going to approximately double over the long term. We'll look at some of that specific data. In addition, cloud computing became the staple of business viability in 2020. Those that were up the cloud adoption ramp, well, they benefited greatly; those that weren't, well, they had to learn fast. Now, along with remote work, cloud necessitated new thinking around network security, and as we've reported, identity access management, endpoint security and cloud security were the beneficiaries. Companies like Okta, CrowdStrike, Zscaler and a number of others continue to ride this wave. Larger established security companies like Cisco, Palo Alto Networks, F5, Fortinet and others have major portions of their business benefiting from the tailwinds of the shift in network traffic as a result of cloud and remote work. Now, despite all the momentum in the market and the expected improvements in 2021, these tailwinds are not expected to be evenly distributed, far from it. We think Q4 is going to remain soft relative to last year, and Q1 2021 is going to be flat, maybe up slightly. Remember, the COVID impact was definitely felt in March of this year.
So based on the earnings that we saw, there may be some upside in Q1, given that organizations are still being cautious in Q4, and really there's still some uncertainty in Q1. Let's look at some of the survey responses and you'll see why we're more optimistic than we've previously reported. This chart shows the responses to key questions around spending trajectories from the March, June, September, and December surveys of this year. Now, it's no surprise that there's been little change in remote workers and limiting business travel. But look at the other categories. We're seeing a dramatic reduction in hiring freezes. The percentage of companies freezing new IT deployments continues to drop throughout the year, and conversely, the percentage of companies accelerating new IT deployments is sharply up, to 34% from the March low of 12%. And look at the headcount trends: the percentage of companies instituting layoffs continues its downward trajectory, while accelerated hiring is now up to 17%. So there's a lot to be excited about in these results. Now let's look at the remote worker trend. How do CIOs see that shift in the near to midterm? This chart shows the work-from-home data, and it's amazingly consistent with the September survey drill-down. You can see CIOs indicate that, on average, 15 to 16% of workers were remote prior to the pandemic, and that jumped up to 72 to 73% currently, and is expected to stay in the high fifties until the summer of 2021. Thereafter, organizations expect that the number of employees that work remotely on a permanent basis is going to more than double, to 34%, long term. By the way, I've talked to a number of executives, CEOs, CIOs, and CFOs, that expect that number to be even higher, especially in the technology sector. They expect more than half of their workers to be remote, and are looking to consolidate facilities costs to save money. As we've said, cloud computing has been the most significant contributor to business resilience and digital transformation this year. So let's look at cloud strategies and see how CIOs expect those to evolve. This chart shows responses to how organizations see multi-cloud evolving. It's interesting to note the ETR call-out, which concludes that the narrative around multi-cloud is real, and it is. But I want to talk to you about a flip side to this notion, in that many customers have, or are planning to, increasingly concentrate workloads in the cloud. This actually makes some sense. Sure, virtually every major company uses multiple clouds, but more often than not, they concentrate work on a primary cloud. CIO strategies are generally not evenly distributed across clouds; the data shows that this is the case for less than 20% of the respondents. Rather, organizations are typically going to apply an 80/20 or a 70/30 rule to their multi-cloud approach, meaning they pick a primary cloud on which most work is done, and then they use alternative clouds as either a hedge, or maybe for specific workloads, or maybe even for data protection purposes. Now, if you think about it, optimizing on a primary cloud allows organizations to simplify their security and governance and consolidate their skills. At this point in the cloud evolution, it seems CIOs feel there's more value that's going to come from leveraging a primary cloud to change their operating models than from broadly spreading the wealth to reduce risk, or maybe cut costs, or maybe even to tap specialized capabilities.
What's more, thinking about AWS and Microsoft respectively, each can make a very strong case for mono cloud. AWS has more features than any other cloud, and as such can handle most workloads. Microsoft can make a similar argument for its customers that have an affinity for, and a large estate of, Microsoft software. The key for multi-cloud, in our view, will be the degree to which technology vendors can abstract the underlying cloud complexity and create a layer that floats above the clouds and adds incremental value. Snowflake's Data Cloud is one of the best examples of this, and we've covered that pretty extensively. Now, clearly VMware and Red Hat have aspirations at the infrastructure layer in a similar fashion. Pure Storage and NetApp are a couple of the largest storage players with similar visions, and then Qumulo and Clumio are two other examples with promising technologies, but they have a much smaller install base. Take a look at Cisco, Dell, IBM and HPE. They have a lot to gain and a lot to lose in this cloud game, so multi-cloud is an imperative for these leaders, but for them it's much more complicated, because of the complexity and vastness of their portfolios. Notably, Dell has VMware, and IBM of course has Red Hat, which are key assets that can be leveraged for this multi-cloud game. HPE has a channel and a large install base, but all of these firms have to spread R&D much more thinly than some of the other companies that we mentioned, for example. The bottom line is that multi-cloud has to be more than just plugging in and operating well on any of the clouds, which, by the way, is mostly where we are today. It requires an incremental value proposition that solves a clear problem, and at the same time runs efficiently, meaning it takes advantage of cloud-native services at scale. What sectors are showing momentum heading into 2021, and who are some of the names that are looking strong? We've reported a lot that cloud, containers and container orchestration, machine intelligence and automation are by far the hottest sectors, the biggest areas of investment with the greatest spending momentum. Now, we measure this, in ETR parlance, remember, by net score. But here's the good news: almost every other sector in the ETR taxonomy, with the notable exception of IT outsourcing and IT consulting, is showing positive spending momentum relative to previous surveys this year. Yeah, maybe that's not a shock, but it appears that the tech spending recovery will be broad-based. It's also worth noting that there are several vendors that stand out, and we show a number of them here. CrowdStrike and Microsoft have had consistent performance in the dataset throughout this year. Okta, we called out those guys last year, and they've clearly performed, as you can see in their earnings reports. Pure Storage, interestingly, big acceleration and a turnaround from last quarter in the dataset. And of course Snowflake has been off the charts, as we've reported many times. These guys are all seeing highly accelerated momentum. UiPath just announced its intent to IPO. AWS, Google, Zscaler, SailPoint, ServiceNow, and Elastic all continue to trend up. And so there are some real positives that we're looking at. Remember, the ETR surveys are forward-looking, so we'll see as we catch up next quarter.
Now, before we wrap, I want to say a few words on security, and maybe it's a bit of a non sequitur here, but I think it's relevant to the trends that we've been discussing, especially as we talk about moving to the cloud. And as you know, we've reported many times on the security space, basically updating you quarterly with our scenarios and the spending and the technology trends, and highlighting our four-star companies. Four-star companies in security are those with both momentum and significant market presence. And last year we put CrowdStrike, Okta and Zscaler, and some others, on the radar, and we've closely tracked the cyber business of larger companies with a security portfolio, like Palo Alto and Cisco, and more recently VMware, which has made some acquisitions. Now, the government hack that became news this week really underscores the importance of security. It remains the most challenging area for organizations, because, well, failure's not an option, skills are short, tools are abundant, the adversaries are very well funded and extremely capable, yet failure is common, as we saw this week. And there's a misconception that cloud solves the security problem, and it's important to point out that it does not. Cloud is a shared responsibility model, meaning the cloud provider is going to secure the infrastructure, for example, but it's up to you as the customer to configure things properly and deal with application security. It's ultimately on you. And the example of S3 is instructive, because we've seen a number of S3 breaches over the years where the customer didn't properly configure the S3 bucket. We're talking about companies like Honda and Capital One, not just small businesses that don't have the SecOps resources. And generally it was because a non-security person was configuring things. Maybe they were developers who were not focused on security, and perhaps permissions were set too broadly and access was given to far too many people. Whatever the issue, it took some breaches and subsequent education to increase awareness of this problem and tighten it up. We see some similar trends occurring with new workloads, especially in cloud databases. It's becoming so easy to spin up new data warehouses, for example, and we believe that there are exposures out there due to the lack of awareness or inconsistent corporate governance being applied to these new data stores. As well, even though areas like threat intelligence and database security are important, SecOps budgets are stretched thin, and when you ask companies where the priorities are, these fall lower down the list; these areas specifically have taken a back seat to endpoint, identity and cloud security. And we bring this up because it's a potential blind spot, as we saw this week with the US government hack. It was stealthy; it wasn't detected for many, many months, who knows, maybe even years. And not to be a buzzkill, but the point is, cloud enthusiasm has to be commensurate with security vigilance. Enough preaching, let's wrap up here. As we entered 2020, we said the cloud was going to be the force that drove innovation, along with data and AI. And as we look in the rearview mirror and put 2020 behind us, I know many of you want to do that, it was the cloud that enabled businesses to not only continue to operate, but to actually increase productivity. Nonetheless, we still see IT spending declines of four to 5% this year, with an expectation of a tepid Q4 relative to last year.
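On that S3 configuration point, here's a minimal sketch of the kind of guardrail a SecOps team might run. It assumes the boto3 library with AWS credentials already configured, and it simply flags any bucket that doesn't have a full public-access block in place; the bucket_is_locked_down helper is ours, purely illustrative:

    # Hypothetical guardrail: flag S3 buckets without a full public-access block.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def bucket_is_locked_down(bucket):
        try:
            cfg = s3.get_public_access_block(Bucket=bucket)[
                "PublicAccessBlockConfiguration"]
        except ClientError:
            return False  # no public-access block configured at all
        return all(cfg.get(k) for k in (
            "BlockPublicAcls", "IgnorePublicAcls",
            "BlockPublicPolicy", "RestrictPublicBuckets"))

    for b in s3.list_buckets()["Buckets"]:
        if not bucket_is_locked_down(b["Name"]):
            print("review bucket:", b["Name"])

It won't catch every exposure, application-level policy still matters, but it's the sort of cheap, automated check that closes the gap between cloud enthusiasm and security vigilance.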
Back to the spending outlook: we see Q1 slowly rebounding, and kind of a swoosh, let me try that again, recovery in the subsequent quarters, with tech spending rebounding in 2021 to a positive three to 5%, let's call it 4%. Now, supporting our scenario, the pandemic created a giant Petri dish for digital, and we see some real successes and learnings that organizations will apply in 2021 to bet on sure things. These are cloud, containers, the AI, ML, machine intelligence pieces, and automation, for sure, along with upticks for virtually every other sector of technology, because spending has been so depressed. The two exceptions are outsourcing and IT consulting and related services, which continue to be a drag on overall spending. Priorities must be focused on security and governance, and further improvements in applying corporate edicts in a cloud world. We also see new data architectures emerging, where domain knowledge becomes central to data platforms. We'll be covering this in more detail on top of the work that we've already done in this area. Now, automation is not only an opportunity, it's become a mandate. Yes, RPA, but also broader automation agendas, beyond point tools. And importantly, we're not talking about paving the cow path here by automating existing processes. Rather, we're talking about rethinking processes across the entire organization for a new digital reality, where many of these processes are being invented. The work of Erik Brynjolfsson and Andrew McAfee on the second machine age was published back in 2014, and the conclusions they drew are becoming increasingly important in the 2020s. Meaning that, look, machines have always replaced humans throughout time, but for the first time in history, it's happening for cognitive functions, and a huge base of workers is going to be, or is being, marginalized unless they're retrained. Education and public policy that support this transition are critical. And I, for one, would like to see a much more productive discussion that goes beyond the cult of breaking up big tech. Rather, I'd like to see governments partner with big tech to truly do good and help drive the reskilling of workers for the digital age. Now, cloud remains the underpinning of the digital business mandate, but the path forward isn't always crystal clear. This is evidenced by the virtual dead heat between those organizations that are consolidating workloads on a primary cloud versus those that are hedging bets on a multi-cloud strategy. One thing is clear: cloud is the linchpin for our growth scenarios, and will continue to be the substrate for innovation in the coming decade. Remember, these episodes are all available as podcasts wherever you listen; all you've got to do is search Breaking Analysis podcast, and please subscribe to the series, appreciate that. Check out ETR's website at etr.plus. We also publish a full report every week on wikibon.com and siliconangle.com. Get in touch with me at david.vellante@siliconangle.com, or you can DM me @dvellante. And please, by all means, comment on our LinkedIn posts. This is Dave Vellante for theCUBE Insights powered by ETR. Have a great week everybody, Merry Christmas, happy Hanukkah, happy Kwanzaa, or happy whatever holiday you celebrate. Stay safe, be well, and we'll see you next time. (upbeat music)

Published Date : Dec 18 2020


Insights for All


 

>>Welcome back for our last session of the day: how to deliver career-making business outcomes with Search and AI. We're very lucky to be hearing from Canadian Tire, one of Canada's largest and most successful retailers, which has empowered 4,500 employees to maximize the value of data with self-service insights. Joining us today we have Yaro Baturin, who is the manager of merch analytics and planning support at Canadian Tire, and also Andrea Frisk, who is the engagement manager for ThoughtSpot. Yaro, Andrea — thanks so much for being here, and with that, I'll pass the mic to you guys. >>Thank you for having us. I think I'll start with an introduction of who I am, what I do at Canadian Tire, and what Canadian Tire is all about. As a manager of merch analytics at Canadian Tire, I support the merchant organization with reporting tools and the BI platform to enable decision making on a day-to-day basis. What is Canadian Tire? Canadian Tire is one of the largest retailers in Canada, serving Canadians with a number of lines of business spanning the automotive, fixing, living, playing and seasonal (S&G) departments. We have a number of banners, including Sport Chek, Mark's, Party City and Pro Hockey Life, covering more than 1,700 locations. So as an organization we've got a vast variety of different data, whether it's product or loyalty. Now, as time goes on, the number of asks, the number of data points and the complexity of the analysis keep increasing, and traditional analytical tools such as Excel and Microsoft Access do a fine job but start hitting their limitations. So we started on a journey of exploring which other BI platforms would be suitable for our needs. The criteria we thought about as we started on that journey were to make sure we enable customization as well as the democratization of data. What does that mean? It means we wanted to ensure that each one of the end users has the ability to create their own versions of a report while having consistency from the data standpoint. We also wanted to ensure that they're able to create their ad hoc search queries and draw insights based on the desired business needs, as each one of our lines of business and each one of our departments is quite unique in its nature. And this is where ThoughtSpot comes into play: it checked off all the boxes. As current customers, or as potential customers, you will discover that this is the tool that allows that ad hoc search ability within a matter of seconds, and the ability to visualize the information and create those curated pinboards for each one of the business units, depending on what the needs are. Andrea will talk a little bit more about how we gained adoption, what the usage was like, and how we implemented the tool successfully in the organization. >>Okay, so I actually used to work for Canadian Tire, and during that time I helped to build training and engage users to really kick-start our use cases, and the ongoing process of adopting ThoughtSpot throughout Canadian Tire. One of the reasons we moved into using ThoughtSpot was that there was a need to evolve in order to see the wealth of data that we had coming in. So the existing reporting — and this is the sort of standard ThoughtSpot fix — it brings the data to everyone and makes it more accessible, so you get more out of your data.
So we wanted to provide users with the ability to customize what they could see and personalize the information, so that they could get their specific business requirements out of the data rather than relying on weekly, monthly or quarterly reporting that was usually fairly generic, without the ability to deep dive in. This gave the users the agility to optimize their campaigns, optimize product merchandising — see where products are, or where there may be supply chain gaps — and really bring trillions of rows to become accessible to the Canadian Tire user base. I think that's the slide. >>That's the slide. So, as Andrea talked about the business use of the tool, let's talk a little bit about how we set it up and the wonderful journey of how it's evolved. We first implemented version 5.3 of ThoughtSpot on the Falcon server, and we've been adding horsepower to it over time. Now, what I want to stress is the importance of the very first data set that goes into the tool, to actually engage the users, gain adoption, and make sure there is no argument about whether the tool is accurate or not. What we started with is a KPI mart layer with all the major metrics that we have and all the available permutations and combinations of the dimensions, whether it's a calendar dimension, a product dimension or, let's say, a customer attribute. Now, as we started with that data set, we wanted to make sure we had the ability to add in dimensions. So as we're implementing the tool, we're starting to add in more dimension tables to satisfy the needs of our clients, if you want to call them that, as they want to evolve their analytics. We started adding in some of the store attributes; we started adding in some of the product attributes — and when I refer to product attributes, that includes costs, prices, and some of the strategic internal pieces that we're thinking about. Now, the comprehensive mart contains, right now in our instance, close to five billion records. This is where it becomes the one source of truth for people querying information against. So as they go in and query ThoughtSpot, they're really only querying one source of data — one source of truth. It became apparent over time, obviously, that more metrics are needed, and they might not all be set up in that particular mart. That's when we went on the journey of implementing some of the new worksheets, or some of the new data sets, particularly focused on the forward-looking pieces. And that's where it becomes important to say: this is how you gain the interest, and keep the interest, of the public. You're not just implementing a number of data sets all at once and then leaving the users be; you're implementing in pieces and stages. You're keeping the tool relevant; you're keeping the needs of the public in mind. Now, as you can imagine, on the Falcon server piece, adding in horsepower capacity can become challenging the more billions of records you add, so we're actually in the middle of transitioning our environment to Azure and Snowflake, so that we can connect to and embrace the capability of ThoughtSpot Cloud.
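As an illustration of the KPI mart idea described above — one fact table that every search resolves against, with dimension tables joined in as they are added — here is a minimal pandas sketch. It is purely illustrative: the table and column names are invented for the example and are not Canadian Tire's actual schema.

    import pandas as pd

    # Toy stand-ins for the mart: one fact table, dimensions joined in over time.
    fact = pd.DataFrame({"store_id": [1, 1, 2], "sku": ["A", "B", "A"],
                         "week": ["2020-W40"] * 3, "sales": [100.0, 40.0, 75.0]})
    dim_store = pd.DataFrame({"store_id": [1, 2], "province": ["ON", "BC"]})
    dim_product = pd.DataFrame({"sku": ["A", "B"], "category": ["Cycling", "Kitchen"]})

    # Every query resolves against the same joined view -- one source of truth.
    mart = fact.merge(dim_store, on="store_id").merge(dim_product, on="sku")
    print(mart.groupby(["province", "category"])["sales"].sum())

The design point the sketch tries to capture is that new dimension tables can be merged in over time without touching the fact table, which mirrors the staged rollout of store and product attributes described above.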
That move to the cloud is where I'm looking forward to 2021: I truly believe it will enable us to increase the speed of adoption, increase the speed of getting insights out of the tool, and scale with regard to the new data sets we're thinking about implementing as we continue our ThoughtSpot journey. >>Okay, so, how we drove adoption to 4,500-plus users. When we first started to approach our use case with the merchants within Canadian Tire, we had meetings with the users our use cases were going to be built around, and found out: what are they searching for, where are they typically looking, and what existing reports are available to them? We sought out: what are those things where you're pulling this on your own, or someone else is pulling this data, because it's not accessible yet? We really used that as our foundation to determine, one, what data we needed to initially bring into the system, but also to create those launchpad pinboards that had the base information the users were going to need — so that we could, twofold, make it easy for them to adopt the tool and also quickly start to deactivate or discontinue those reports: these are now only available in ThoughtSpot. Because, with the formatting within ThoughtSpot around dates, it's really easy to make this year's report, last year's report, etcetera, and just have everything roll over every month or quarter. So that was some of the pre-work foundation when we originally did it. But really, it's been a lot of training — a lot of training. We conducted a lot of in-person training, obviously pre-COVID. We trained the group that we targeted, which was the merchants and all of the surrounding support groups — we had planners going in and training as well — so that everyone who was closely connected to the merchants had an idea of what ThoughtSpot was, how to use it, and where the reports were. We rolled it out that way, and then it started to spread like wildfire. The merchants started to engage with supply chain to have conversations, or the merchants were engaging with vendors to have negotiations about pricing, and they're creating these reports, getting access to the information so quickly, and sharing it out — so we had other groups just coming to us asking, "How do I get into ThoughtSpot? How can I get in?" On top of those groups, we also sought out other heavy analytics groups, such as supply chain, where we felt they could have the same benefits if they onboarded into ThoughtSpot with their data as well. And then we just continued to evolve the training rollout. We continued to engage with the users, so we had a newsletter, briefly, just to keep informing users of the new data coming in, or when we actually upgraded our system — here are the new features you'll start seeing. We did virtual trainings, maintained an FAQ document with the incoming questions from the users, and then eventually evolved into self-guided learning, so that users who were coming into a group where maybe we'd already done a full rollout could come in, have the opportunity to learn how to use ThoughtSpot, have examples that were relevant to the business, and really get started. So then each use case, after our initial one, started to build into a formula of the things that we needed to have. So: you need to understand the data.
Having SMEs ready, and having the database and worksheets built out, sort of became the step-by-step path to drive adoption. From an implementation timeline perspective, it took about two months, and about half of that was Canadian Tire figuring out our security, how to get the data in, and the time needed to set up the environment and get on Falcon. After that initial two months, for each use case that we come through, we've generally got users trained and SMEs set up within about 2 to 3 weeks after the data is ingested. Obviously, once Snowflake is set up and the data starts to feed in, you're really just looking at the 2 to 3 weeks, because the data is easily connected in. >>All right, let's talk about some of the use cases. So, we started with what data we've implemented; Andrea touched upon what user training looked like and what the curated pinboards were. Now let's talk a little bit about use cases and how we actually leverage ThoughtSpot to gather the insights. The very first one is ultimately the benefit of the tool to the entire organization: real-time insights. To reiterate what Andrea said, we first implemented the tool with our buyers. They're the nucleus of any retail organization, as they work with everybody within the company, and as the buyers, it's their responsibility to ensure both the procurement and the sales channels stay afloat at the end of the day. So they need information on a regular basis; they need it fast; they need it timely; and they need it in a fashion they choose to digest it in. Not every business is the same; not every individual is the same; they consume, digest and analyze information differently — and that's what ThoughtSpot allows you to do, whether it's the search or whether it's a customized pinboard. Now, supply chain: as Andrea mentioned, I work a lot with supply chain. What is the goal of supply chain? To receive product and to be able to ship that product to the stores. Now, as our organization has been growing and is doing extremely well — we've actually published Q3 results recently — the aspect of prioritization at the DC level becomes very important, and what drives some of that prioritization is the analysis around what the upcoming sales would be for specific products and specific categories. And that's where, again, ThoughtSpot is one of the tools we've utilized recently to set our prioritization logic from both the inbound and outbound aspects, because it gives you the most recent results, and the most granular results, depending on the business problem you're trying to tackle. Now, let's chat a little bit about the COVID-19 response, because this one is an extremely interesting case. As the pandemic hit back in March, everyday life at Canadian Tire became, as our executives referred to it, "business unusual." Under business unusual, the speed and the intensity of the insights and the analytics have grown exponentially, driven by the fact that we were trying to ensure that we have the right selection of products for our Canadian customers — because that's ultimately the bread and butter of all retailers: the customers.
So ThoughtSpot allowed us to see early trends in both sales and inventory patterns — whether we were stocking out of some products in specific stores or provinces, or whether we saw an uplift in different lines of business depending on the regionality. As the pandemic hit, for example, gyms closed and restaurants closed. Since Canadian Tire carries a wide variety of lines of business, we offer a wide selection of exercise equipment and accessories, cycling products, as well as kitchen appliances and kitchen accessories. All of those items started growing exponentially, and in certain areas more than others. And this is where ThoughtSpot comes into play. A typical analysis of what the regionality of sales has been over the last couple of days — which is a lifetime in pandemic terms — could have taken days or weeks for analysts to cobble together in an Excel spreadsheet. Meanwhile, it takes a couple of seconds to query ThoughtSpot and set up a pinboard that can be shared with a wide variety of individuals, rather than forwarding that one Excel spreadsheet that gets manipulated every single time — and then you don't get the right insight. So from the merch, supply chain and COVID-response aspects, ThoughtSpot has been one of those blessings and one of those amazing tools for improving the speed of insights, the speed of analytics, and the speed of decision making that's ultimately impacting the consumer at the store level. So, Andrea talked about the 4,500 users that we have. That number is cool, but what I've recently liked to focus on — Andrea and I are laughing, because I think the last time we spoke at a larger forum with the ThoughtSpot community, we had only 500 users. That was at the beginning >>of the year, in February; we were aiming to have, like, 1,000. >>Exactly. So, mission accomplished — we've got 4,500 employees now. Everybody asks me: that's a big number, but how many people actually log in on a weekly or daily basis? I'm more interested in that statistic. Lately, we've had more than 400 users on a weekly basis. What's been cool lately is the exponential growth of ad hoc searches: throughout October we reached 75,000 ad hoc queries in our system and about 13,000 pinboard views. Why is that significant? We started off, in January of 2020, as Andrea refers to it, with about 40-45,000 ad hoc queries a month. So again, that was cool, but at the end of the day we were able to double that amount as more people migrate to ad hoc searches from pinboard views — and that's a tremendous phenomenon, because that's what ThoughtSpot is all about. So, I touched upon exercise and cycling: these are our quarterly results for Q2, which showed tremendous growth that we did not plan for, and that we were able to achieve with, ultimately, the individuals who work throughout the organization — whether it's the merch organization or the supply chain side of the business — coming together and utilizing a BI platform with tools such as ThoughtSpot. We can see triple-digit growth results. So, what's next for us? Ad hoc searches are fantastic, but I would still like to get to more than 1,200 people in the tool on a weekly basis. The cool number to me is if all of our lifetime users were getting into the tool on a weekly basis.
That would be cool. And what's proven to be true is that the only way to achieve it is to keep surprising and delighting them — surprising and delighting them with the functionality of the tool and with more of the relevant content, ultimately by adding in more data. That is possible through ETLs, and it's possible through pulling that information manually, but the manual route is expensive — expensive not in the sense of monetary value, but in terms of size and time, all of those aspects. So what I'm looking forward to is migrating our platform to Azure and Snowflake and being able to scale our insights accordingly: adding more data, adding more insights, more individual worksheets and data sets for people to query against. That helps each individual learn and get insights, and it helps my team in particular be more well versed in the data that exists throughout the organization. And now Andrea can touch upon how we scale it further and how each individual can become better with this wonderful tool. >>Yeah, so, as Yaro mentioned, the ad hoc searches going up is a little internal victory, because our starting platform had really been to build the pinboards to replicate what the users were already expecting. That was how we easily got people in, and then we just cut off the tap to whatever the previous report was. It gave them a way to get into the tool and understand the information. Now that they're using ad hoc, it really means they understand the tool — they have the data literacy to access the information and use it how they need. So that's a really cool piece that worked for Canadian Tire, a very report-oriented and report-heavy organization; it was a good starting platform, and seeing those ad hoc searches go up is great. One of the ways we scaled out of our initial group — and I kind of mentioned this earlier; I sort of stepped on my own toes here — is that once it was a proven success with the merchants, and it started to spread through word of mouth, we sought out the analyst teams. We really just kept driving the insights, finding the data and learning more about the pieces of the business. As much as anyone would like to think they know everything about everything, you only know what you know. So you have to continue to cultivate the internal champions to really keep growing the adoption. Find the SMEs who are excited about the possibility of using ThoughtSpot and what they can do with it. You need to find those people, because they're the ones who are going to be excited to have this rapid access to the information, and it takes less time to show a user how to access something in ThoughtSpot than it would take to run the report for them. Because, as Yaro mentioned, we had basically hit a curiosity tax: you didn't want to search for things, or ask questions of the data, because it was so cumbersome — it took too much time to get the data, and if you didn't know exactly what you were looking for, it was worse. You wouldn't run a query and think, "Oh, that's interesting, let me now run another query on all that information to get more data" — it just wasn't time-effective or resource-effective. So scaling the adoption is really about cultivating those people who are really into it as well.
From a personal development perspective, as a user — I mean, who doesn't like being the smartest person in the room? ThoughtSpot provides that possibility, and it makes it easier for you to get recognized for delivering results and actionable insights, and for driving the business forward. So be that all-star, be the trailblazer with all the answers, find out how much you really like helping the organization realize the power of ThoughtSpot — and maybe make it into a career. >>Amazing. I love that you've joined us, Andrea — such an amazing career trajectory, no bias at all on my part. So, heaps of great information there. Thank you both so much for sharing your story on driving such amazing adoption and the impact you've been able to make at your organization. We've got a couple of minutes remaining, so just enough time for questions. Andrea, our first question is for you: from your experience, what is one thing you would recommend to new ThoughtSpot users? >>Yeah, I would say be curious and creative. There's one phrase we used a lot in training, which was "just mess around in the tool." It became a catchphrase, and it is really true: just try it and use it. You can't break it, so just play around. The only limitation on what you're going to find is your own creativity. And the last thing I would say is: don't get trapped by trying to replicate things exactly as they were. "This is how we've always done it" isn't necessarily the best move, and it isn't necessarily going to find new insights. The change forces you to look at things from a different perspective and find new value in the data. >>Yeah, absolutely — sage advice there. And another one here for Yaro. Our theme for Beyond this year is "analytics meets cloud, open for everyone." In your experience, what does that mean for you? >>Wonderful question. Listen, Angela — to me, in short, it means scale, and it means turning a no into a yes. Now I'm going to elaborate — Andrea is laughing at me a little bit. That's right, >>I can talk >>fancy too. Okay, so, scale: from the scale perspective, cloud, as I touched upon throughout our conversation and our presentation, enables your ability to store and access more data without necessarily employing a number of ETL developers and working through a number of security aspects across different data sources. Now, turning a no into a yes — what does that mean? With more data and more scalability, the analytics possibilities become infinite. Throughout my career at Canadian Tire and other organizations, if you don't have access to data, or you do not have the necessary granularity, you always tell individuals: "No, it's not possible; I'm not able to deliver that result." And quite often that becomes the norm — saying no becomes the norm. I think what we're all striving towards here on this call, as part of the conference, is turning that no into a yes, and then making yes the new standard, the new norm, as we gain more access to the data and more access to the insights. So that would be my answer. >>Love it. Amazing. Well, that brings an end to this session. Thank you, everyone, for joining us today, and that wraps up this stream.
Don't miss the upcoming product roadmap — we'll be sticking around to speak to some of the speakers you heard earlier today in our Meet the Experts round table, and you can absolutely continue the conversation with this live Q&A. You've got an opportunity here to ask the questions that maybe keep you up at night. Stay tuned for Meet the Experts: secrets to scaling analytics adoption, after the product roadmap session. Thanks, everyone, and thank you again for joining us. Guys, appreciate it. >>Thank you. Thanks. Thanks.

Published Date : Dec 10 2020

IO TAHOE EPISODE 4: DATA GOVERNANCE


 

>>From around the globe, it's theCUBE, presenting Adaptive Data Governance, brought to you by Io-Tahoe.
>>Ah, >>lot of a lot of frameworks these days are hardwired, so you can set up a set of business rules, and that set of business rules works for a very specific database and a specific schema. But imagine a world where you could just >>say, you >>know, the start date of alone must always be before the end date of alone and having that generic rule, regardless of the underlying database and applying it even when a new database comes online and having those rules applied. That's what adaptive data governance about I like to think of. It is the intersection of three circles, Really. It's the technical metadata coming together with policies and rules and coming together with the business ontology ease that are that are unique to that particular business. And this all of this. Bringing this all together allows you to enable rapid change in your environment. So it's a mouthful, adaptive data governance. But that's what it kind of comes down to. >>So, Angie, help me understand this. Is this book enterprise companies are doing now? Are they not quite there yet. >>Well, you know, Lisa, I think every organization is is going at its pace. But, you know, markets are changing the economy and the speed at which, um, some of the changes in the economy happening is is compelling more businesses to look at being more digital in how they serve their own customers. Eh? So what we're seeing is a number of trends here from heads of data Chief Data Officers, CEO, stepping back from, ah, one size fits all approach because they've tried that before, and it it just hasn't worked. They've spent millions of dollars on I T programs China Dr Value from that data on Bennett. And they've ended up with large teams of manual processing around data to try and hardwire these policies to fit with the context and each line of business and on that hasn't worked. So the trends that we're seeing emerge really relate. Thio, How do I There's a chief data officer as a CEO. Inject more automation into a lot of these common tax. Andi, you know, we've been able toc that impact. I think the news here is you know, if you're trying to create a knowledge graph a data catalog or Ah, business glossary. And you're trying to do that manually will stop you. You don't have to do that manually anymore. I think best example I can give is Lester and I We we like Chinese food and Japanese food on. If you were sitting there with your chopsticks, you wouldn't eat the bowl of rice with the chopsticks, one grain at a time. What you'd want to do is to find a more productive way to to enjoy that meal before it gets cold. Andi, that's similar to how we're able to help the organizations to digest their data is to get through it faster, enjoy the benefits of putting that data to work. >>And if it was me eating that food with you guys, I would be not using chopsticks. I would be using a fork and probably a spoon. So eso Lester, how then does iota who go about doing this and enabling customers to achieve this? >>Let me, uh, let me show you a little story have here. So if you take a look at the challenges the most customers have, they're very similar, but every customers on a different data journey, so but it all starts with what data do I have? What questions or what shape is that data in? Uh, how is it structured? What's dependent on it? Upstream and downstream. Um, what insights can I derive from that data? And how can I answer all of those questions automatically? 
So if you look at the challenges for these data professionals, they're either on a journey to the cloud, maybe they're doing an Oracle migration, or maybe they're making some data governance changes — and it's about enabling all of this. So, with those challenges in mind, I'm going to take you through a story here. I want to introduce Amanda. Amanda could be anyone in any large organization. She's looking around, and she just sees stacks of data: different databases — the ones she knows about, the ones she doesn't know about but should — various different kinds of databases. And Amanda is tasked with understanding all of this so that she can embark on her data journey program. So Amanda goes through, thinking: great, I've got some handy tools; I can start looking at these databases and getting an idea of what we've got. Well, as she digs into the databases, she starts to see that not everything is as clear as she might have hoped. Property names, or column names, have ambiguous names like attribute_1 and attribute_2, or maybe date_1 and date_2. So Amanda starts to struggle: even though she's got tools to visualize and look at these databases, she knows she's got a long road ahead — and with 2,000 databases in her large enterprise, it's going to be a long journey. But Amanda's smart, so she pulls out her trusty spreadsheet to track all of her findings, and for what she doesn't know about, she raises a ticket, or tries to track down the owner to find out what the data means, tracking all this information. Clearly, this doesn't scale well for Amanda. So maybe the organization gets ten Amandas, to divide and conquer that work — but even that doesn't work that well, because there are still ambiguities in the data. With Io-Tahoe, what we do is actually profile the underlying data. By looking at the underlying data, we can quickly see that attribute_1 looks very much like a U.S. Social Security number, and attribute_2 looks like an ICD-10 medical code. We do this by using ontologies, dictionaries and algorithms to help identify the underlying data and then tag it. Key to this automation is being able to normalize things across different databases, so that where there are differences in column names, I know that the columns in fact contain the same data. And by going through this exercise with Io-Tahoe, not only can we identify the data, but we also gain insights about the data. For example, we can see that 97% of the time, the column named attribute_1 that's got U.S. Social Security numbers holds something that looks like a Social Security number — but 3% of the time it doesn't quite look right. Maybe there's a dash missing, maybe a digit was dropped, or maybe there are even stray characters embedded in it. That may be indicative of a data quality issue, so we try to find those kinds of things. Going a step further, we also try to identify data quality relationships. For example, we have two columns, date_1 and date_2. Through observation, we can see that date_1 is less than date_2 99% of the time; the 1% of the time it's not is probably indicative of a data quality issue. And going a step further still, we can build a business rule that says date_1 must be less than date_2 — so when the issue pops up again, we can quickly identify and remediate the problem.
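To make the mechanics of that story concrete, here is a minimal Python sketch of pattern-based column classification and rule suggestion. It is an illustration only, not Io-Tahoe's actual dictionaries or algorithms: the regular expressions, thresholds and sample values are all assumptions made for the example.

    import re

    # Hypothetical stand-ins for the product's dictionaries and ontologies.
    PATTERNS = {
        "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
        "icd10_code": re.compile(r"^[A-TV-Z]\d{2}(\.\d{1,4})?$"),
    }

    def classify_column(values, threshold=0.9):
        """Tag a column by the share of non-null values matching each pattern.

        A high-but-imperfect match ratio (say 0.97) both tags the column and
        flags the non-matching remainder as data-quality suspects."""
        non_null = [str(v) for v in values if v]
        tags = {}
        for tag, pattern in PATTERNS.items():
            ratio = sum(bool(pattern.match(v)) for v in non_null) / max(len(non_null), 1)
            if ratio >= threshold:
                tags[tag] = round(ratio, 2)
        return tags

    def suggest_order_rule(pairs, tolerance=0.01):
        """If date_a <= date_b holds for nearly all rows, propose it as a rule;
        the violating rows are the data-quality issues to remediate.
        ISO-format date strings compare correctly as plain strings."""
        violations = [p for p in pairs if p[0] > p[1]]
        proposed = len(violations) / max(len(pairs), 1) <= tolerance
        return proposed, violations

    # attribute_1 looks like a U.S. SSN for 3 of the 4 sample values.
    print(classify_column(["123-45-6789", "987-65-4321", "98-765-4321",
                           "111-22-3333"], threshold=0.7))
    print(suggest_order_rule([("2020-01-01", "2020-06-01"),
                              ("2020-02-01", "2020-03-15")]))

Once columns carry tags like these, a rule such as "a loan's start date precedes its end date" can be declared once against the tags and bound automatically to any database whose columns are tagged the same way — the generic-rule idea described earlier in the segment.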
So these are the kinds of things that we can do with Io-Tahoe. Going even a step further, you could take your favorite data science solution, productionize it, and incorporate it into our next version as what we call a worker process, to do your own bespoke analytics. >>Bespoke analytics — excellent, Lester, thank you. So, Ajay, talk us through some examples of where you're putting this to use, and also, what is some of the feedback from customers? >>I think it helps to bring it to life a little bit, Lisa, to talk through a case study. We pulled something together — I know it's available for download — on a well-known telecommunications and media company. They had a lot of the issues that Lester spoke about: lots of teams of Amandas, super-bright data practitioners, looking to get more productivity out of their day and deliver a good result for their own customers — cell phone subscribers and broadband users. Some of the examples we can see here are how we went about auto-generating a lot of that understanding of data within hours. Amanda had her data catalog populated automatically, and a business glossary built up on it, and could really then start to see: okay, where do I want to apply some policies to the data, to set in place some controls? Where do I want to adapt how different lines of business — maybe tax versus customer operations — have different access or permissions to that data? What we've been able to do there is build up that picture to see how data moves across the entire organization, across the estate, and monitor that over time for improvement. So it's taken them from being reactive — let's do something to fix something — to being more proactive: we can see what's happening with our data, who's using it, who's accessing it, how it's being used, how it's being combined, and from there take a proactive approach. That's a really smart use of the talents in that telco organization, and of the folks who work there with data. >>Okay, Ajay, dig into that a little bit deeper. One of the things I was thinking when you were talking through some of those outcomes you're helping customers achieve is ROI. How do customers measure ROI? What are they seeing with Io-Tahoe's solution? >>Yeah, right now the big-ticket item is time to value. A lot of the upfront investment in data has been quite expensive to date, with a lot of the larger vendors and technologies. So what a CEO and an economic buyer really need to be certain of is: how quickly can I get that ROI? I think we've got something we can show — just pull up a before-and-after — and it really comes down to hours, days and weeks. We've been able to have that impact, and in this playbook that we pulled together, the before-and-after picture really shows those savings: providing data in an actionable form within hours and days to drive agility, while at the same time being able to enforce the controls that protect the use of that data and who has access to it. So time is the number one thing I'd have to call out, and we can see that in the graphic we've just pulled up here. >>We talk about achieving adaptive data governance; Lester, you guys talk about automation, you talk about machine learning. How are you seeing those technologies being a facilitator of organizations adopting adaptive data governance?
Well, >>as we see it, the days of manual effort are over. This is a multi-step process, but the very first step is understanding what you have and normalizing that across your data estate. You couple this with the ontologies that are unique to your business, and the algorithms, and you go across and identify and tag that data. That allows the next steps to happen. So now I can write business rules not in terms of named columns, but in terms of the tags. Being able to automate that is a huge time saver, and the fact that we can suggest a rule — rather than waiting for a person to come along and say, "oh, wow, okay, I need this rule, I need that rule" — these are steps that decrease the time to value that Ajay talked about. And then, lastly, a couple of words on machine learning: even with great automation, and being able to profile all of your data and get a good understanding of it, that only brings you to a certain point; there are still ambiguities in the data. So, for example, I might have two columns, date_1 and date_2. I may even have observed that date_1 should be less than date_2, but I don't really know what date_1 and date_2 are, other than dates. This is where machine learning comes in. I might ask the user: "Can you help me identify what date_1 and date_2 are in this table?" Turns out they're a start date and an end date for a loan. That gets remembered and cycled into the machine learning, so if I start to see this pattern of date_1 and date_2 elsewhere, I'm going to suggest: is this a start date and an end date? Bringing all these things together, with all this automation, is really what's key to enabling adaptive data governance. >>Great — thanks, Lester. And Ajay, I want to wrap things up with something you mentioned at the beginning, about what you're doing with Oracle. Take us out by telling us what you're doing there. How are you working together? >>Yeah, I think those of us who have worked in IT for many years have learned to trust Oracle's technology. They're shifting now to a hybrid, on-prem-plus-cloud generation platform, which is exciting, and their existing customers and new customers are moving to Oracle on that journey. So Oracle came to us and said: we can see how quickly you're able to help us change mindsets — mindsets that are often locked into a way of thinking around IT operating models that may not be agile, that are siloed, and that want to break free of that and adopt a more agile, API-driven approach. A lot of the work we're doing with Oracle now is around accelerating what customers can do with understanding their data and building digital apps, by identifying the underlying data that has value. And the fact that we're able to do that in hours, days and weeks, rather than many months, is opening up the eyes of chief data officers and CIOs to say: well, maybe we can do this whole digital transformation this year; maybe we can bring that forward and transform who we are as a company. That's driving innovation, which we're excited about, and I know Oracle is keen to drive that through as well.
Lester thank you so much for joining me on this segment explaining adaptive data governance, how organizations can use it benefit from it and achieve our Oi. Thanks so much, guys. >>Thank you. Thanks again, Lisa. >>In a moment, we'll look a adaptive data governance in banking. This is the Cube, your global leader in high tech coverage. >>Innovation, impact influence. Welcome to the Cube. Disruptors. Developers and practitioners learn from the voices of leaders who share their personal insights from the hottest digital events around the globe. Enjoy the best this community has to offer on the Cube, your global leader in high tech digital coverage. >>Our next segment here is an interesting panel you're gonna hear from three gentlemen about adaptive data. Governments want to talk a lot about that. Please welcome Yusuf Khan, the global director of data services for Iot Tahoe. We also have Santiago Castor, the chief data officer at the First Bank of Nigeria, and good John Vander Wal, Oracle's senior manager of digital transformation and industries. Gentlemen, it's great to have you joining us in this in this panel. Great >>to be >>tried for me. >>Alright, Santiago, we're going to start with you. Can you talk to the audience a little bit about the first Bank of Nigeria and its scale? This is beyond Nigeria. Talk to us about that. >>Yes, eso First Bank of Nigeria was created 125 years ago. One of the oldest ignored the old in Africa because of the history he grew everywhere in the region on beyond the region. I am calling based in London, where it's kind of the headquarters and it really promotes trade, finance, institutional banking, corporate banking, private banking around the world in particular, in relationship to Africa. We are also in Asia in in the Middle East. >>So, Sanjay, go talk to me about what adaptive data governance means to you. And how does it help the first Bank of Nigeria to be able to innovate faster with the data that you have? >>Yes, I like that concept off adaptive data governor, because it's kind of Ah, I would say an approach that can really happen today with the new technologies before it was much more difficult to implement. So just to give you a little bit of context, I I used to work in consulting for 16, 17 years before joining the president of Nigeria, and I saw many organizations trying to apply different type of approaches in the governance on by the beginning early days was really kind of a year. A Chicago A. A top down approach where data governance was seeing as implement a set of rules, policies and procedures. But really, from the top down on is important. It's important to have the battle off your sea level of your of your director. Whatever I saw, just the way it fails, you really need to have a complimentary approach. You can say bottom are actually as a CEO are really trying to decentralize the governor's. Really, Instead of imposing a framework that some people in the business don't understand or don't care about it, it really needs to come from them. So what I'm trying to say is that data basically support business objectives on what you need to do is every business area needs information on the detector decisions toe actually be able to be more efficient or create value etcetera. Now, depending on the business questions they have to solve, they will need certain data set. So they need actually to be ableto have data quality for their own. For us now, when they understand that they become the stores naturally on their own data sets. 
And that is where my bottom-up meets my top-down. You can guide them from the top, but they need to be empowered themselves, and flexible enough to adapt to the different questions they have, in order to respond to business needs. I cannot impose a single definition on everyone; I need them to adapt and to bring their own answers to their own business questions. That is adaptive data governance. And all of that is possible because — as I was saying at the very beginning, just to finish the point — we have new technologies that allow you to do this metadata classification in a very sophisticated way, so that you can actually run analytics on your metadata. You can understand your different data sources in order to create those classifications — like nationalities, a way of classifying your customers, your products, etcetera. >>So one of the things you just said, Santiago, kind of struck me: to enable the users to be adaptive, they probably don't want to be logging support tickets. So how do you support that sort of self-service to meet the demand of the users, so that they can be adaptive? >>More and more business users want autonomy, and they want to be able to grab the data and answer their own questions. When you have that, it's great, because then you have demand — businesses asking for data, asking for insight. So how do you actually support that? I would say there is a culture change happening more and more, and I would say even the current pandemic has helped a lot with that, because technology has of course been one of the biggest winners — without technology we couldn't have been working remotely, with people working from their homes and still having data marketplaces where they self-serve their information. But even beyond that, data is a big winner. The pandemic has shown us that crises happen, that we cannot predict everything, and that we are facing new kinds of situations outside our comfort zone, where we need to explore, adapt and be flexible. How do we do that with data? Every single company either saw its revenue going down, or — for those companies that are very digital already — saw it going very much up. Reality changed, so they needed to adapt, and for that they needed information, in order to think, innovate and create responses. So that type of self-service of data, that hunger for data, in order to understand what's happening when prospects are changing, is becoming more and more the topic of the day — because of the pandemic, and because of the new capabilities, the technologies that allow it. And then you can help your "data citizens," as I call them, in the organization — people who know their business and can actually start playing with data and answering their own questions. These technologies give more accessibility to the data: cataloging, so they can understand where to go and what to find; lineage and relationships. All of this is the new type of platform and tooling that allows you to create what's called a data marketplace. I think these new tools are really strong, because they are now allowing people who are not technology or IT people to play with data, because it comes to them in the digital world.
To give an example with Io-Tahoe: you have a very interesting search functionality where, if you want to find your data, you go to that search and you look for it. Everybody knows how to search on the internet; everybody's searching the web. So this is part of the data culture, the digital culture: they know how to use those tools. And similarly, in that data marketplace you can, for example, see which data sources are most used. >>And enabling that speed that we're all demanding today during these unprecedented times. John, I wanted to go to you — as we talk about, in the spirit of evolution, technology changing. Talk to us a little bit about Oracle Digital: what are you guys doing there? >>Yeah, thank you. Well, Oracle Digital is a business unit within Oracle EMEA, and we focus on emerging countries, as well as lower enterprises and the mid-market in more developed countries. Four years ago this started with the idea to engage digitally with our customers, from central hubs across EMEA. That means engaging with video, having conference calls, having a wall — a green wall where we stand in front and engage with our customers. No one at that time could have foreseen the situation we're in today, and this setup helps us engage with our customers in the way we were already doing. And then about my team: the focus of my team is to have early-stage conversations with our customers on digital transformation and innovation. We also have a team of industry experts who engage with our customers and share expertise across EMEA, and we inspire our customers. The outcome of these conversations, for Oracle, is a deep understanding of our customer needs, which is very important so we can help the customer; and for the customer it means that we will help them with our technology and our resources to achieve their goals. >>It's all about outcomes, right, John? So, in terms of automation, what are some of the things Oracle's doing to help your clients leverage automation to improve agility, so that they can innovate faster — which, in these interesting times, is demanded? >>Yeah, thank you. Traditionally, Oracle is known for its databases, which have been innovated year over year, and the latest innovation is the Autonomous Database and Autonomous Data Warehouse. For our customers, this means a reduction in operational costs by 90%, with a multi-model converged database and machine-learning-based automation for full lifecycle management. Our database is self-driving: we automate database provisioning, tuning and scaling. The database is self-securing: that means ultimate data protection and security. And it's self-repairing: it automates failure detection, failover and repair. And then the question is: for our customers, what does it mean? It means they can focus on their business instead of maintaining their infrastructure and their operations. >>That's absolutely critical. Yusuf, I want to go over to you now. With everything we've talked about — the massive progression in technology, the evolution of it — we know that whether we're talking about data management or digital transformation, a one-size-fits-all approach doesn't work to address the challenges that the business has, and that the IT folks have, as we've just seen with what Santiago told us about the First Bank of Nigeria.
What are some of the changes that you're seeing — that Io-Tahoe is seeing — throughout the industry? >>Well, Lisa, I think the first way I'd characterize it is to say that the traditional, top-down approach to data — where you have almost a data policeman who tells you what you can and can't do — just doesn't work anymore. It's too slow, it's too resource-intensive. Data management, data governance, digital transformation itself — it has to be collaborative, and there has to be a personalization to data users. In the environment we find ourselves in now, it has to be about enabling self-service as well. A one-size-fits-all model, when it comes to these things around data, doesn't work; as Santiago was saying, it needs to be adapted to how the data is used and who is using it. And in order to do this, companies, enterprises and organizations really need to know their data. They need to understand what data they hold, where it is, and what its sensitivity is. They can then, in a more agile way, apply appropriate controls and access, so that people themselves — and groups within businesses — are agile and can innovate. Otherwise, everything grinds to a halt, and you risk falling behind your competitors. >>Yeah, that one-size-fits-all term just doesn't apply when you're talking about adaptability and agility. So, we heard from Santiago about some of the impact they're making with the First Bank of Nigeria. Yusuf, talk to us about some of the business outcomes you're seeing other customers achieve, leveraging automation, that they could not achieve before. >>It's automatically being able to classify terabytes and terabytes of data, or even petabytes of data, across different sources, to find duplicates — which you can then remediate and delete. With the capabilities that Io-Tahoe offers, and that Oracle offers, you can do things where there's not just a five-times or ten-times improvement: it actually enables you to do projects, full stop, that otherwise would fail, or that you would just not be able to do. Classifying multi-terabyte and multi-petabyte estates across different sources and formats, at very large volumes of data — in many scenarios you just can't do that manually. We've worked with government departments, where the issues are the result of fragmented data: there are a lot of different sources and a lot of different formats, and without these newer technologies that address it with automation and machine learning, the project just isn't doable. But now it is, and that can lead to a revolution in some of these organizations.
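As a toy illustration of the duplicate-finding Yusuf describes — fingerprinting records across sources so exact duplicates can be flagged for remediation — here is a short Python sketch. The caveat: the real platforms do this with machine learning across petabytes; this is a dictionary in memory, and the source names and records are invented for the example.

    import hashlib

    def fingerprint(record):
        """Stable hash of a normalized record; matching hashes flag duplicates."""
        normalized = "|".join(str(v).strip().lower() for v in record)
        return hashlib.sha256(normalized.encode()).hexdigest()

    sources = {
        "crm": [("Ada Lovelace", "10 Downing St"), ("Alan Turing", "Bletchley Park")],
        "billing": [("ada lovelace", "10 downing st ")],  # same person, new source
    }
    seen, duplicates = {}, []
    for source, records in sources.items():
        for record in records:
            key = fingerprint(record)
            if key in seen:
                duplicates.append((seen[key], (source, record)))
            else:
                seen[key] = (source, record)
    print(duplicates)  # the cross-source duplicate pair, ready for remediation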
If you don't want to get left behind: Santiago and the team at FBN are a great example of embracing it, testing it on a small scale, and then scaling up. At Io-Tahoe we offer what we call a data health check, which can actually be done very quickly, in a matter of a few weeks. So we'll work with a customer, pick a use case, install the application, analyze the data, and drive out some quick wins. For example, we worked in the last few weeks with a large energy supplier, and in about 20 days we were able to give them an accurate understanding of their critical data elements, help apply data protection policies, minimize copies of the data, and work out what data they needed to delete to reduce their infrastructure spend. So it's about experimenting on that small scale, being agile, and then scaling up in a kind of very modern way. >>Great advice. Santiago, I'd like to go back to you. As we look again at that topic of culture and the need to get that mindset there to facilitate these rapid changes, I want to understand, as a kind of last question for you, how you're doing that from a digital transformation perspective. We know everything is accelerating in 2020. So how are you building resilience into your data architecture, and also driving the cultural change that can help everyone in this shift to remote working and all of the digital challenges and changes that we're going through? >>The new technologies allowed us to discover the data, to flow and see information very quickly, to have new models of governing the data, and to give autonomy to our different data units. From that autonomy they can then compose and innovate in their own ways. So for me, we're now talking about resilience because, in a way, autonomy and flexibility in an organization, in a data structure, in a platform, give you resilience. The organizations and the business units that in my experience have worked well in the pandemic are those where, exactly because people are no longer physically present in the office, you give them their autonomy, let them actually engage on their own side and do their own job, and trust them. As you give them that, they start innovating and they start having really interesting ideas. So autonomy and flexibility, I think, are key components of the new infrastructure, and of the new reality. It showed us that, yes, we used to be very structured, and policies and procedures are very important, but now we have learned flexibility and adaptability at the same time. Now, when you have that, another key component of resilience is speed, because people want to access the data, access it fast, and act on it fast. Especially since things are changing so quickly nowadays, you need to be able to interact and iterate with your information to answer your questions quickly. So technology that allows you to be flexible, iterating in a very fast way, will allow you to actually be resilient, because you are flexible, you adapt your job, and you continue answering questions as they come, without everything being set in a structure that is too rigid. We are also a partner of Oracle, and Oracle's databases are great: they have embedded within the transactional system many algorithms that are allowing us to calculate as the transactions happen.
What happens there is that when our customers engage with those algorithms, and again with Io-Tahoe as well, the machine learning that is there for speeding up the automation of how you find your data allows you to create a new alliance with the machine. The machine is there to actually be, in a way, your best friend: to handle more volume of data, calculated faster, and to cover more variety. I mean, we couldn't cope without being connected to these algorithms. >>That engagement is absolutely critical. Santiago, thank you for sharing that. I do want to wrap really quickly. Goodwin, one last question for you. Santiago talked about Oracle; you've talked about it a little bit too. As we look at digital resilience, talk to us, in the last minute, about the evolution of Oracle: what you guys are doing there to help your customers get the resilience that they have to have, to not just survive but thrive. >>Yeah. Oracle has a cloud offering for infrastructure, database, platform services, and complete solutions offered as SaaS. As Santiago also mentioned, we are using AI across our entire portfolio, and this will help our customers to focus on their business innovation and capitalize on data by enabling new business models. And Oracle has a global footprint with our cloud regions; it is massively investing in innovating and expanding those clouds. By offering cloud as public cloud, in our data centers, and also as private cloud with Cloud@Customer, we can meet every sovereignty and security requirement. In this way we help people to see data in new ways, discover insights, and unlock endless possibilities. And maybe one of my takeaways, when I speak with customers, is that I always tell them: you had better start collecting your data now. We enable this, and partners like Io-Tahoe help us as well. If you collect your data now, you are ready for tomorrow; you can never collect your data backwards. So that is my takeaway for today. >>You can't collect your data backwards; excellently put. Gentlemen, thank you for sharing all of your insights. Very informative conversation. In a moment, we'll address the question: do you know your data? >>Are you interested in test-driving the Io-Tahoe platform? Kick-start the benefits of data automation for your business through the Io-Tahoe data health check program: a flexible, scalable sandbox environment on the cloud of your choice, with setup, service, and support provided by Io-Tahoe. Book time with a data engineer to learn more and see Io-Tahoe in action. From around the globe, it's theCUBE, presenting adaptive data governance, brought to you by Io-Tahoe. >>In this next segment, we're going to be talking to you about getting to know your data. Specifically, you're going to hear from two folks at Io-Tahoe: we've got enterprise account exec Sabita Davis here, as well as enterprise data engineer Patrick Simon. They're going to be sharing insights and tips and tricks for how you can get to know your data, and quickly. We also want to encourage you to engage with Sabita and Patrick: use the chat feature to the right to send comments, questions, or feedback, so you can participate. All right, Patrick, Sabita, take it away. >>Thanks, Lisa. Great to be here. As Lisa mentioned, guys, I'm the enterprise account executive here at Io-Tahoe. You, Pat? >>Yeah, hey, everyone. So great to be here. As Sabita said, my name is Patrick Simon. I'm the enterprise data engineer here at Io-Tahoe.
And we're so excited to be here and talk about this topic, as one thing we're really trying to perpetuate is that data is everyone's business. >>So, guys, Pat and I have actually had multiple discussions with clients from different organizations, with different roles. So we've spoken with both your technical and your non-technical audience, and while they were interested in different aspects of our platform, we found that what they had in common was that they wanted to make data easy to understand and usable. That comes back to Pat's point of data being everybody's business, because no matter your role, we're all dependent on data. So what Pat and I wanted to do today was walk you guys through some of those client questions, slash pain points, that we're hearing from different industries and different roles, and demo how our platform here at Io-Tahoe is used for automating data-related tasks. So with that said, are you ready for the first one, Pat? >>Yeah, let's do it. >>Great. So I'm going to put my technical hat on for this one. So: I'm a data practitioner. I just started my job at ABC Bank. I have, like, over 100 different data sources. So I have data kept in data lakes, legacy data sources, even the cloud. My issue is that I don't know what those data sources hold, I don't know what data is sensitive, and I don't even understand how that data is connected. So how can Io-Tahoe help? >>Yeah, I think that's a very common experience many are facing, and definitely something I've encountered in my past. Typically the first step is to catalog the data and then start mapping the relationships between your various data stores. Now, more often than not, this is tackled through numerous meetings and a combination of Excel and something similar to Visio, which are two great tools in their own right, but they're very difficult to maintain, just due to the rate at which we are creating data in the modern world. It starts to beg for an idea that can scale with your business needs, and this is where a platform like Io-Tahoe becomes so appealing. You can see here a visualization of the data relationships created by the Io-Tahoe service. Now, what is fantastic about this is that it's not only laid out in a very human and digestible format; in the same action of creating this view, the data catalog was constructed. >>So is the data catalog automatically populated? >>Correct. >>Okay, so what I'm getting with Io-Tahoe is this complete, unified, automated platform, without the added cost? >>Exactly, and that's at the heart of Io-Tahoe. A great feature of that data catalog is that Io-Tahoe will also profile your data as it creates the catalog, assigning some meaning to those pesky column_1s and custom variable_10s; they're always such a joy to deal with. Now, by leveraging this interface, we can start to answer the first part of your question and understand where the core relationships within our data exist. Personally, I'm a big fan of this view, as it really just helps the eye naturally jump to the focal points that coincide with the key columns. Following that train of thought, let's examine the customer ID column that seems to be at the center of a lot of these relationships. We can see that it's a fairly important column, as it's maintaining the relationship between at least three other tables. >>Now, you notice all the connectors are in this blue color. This means that they're system-defined relationships.
But Io-Tahoe goes that extra mile and actually creates these orange-colored connectors as well. These are ones that our machine learning algorithms have predicted to be relationships, and you can leverage them to try to make new and powerful relationships within your data. >>So this is really cool, and I can see how this could be leveraged quickly. Now, what if I add new data sources, or have multiple data sources, and need to identify what data is sensitive? Can Io-Tahoe detect that? >>Yeah, definitely. Within the Io-Tahoe platform there are already over 300 predefined policies, such as HIPAA, CCPA, and the like. One can choose which of these policies to run against their data, allowing for flexibility and efficiency in running the policies that affect your organization. >>Okay, so 300 is an exceptional number, I'll give you that. But what about internal policies that apply to my organization? Is there any ability for me to write custom policies? >>Yeah, that's no issue, and it's something that clients leverage fairly often. To utilize this function, one simply has to write a regex, which our team has helped many deploy. After that, the custom policy is stored for future use. To profile sensitive data, one then selects the data sources they're interested in and selects the policies that meet their particular needs. The interface will automatically tag your data according to the policies it detects, after which you can review the discoveries, confirming or rejecting the tagging. All of these insights are easily exported through the interface, so one can work them into the action items within your project management systems, and I think this lends itself to collaboration, as a team can work through the discoveries simultaneously; as each item is confirmed or rejected, they can see it near-instantaneously. All of this translates to a confidence that, with Io-Tahoe, you can be sure you're in compliance. >>I'm glad you mentioned compliance, because that's extremely important to my organization. So what you're saying is that when we use the Io-Tahoe automated platform, we'd be 90% more compliant than before, versus if we were relying on a human? >>Yeah, definitely. The collaboration and documentation that the Io-Tahoe interface lends itself to really help you build that confidence that your compliance is sound. >>So: we're planning a migration, and I have a set of reports I need to migrate. What I need to know is what data sources those reports are dependent on, and what's feeding those tables. >>Yeah, that's a fantastic question, Sabita. Identifying critical data elements, and the interdependencies within the various databases, can be a time-consuming but vital process in a migration initiative. Luckily, Io-Tahoe does have an answer, and again it's presented in a very visual format. >>So what I'm looking at here is my entire data landscape? >>Yes, exactly. >>Let's say I add another data source; I can still see that unified 360 view? >>Yeah. One feature that is particularly helpful is the ability to add data sources after the data lineage discovery has finished, allowing for the flexibility and scope necessary for any data migration project. Whether you only need to select a few databases or your entire estate, this service will provide the answers you're looking for. This visual representation of the connectivity makes the identification of critical data elements a simple matter.
The connections are driven both by system-defined flows and by those predicted by our algorithms, the confidence of which can actually be customized to make sure they're meeting the needs of the initiative you have in place. This also provides tabular output, in case you need it for your own internal documentation or for your action items, which we can see right here. In this interface you can also confirm or deny the predicted pairings, allowing you to make sure that the data is as accurate as possible. Does that help with your data lineage needs? >>Definitely. So, Pat, my next big question here is this: now I know a little bit about my data, but how do I know I can trust it? What I'm interested in knowing, really, is whether it is in a fit state for me to use. Is it accurate? Does it conform to the right format? >>Yeah, that's a great question, and I think it's a pain point felt across the board, be it by data practitioners or data consumers alike. Another service that Io-Tahoe provides is the ability to write custom data quality rules and understand how well the data adheres to those rules. This dashboard gives a unified view of the strength of these rules and your data's overall quality. >>Okay, so Pat, on the accuracy scores there: if my marketing team needs to run a campaign, can they depend on those accuracy scores to know which tables have quality data to use for the campaign? >>Yeah, this view allows you to understand your overall accuracy as well as dive into the minutiae to see which data elements are of the highest quality. So for that marketing campaign, if you need everything in a strong form, you'll be able to see that very quickly with these high-level numbers. But if you're only dependent on a few columns to get that information out the door, you can find that within this view as well. So you no longer have to rely on reports about reports; instead you just come to this one platform, to help drive conversations between stakeholders and data practitioners. >>So I get now the value Io-Tahoe brings by automatically capturing all that technical metadata from sources. But how do we match that with the business glossary? >>Yeah, within the same data quality service that we just reviewed, one can actually add business rules detailing the definitions and the business domains that these fall into. What's more, the data quality rules we were just looking at can then be tied into these definitions, allowing insight into the strength of the business rules. It's this service that empowers stakeholders across the business to be involved with the data lifecycle and take ownership over the rules that fall within their domain. >>Okay, so those custom rules: can I apply them across data sources? >>Yeah, you can bring in as many data sources as you need, so long as you can tie them to that unified definition. >>Okay, great. Thanks so much, Pat. And we just want to quickly say to everyone working in data: we understand your pain, so please feel free to reach out to us through the chat on our website, and let's get a conversation started on how Io-Tahoe can help you automate all those manual tasks, to help save you time and money. Thank you. >>Thank you. >>If I could ask you one quick question: you just walked us through this great banking example, so how do you advise customers to get started?
>>Yeah, I think the number one thing that customers can do to get started with our platform is to just run the tag discovery and build up that data catalog. It lends itself very quickly to the other needs you might have, such as those quality rules, as well as identifying the kind of tricky columns that might exist in your data, those custom variable_10s I mentioned before. >>Last question, Sabita: anything to add to what Pat just described as a starting place? >>No, I think Pat actually summed that up pretty well. I mean, just by automating all those manual tasks, it definitely can save your company a lot of time and money. So we encourage you: just reach out to us, and let's get that conversation started. >>Excellent. So, Sabita and Pat, thank you so much. We hope you have learned a lot from these folks about how to get to know your data and make sure that it's quality data, so that you can maximize the value of it. Thanks for watching. >>Thanks again, Lisa, for that very insightful and useful deep dive into the world of adaptive data governance with Io-Tahoe, Oracle, and First Bank of Nigeria. This is Dave Vellante. You won't want to miss Io-Tahoe's fifth episode in the data automation series, in which we'll talk to experts from Red Hat and Happiest Minds about their best practices for managing data across hybrid-cloud, inter-cloud, and multi-cloud IT environments. So mark your calendar for Wednesday, January 27th; that's episode five. You're watching theCUBE, your global leader in digital tech coverage.
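As an appendix to the demo segment above: the custom-policy mechanism Pat described is regex-based, so a minimal sketch in Python may help make it concrete. The policy names, patterns, threshold, sample data, and tag_column function below are illustrative assumptions for this sketch, not Io-Tahoe's actual API or policy set.

import re

# Hypothetical regex-based policies, in the spirit of the custom policies
# described in the demo (illustrative only; not Io-Tahoe's policy set).
POLICIES = {
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
}

def tag_column(values, threshold=0.8):
    # Tag a column with every policy whose regex matches at least
    # `threshold` of its non-empty sample values.
    samples = [v for v in values if v]
    tags = set()
    for tag, pattern in POLICIES.items():
        hits = sum(1 for v in samples if pattern.match(v))
        if samples and hits / len(samples) >= threshold:
            tags.add(tag)
    return tags

# Example run over samples from a hypothetical customer table.
column_samples = {
    "customer_id": ["1001", "1002", "1003"],
    "contact": ["ada@example.com", "bob@example.org", "cy@example.net"],
    "ssn": ["123-45-6789", "987-65-4321", ""],
}
for name, values in column_samples.items():
    print(name, "->", sorted(tag_column(values)) or "no sensitive tags")

Reviewing, confirming, or rejecting the resulting tags, as in the demo, would then be a human step layered on top of this automated pass.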

Published Date : Dec 10 2020

4 3 Ruha for Transcript


 

>>Thank you. Thank you so much for having me. I'm thrilled to be in conversation with you today, and I thought I would just kick things off with some opening reflections on this really important session theme, and then we can jump into discussion. So I'd like us to, as a starting point, wrestle with these buzzwords, empowerment and inclusion, so that we can have them be more than kind of big platitudes and really have them reflected in our workplace cultures and the things that we design and the technologies that we put out into the world. And to do that, I think we have to move beyond techno-determinism, and I'll explain what that means in just a minute. Techno-determinism comes in two forms. The first, on your left, is the idea that technology, automation, all of these emerging trends are going to harm us, are going to necessarily harm humanity. They're going to take all the jobs; they're going to remove human agency. This is what we might call the techno-dystopian version of the story, and this is what Hollywood loves to sell us in the form of movies like The Matrix or Terminator. The other version, on your right, is the techno-utopian story: that technologies, automation, the robots, as a shorthand, are going to save humanity. They're going to make everything more efficient, more equitable. On the surface these seem like opposing narratives, right? They're telling us different stories; at least they have different endpoints. But when you pull back the screen and look a little more closely, you see that they share an underlying logic: that technology is in the driver's seat, and that human beings, that society, can only respond to what's happening but don't really have a say in what technologies are designed. And so, to move beyond techno-determinism, the notion that technology is in the driver's seat, we have to put the human agents and agencies back into the story, as protagonists, and think carefully about what the human desires, worldviews, values, and assumptions are that animate the production of technology. We have to put the humans behind the screen back into view. That's a very first step, and when we do that, we see, as was already mentioned, that it's a very homogeneous group right now, in terms of who gets the power and the resources to produce the digital and physical infrastructure that everyone else has to live with. And so, as a first step, we need to think about how to create more participation of those who are working behind the scenes to design technology. Now, to dig a little deeper into this, I want to offer a kind of low-tech example before we get to the more high-tech ones. So what you see in front of you here is a simple public park bench, located in Berkeley, California, which is where I went to graduate school. On this one particular visit I was living in Boston, and so I was back in California; it was February, it was freezing where I was coming from, and so I wanted to take a few minutes in between meetings to just lay out in the sun and soak in some vitamin D. And I quickly realized I actually couldn't lay down on the bench, because of the way it had been designed, with these armrests at intermittent intervals. And so here I thought, okay, the armrests have a functional reason why they're there: I mean, you could literally rest your elbows there, or it can create a little bit of privacy for someone sitting there that you don't know.
When I was nine months pregnant, it could help me get up and down, or the same for the elderly. So it has a lot of functional reasons, but I also thought about the fact that it prevents people who are homeless from sleeping on the bench. And this is the Bay Area that we're talking about, where in fact the tech boom has gone hand in hand with a housing crisis; those things have grown in tandem. So innovation has grown with inequity, because we haven't thought carefully about how to address the social context in which technology grows and blossoms. And so I thought: this crisis is growing in this area, so perhaps this is a deliberate attempt to make sure that people don't sleep on the benches, by the way they're designed and where they're implemented. And so this is what we might call structural inequity: by the way something is designed, it has certain effects that exclude or harm different people. It may not necessarily be the intent, but that's the effect. And I did a little digging, and I found that in fact it's a global phenomenon, this thing that architects call hostile architecture. I found single-occupancy benches in Helsinki, so only one booty at a time, no lying down there. I found caged benches in France. And in this particular town, what's interesting is that the mayor put these benches out in a little shopping plaza, and within 24 hours the people in the town rallied together and had them removed. So we see here that just because we have a discriminatory design in our public space doesn't mean we have to live with it; we can actually work together to ensure that our public space reflects our better values. But I think my favorite example of all is the metered bench. In this case, the bench is designed with spikes in it, and to get the spikes to retreat into the bench, you have to feed the meter. You have to put some coins in, and I think it buys you about 15, 20 minutes; then the spikes come back up. And you will be happy to know that in this case it was designed by a German artist to get people to think critically about issues of design: not just the design of physical space, but the design of all kinds of things, public policies included. And so we can think about how our public life in general is metered: it serves those who can pay the price, and others are excluded or harmed, whether we're talking about education or healthcare. The metered bench also presents something interesting for those of us who care about technology: it creates a technical fix for a social problem. In fact, it started out as art, but some municipalities in different parts of the world have actually adopted it in their public spaces, in their parks, in order to deter so-called loiterers from using that space. And so by a technical fix we mean something that creates a short-term effect, right? It gets people who may want to sleep on the bench out of sight; they're unable to use it. But it doesn't address the underlying problems that create the need to sleep outside in the first place. And so, in addition to techno-determinism, we have to think critically about technical fixes that don't address the underlying issues that the technology is meant to solve. This is part of a broader issue of discriminatory design, and we can apply the bench metaphor to all kinds of things that we work with or that we create.
And the question we really have to continuously ask ourselves is: what values are we building into the physical and digital infrastructures around us? What are the spikes that we may unwittingly put into place? Or perhaps we didn't create the spikes; perhaps we started a new job or a new position, and someone hands us something, saying this is the way things have always been done. So we inherit the spiked bench. What is our responsibility when we notice that it's creating these kinds of harms or exclusions, or technical fixes that bypass the underlying problem? All of this came to a head in the context of financial technologies. I don't know how many of you remember these high-profile cases of tech insiders and CEOs who applied for Apple's... the Apple Card. In one case a husband and wife applied, and the husband received a much higher limit, almost 20 times the limit of his wife, even though they shared bank accounts and lived in a common-law state. And so the question there was not only the fact that the husband was receiving a better interest rate and a higher limit, but also that there was no mechanism for the individuals involved to dispute what was happening. They didn't even know what the factors were by which they were being judged, factors that were creating this form of discrimination. So in terms of financial technologies, it's not simply the outcome that's the issue, or that can be discriminatory, but the process that black-boxes all of the decision-making, so that consumers and the general public have no way to question it, no way to understand how they're being judged adversely. And so it's the process, not only the product, that we have to care a lot about. The case of the Apple Card is part of a much broader phenomenon of racist and sexist robots; this is how the headlines framed it a few years ago. And I was so interested in this framing, because there was a first wave of stories that seemed to be shocked at the prospect that technology is not neutral. Then there was a second wave of stories that seemed less surprised: well, of course technology inherits its creators' biases. And now I think we've entered a phase of attempts to override and address the default settings of so-called racist and sexist robots, for better or worse. Here, robots is just a kind of shorthand for the way that people are talking about automation and emerging technologies more broadly. And so, as I was encountering these headlines, I was thinking about how these are not problems simply brought on by machine learning or AI; they're not all brand new. And so I wanted to contribute to the conversation a kind of larger context and a longer history, for us to think carefully about the social dimensions of technology. And so I developed a concept called the New Jim Code, which plays on the phrase Jim Crow, which is the way that the regime of white supremacy and inequality in this country was defined in a previous era. And I wanted us to think about how that legacy continues to haunt the present: how we might be coding bias into emerging technologies, the danger being that we imagine those technologies to be objective. And so this gives us a language to be able to name this phenomenon so that we can address it and change it. Under this larger umbrella of the New Jim Code are four distinct ways that this phenomenon takes shape, starting from the more obvious: engineered inequity.
Those are the kinds of tech-mediated inequities that we can generally see coming; they're kind of obvious. But as we go down the line, we see it becomes harder to detect: it's happening in our own backyards, it's happening around us, and we don't really have a view into the black box, and so it becomes more insidious. So in the remaining couple of minutes, I'm just going to give you a taste of the last three of these and then move towards a conclusion; then we can start chatting. When it comes to default discrimination, this is the way that social inequalities become embedded in emerging technologies because the designers of these technologies aren't thinking carefully about history and sociology. A great example of this came to the headlines last fall, when it was found that a widely used healthcare algorithm, affecting millions of patients, was discriminating against black patients. What's especially important to note here is that this healthcare algorithm does not explicitly take note of race; that is to say, it is race-neutral. By using cost to predict healthcare needs, this digital triaging system unwittingly reproduces health disparities, because, on average, black people have incurred fewer costs for a variety of reasons, including structural inequality. So in my review of this study by Obermeyer and colleagues, I want to draw attention to how indifference to social reality can be even more harmful than malicious intent. It doesn't have to be the intent of the designers to create this effect. And so we have to look carefully at how indifference is operating, and how race neutrality can be a deadly force. When we move on to the next iteration of the New Jim Code, coded exposure, there's a tension, because on the one hand you see this image where the darker-skinned individual is not being detected by the facial recognition system, right there on the camera, on the computer. And so coded exposure names this tension between wanting to be seen, included, and recognized, whether it's in facial recognition or in recommendation systems or in tailored advertising, and the opposite of that: when you're over-included, when you're surveilled, when you're too centered. And so we should note that it's not simply being left out that's the problem; it's also being included in harmful ways. And so I want us to think carefully about the rhetoric of inclusion, and understand that inclusion is not simply an endpoint; it's a process, and it is possible to include people in harmful processes. So we want to ensure that the process is not harmful, for it to really be effective. The last iteration of the New Jim Code, the most insidious, let's say, is technologies that are touted as helping us address bias. So they're not simply including people; they're actively working to address bias. And in this case, there are a lot of different companies that are using AI to create hiring software and hiring algorithms, including this one, HireVue.
And this is the idea of techno benevolence, trying to do good without fully reckoning with what, how technology can reproduce inequalities. So some colleagues of mine at Princeton, um, tested a natural learning processing algorithm and was looking to see whether it exhibited the same, um, tendencies that psychologists have documented among humans. And what they found was that in fact, the algorithm associated black names with negative words and white names with pleasant sounding words. And so this particular audit builds on a classic study done around 2003 before all of the emerging technologies were on the scene where two university of Chicago economists sent out thousands of resumes to employers in Boston and Chicago. >>And all they did was change the names on those resumes. All of the other work history education were the same. And then they waited to see who would get called back and the applicants, the fictional applicants with white sounding names received 50% more callbacks than the, the black applicants. So if you're presented with that study, you might be tempted to say, well, let's let technology handle it since humans are so biased. But my colleagues here in computer science found that this natural language processing algorithm actually reproduced those same associations with black and white names. So two with gender coded words and names as Amazon learned a couple years ago, when its own hiring algorithm was found discriminating against women, nevertheless, it should be clear by now why technical fixes that claim to bypass human biases are so desirable. If only there was a way to slay centuries of racist and sexist demons with a social justice bot beyond desirable, more like magical, magical for employers, perhaps looking to streamline the grueling work of recruitment, but a curse from any job seekers as this headline puts it. >>Your next interview could be with a racist bot, bringing us back to that problem space. We started with just a few minutes ago. So it's worth noting that job seekers are already developing ways to subvert the system by trading answers to employers tests and creating fake applications as informal audits of their own. In terms of a more collective response. There's a Federation of European trade unions call you and I global that's developed a charter of digital rights for workers that touches on automated and AI based decisions to be included in bargaining agreements. And so this is one of many efforts to change the ecosystem, to change the context in which technology is being deployed to ensure more protections and more rights for everyday people in the U S there's the algorithmic accountability bill that's been presented. And it's one effort to create some more protections around this ubiquity of automated decisions. >>And I think we should all be calling for more public accountability when it comes to the widespread use of automated decisions. Another development that keeps me somewhat hopeful is that tech workers themselves are increasingly speaking out against the most egregious forms of corporate collusion with state sanctioned racism. And to get a taste of that, I encourage you to check out the hashtag tech, won't build it among other statements that they've made and walking out and petitioning their companies. One group said as the, at Google at Microsoft wrote as the people who build the technologies that Microsoft profits from, we refuse to be complicit in terms of education, which is my own ground zero. 
Um, it's a place where we can, we can grow a more historically and socially literate approach to tech design. And this is just one resource that you all can download, um, by developed by some wonderful colleagues at the data and society research Institute in New York. >>And the, the goal of this intervention is threefold to develop an intellectual understanding of how structural racism operates and algorithms, social media platforms and technologies not yet developed and emotional intelligence concerning how to resolve racially stressful situations within organizations and a commitment to take action, to reduce harms to communities of color. And so as a final way to think about why these things are so important, I want to offer, uh, a couple last provocations. The first is pressed to think a new about what actually is deep learning when it comes to computation. I want to suggest that computational depth when it comes to AI systems without historical or social depth is actually superficial learning. And so we need to have a much more interdisciplinary, integrated approach to knowledge production and to observing and understanding patterns that don't simply rely on one discipline in order to map reality. >>The last provocation is this. If as I suggested at the start in the inequity is woven into the very fabric of our society. It's built into the design of our, our policies, our physical infrastructures, and now even our digital infrastructures. That means that each twist coil and code is a chance for us to weave new patterns, practices, and politics. The vastness of the problems that we're up against will be their undoing. Once we, that we are pattern makers. So what does that look like? It looks like refusing colorblindness as an anecdote to tech media discrimination, rather than refusing to see difference. Let's take stock of how the training data and the models that we're creating. Have these built in decisions from the past that have often been discriminatory. It means actually thinking about the underside of inclusion, which can be targeting and how do we create a more participatory rather than predatory form of inclusion. And ultimately it also means owning our own power in these systems so that we can change the patterns of the past. If we're, if we inherit a spiked bench, that doesn't mean that we need to continue using it. We can work together to design more, just an equitable technologies. So with that, I look forward to our conversation.

Published Date : Nov 25 2020

4-video test


 

>>Okay, this is my presentation on coherent nonlinear dynamics and combinatorial optimization. This is going to be a talk to introduce an approach we're taking to the analysis of the performance of coherent Ising machines. So let me start with a brief introduction to Ising optimization. The Ising model represents a set of interacting magnetic moments, or spins, with the total energy given by the expression shown at the bottom left of this slide (a standard form is reconstructed just after this passage). Here the sigma variables are spins that take binary values, the matrix element J_ij represents the interaction strength and sign between any pair of spins i and j, and h_i represents a possible local magnetic field acting on each spin. The Ising ground state problem is to find an assignment of binary spin values that achieves the lowest possible value of the total energy, and an instance of the Ising problem is specified by giving numerical values for the matrix J and the vector h. Although the Ising model originates in physics, we understand the ground state problem to correspond to what would be called quadratic binary optimization in the field of operations research, and in fact, in terms of computational complexity theory, it can be established that the Ising ground state problem is NP-complete. Qualitatively speaking, this makes the Ising problem a representative sort of hard optimization problem, for which it is expected that the runtime required by any computational algorithm to find exact solutions should asymptotically scale exponentially with the number of spins n, for worst-case instances at each n. Of course, there's no reason to believe that the problem instances that actually arise in practical optimization scenarios are going to be worst-case instances. And it's also not generally the case in practical optimization scenarios that we demand absolute optimum solutions; usually we're more interested in just getting the best solution we can within an affordable cost, where cost may be measured in terms of time, service fees, and/or energy required for a computation. This focuses great interest on so-called heuristic algorithms, for the Ising problem and other NP-complete problems, which generally get very good but not guaranteed-optimum solutions and run much faster than algorithms that are designed to find absolute optima. To get some feeling for present-day numbers, we can consider the famous traveling salesman problem, for which extensive compilations of benchmarking data may be found online. A recent study found that the best known TSP solver required median run times, across a library of problem instances, that scaled as a very steep root exponential for n up to approximately 4,500. This gives some indication of the change in runtime scaling for generic, as opposed to worst-case, problem instances. Some of the instances considered in this study were taken from a public library of TSPs derived from real-world VLSI design data. This VLSI TSP library includes instances with n ranging from 131 to 744,710. Instances from this library with n between 6,880 and 13,584 were first solved just a few years ago, in 2017, requiring days of run time on a 48-core 2-GHz cluster, while instances with n greater than or equal to 14,233 remain unsolved exactly by any means.
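For reference: the Ising energy described verbally at the start of this talk has a standard form, reconstructed below in LaTeX. Conventions for signs and normalization vary across the literature, so treat this as a reconstruction of the type of expression on the slide rather than a verbatim transcription of it.

E(\sigma) = -\sum_{i<j} J_{ij}\,\sigma_i\,\sigma_j - \sum_i h_i\,\sigma_i, \qquad \sigma_i \in \{-1,+1\}

Under this convention, a ferromagnetic coupling (J_{ij} > 0) lowers the energy of aligned spin pairs, which matches the two-OPO discussion later in the talk.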
Approximate solutions, however, have been found by heuristic methods for all instances in the VLSI TSP library, with, for example, a solution within 0.14% of a known lower bound having been discovered for an instance with n equal to 19,289, requiring approximately two days of run time on a single core at 2.4 GHz. Now, if we simple-mindedly extrapolate the root-exponential scaling from the study beyond n equal to 4,500, we might expect that an exact solver would require something more like a year of run time on the 48-core cluster used for the n equals 13,584 instance, which shows how much a very small concession on the quality of the solution makes it possible to tackle much larger instances with much lower cost. At the extreme end, the largest TSP ever solved exactly has n equal to 85,900. This is an instance derived from a 1980s VLSI design, and it required 136 CPU-years of computation, normalized to a single core at 2.4 GHz. But the much larger so-called World TSP benchmark instance, with n equal to 1,904,711, has been solved approximately, within an optimality gap bounded below 0.474%. Coming back to the general practical concerns of applied optimization, we may note that a recent meta-study analyzed the performance of no fewer than 37 heuristic algorithms for Max-Cut and quadratic binary optimization problems, and found that different heuristics work best for different problem instances selected from a large-scale heterogeneous test bed, with some evidence of cryptic structure in terms of what types of problem instances were best solved by any given heuristic. Indeed, there are reasons to believe that these results from Max-Cut and quadratic binary optimization reflect a general principle of performance complementarity among heuristic optimization algorithms in the practice of solving hard optimization problems. There arises a critical pre-processing issue of trying to guess which of a number of available good heuristic algorithms should be chosen to tackle a given problem instance. Assuming that any one of them would incur high costs to run on a large problem instance, making an astute choice of heuristic is a crucial part of maximizing overall performance. Unfortunately, we still have very little conceptual insight about what makes a specific problem instance good or bad for any given heuristic optimization algorithm. This has certainly been pinpointed by researchers in the field as a circumstance that must be addressed. So adding this all up, we see that a critical frontier for cutting-edge academic research involves both the development of novel heuristic algorithms that deliver better performance with lower cost, on classes of problem instances that are underserved by existing approaches, and fundamental research to provide deep conceptual insight into what makes a given problem instance easy or hard for such algorithms. In fact, these days, as we talk about the end of Moore's law and speculate about a so-called second quantum revolution, it's natural to talk not only about novel algorithms for conventional CPUs but also about highly customized special-purpose hardware architectures, on which we may run entirely unconventional algorithms for combinatorial optimization, such as the Ising problem. So against that backdrop, I'd like to use my remaining time to introduce our work on the analysis of coherent Ising machine architectures and associated optimization algorithms.
These machines, in general, are a novel class of information processing architectures for solving combinatorial optimization problems by embedding them in the dynamics of analog, physical, or cyber-physical systems, in contrast both to more traditional engineering approaches that build Ising machines using conventional electronics, and to more radical proposals that would require large-scale quantum entanglement. The emerging paradigm of coherent Ising machines leverages coherent nonlinear dynamics in photonic or opto-electronic platforms to enable near-term construction of large-scale prototypes that exploit post-Moore information dynamics. The general structure of current CIM systems is shown in the figure on the right. The role of the Ising spins is played by a train of optical pulses circulating around a fiber-optic storage ring. A beam splitter inserted in the ring is used to periodically sample the amplitude of every optical pulse, and the measurement results are continually read into an FPGA, which uses them to compute perturbations to be applied to each pulse by synchronized optical injections. These perturbations are engineered to implement the spin-spin coupling and local magnetic field terms of the Ising Hamiltonian, corresponding to the linear part of the CIM dynamics. A synchronously pumped parametric amplifier, denoted here as a PPLN waveguide, adds a crucial nonlinear component to the CIM dynamics as well. In the basic CIM algorithm, the pump power starts very low and is gradually increased. At low pump powers, the amplitudes of the Ising spin pulses behave as continuous complex variables, whose real parts, which can be positive or negative, play the role of soft, or perhaps mean-field, spins. Once the pump power crosses the threshold for parametric self-oscillation in the optical fiber ring, however, the amplitudes of the Ising spin pulses become effectively quantized into binary values. While the pump power is being ramped up, the FPGA subsystem continuously applies its measurement-based feedback implementation of the Ising Hamiltonian terms. The interplay of the linearized Ising dynamics implemented by the FPGA and the threshold quantization dynamics provided by the synchronously pumped parametric amplifier results in a final state of the optical pulse amplitudes, at the end of the pump ramp, that can be read out as a binary string, giving a proposed solution of the Ising ground state problem. This method of solving the Ising problem seems quite different from a conventional algorithm that runs entirely on a digital computer, as a crucial aspect of the computation is performed physically by the analog, continuous, coherent nonlinear dynamics of the optical degrees of freedom. In our efforts to analyze CIM performance, we have therefore turned to the tools of dynamical systems theory: namely, a study of bifurcations, the evolution of critical points, and topologies of heteroclinic orbits and basins of attraction. We conjecture that such analysis can provide fundamental insight into what makes certain optimization instances hard or easy for coherent Ising machines, and hope that our approach can lead both to improvements of the core CIM algorithm and to a pre-processing rubric for rapidly assessing the CIM suitability of new instances. Okay, to provide a bit of intuition about how this all works, it may help to consider the threshold dynamics of just one or two optical parametric oscillators (OPOs) in the CIM architecture just described.
Okay, to provide a bit of intuition about how this all works, it may help to consider the threshold dynamics of just one or two optical parametric oscillators in the CIM architecture just described. We can think of each of the pulse time slots circulating around the fiber ring as representing an independent OPO. We can think of a single OPO degree of freedom as a single resonant optical mode that experiences linear dissipation, due to out-coupling loss, and gain in a pumped nonlinear crystal, as shown in the diagram on the upper left of this slide. As the pump power is increased from zero, as in the CIM algorithm, the nonlinear gain is initially too low to overcome the linear dissipation, and the OPO field remains in a near-vacuum state. At a critical threshold value, gain equals dissipation, and the OPO undergoes a sort of lasing transition; the steady states of the OPO above this threshold are essentially coherent states. There are actually two possible values of the OPO coherent amplitude at any given above-threshold pump power, which are equal in magnitude but opposite in phase, and when the OPO crosses this threshold it basically chooses one of the two possible phases randomly, resulting in the generation of a single bit of information. If we consider two uncoupled OPOs, as shown in the upper-right diagram, pumped at exactly the same power at all times, then as the pump power is increased through threshold, each OPO will independently choose a phase, and thus two random bits are generated; for any number of uncoupled OPOs, the threshold power per OPO is unchanged from the single-OPO case. Now, however, consider a scenario in which the two OPOs are coupled to each other by mutual injection of their out-coupled fields, as shown in the diagram on the lower right. One can imagine that, depending on the sign of the coupling parameter alpha, when one OPO is lasing, it will inject a perturbation into the other that may interfere either constructively or destructively with the field that it is trying to generate by its own lasing process. As a result, one can easily show that for alpha positive there is an effective ferromagnetic coupling between the two OPO fields, and their collective oscillation threshold is lowered from that of the independent-OPO case, but only for the two collective oscillation modes in which the two OPO phases are the same; for alpha negative, the collective oscillation threshold is lowered only for the configurations in which the OPO phases are opposite. So then, looking at how alpha is related to the J_ij matrix of the Ising spin-coupling Hamiltonian, it follows that we could use this simplistic two-OPO CIM to solve the ground-state problem of the ferromagnetic or antiferromagnetic n = 2 Ising model, simply by increasing the pump power from zero and observing what phase relation occurs as the two OPOs first start to lase. This two-mode threshold argument can be checked with the quick linear-stability computation sketched below.
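As a quick numerical check (a sketch under the linearized, below-threshold approximation; the value of alpha is arbitrary), one can diagonalize the mutual-injection coupling and read off which collective mode reaches threshold first:

```python
import numpy as np

def two_opo_thresholds(alpha):
    # Linearized below-threshold dynamics of two coupled OPO amplitudes:
    #   dx/dt = (p - 1) * x + alpha * M @ x, with M the mutual injection.
    M = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    mu, vecs = np.linalg.eigh(alpha * M)
    # A mode with eigenvalue mu starts to grow when p - 1 + mu = 0, i.e.
    # at pump p = 1 - mu, so the largest mu gives the lowest threshold.
    for m, v in zip(mu, vecs.T):
        print(f"alpha={alpha:+.1f}  mode {np.round(v, 2)}  threshold p={1 - m:.2f}")

two_opo_thresholds(+0.2)   # in-phase mode wins: ferromagnetic alignment
two_opo_thresholds(-0.2)   # out-of-phase mode wins: antiferromagnetic
```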
Clearly, we can imagine generalizing this story to larger n; however, the story doesn't stay as clean and simple for all larger problem instances, and to find a more complicated example we only need to go to n = 4. For some choices of J_ij at n = 4 the story remains simple, like in the n = 2 case: the figure on the upper left of this slide shows the energy of various critical points for a non-frustrated n = 4 instance, in which the first bifurcated critical point, that is, the one that bifurcates at the lowest pump value a, flows asymptotically into the lowest-energy Ising solution. In the figure on the upper right, however, the first bifurcated critical point flows to a very good but sub-optimal local minimum at large pump power; the global minimum is actually given by a distinct critical point that first appears at a higher pump power and is not adiabatically connected to the origin. The basic CIM algorithm is thus not able to find this global minimum. Such non-ideal behaviors seem to become more common at larger n. For the n = 20 instance shown in the lower plots, where the lower-right plot is just a zoom into a region of the lower-left plot, it can be seen that the global minimum corresponds to a critical point that first appears at a pump parameter a around 0.16, at some distance from the adiabatic trajectory of the origin. It's curious to note that in both of these small-n examples, however, the critical point corresponding to the global minimum appears relatively close to the adiabatic trajectory of the origin, as compared to most of the other local minima that appear. We're currently working to characterize the phase-portrait topology between the global minimum and the adiabatic trajectory of the origin, seeking clues as to how the basic CIM algorithm could be generalized to search for non-adiabatic trajectories that jump to the global minimum during the pump ramp. Of course, n = 20 is still too small to be of interest for practical optimization applications, but the advantage of beginning with the study of small instances is that we're able reliably to determine their global minima and to see how they relate to the adiabatic trajectory of the origin in the basic CIM algorithm. In the small-n limit, we can also analyze fully quantum-mechanical models of CIM dynamics, but that's a topic for future talks. Existing large-scale prototypes are pushing into the range of n = 10^4 to 10^5 to 10^6, so our ultimate objective in theoretical analysis really has to be to say something about CIM dynamics in the regime of much larger n. Our initial approach to characterizing CIM behavior in the large-n regime relies on the use of random matrix theory, and this connects to prior research on spin glasses, SK models, the TAP equations, et cetera. At present, we're focusing on statistical characterization of the CIM gradient-descent landscape, including the evolution of critical points and their eigenvalue spectra as the pump power is gradually increased. We're investigating, for example, whether there could be some way to exploit differences in the relative stability of the global minimum versus other local minima. We're also working to understand the deleterious, or potentially beneficial, effects of non-idealities, such as asymmetry in the implemented Ising couplings. Looking one step ahead, we plan to move next in the direction of considering more realistic classes of problem instances, such as quadratic binary optimization with constraints. So, in closing, I should acknowledge the people who did the hard work on the things that I've shown. My group, including graduate students Edwin Ng, Daniel Wennberg, Tatsuya Nagamoto, and Atsushi Yamamura, has been working in close collaboration with Surya Ganguli, Marty Fejer, and Amir Safavi-Naeini, all of us within the Department of Applied Physics at Stanford University, and also in collaboration with Yoshihisa Yamamoto over at NTT PHI research labs. I should also acknowledge funding support from the NSF through the Coherent Ising Machines Expedition in Computing, and also from NTT PHI research labs, the Army Research Office, and ExxonMobil. That's it. Thanks very much.
>> Mhm. >> I'd like to thank NTT Research and Yoshi for putting together this program, and also for the opportunity to speak here. My name is Alireza Marandi and I'm from Caltech, and today I'm going to tell you about the work that we have been doing on networks of optical parametric oscillators: how we have been using them as Ising machines, and how we're pushing them toward quantum photonics. Let me acknowledge my team at Caltech, which is now eight graduate students and five researchers and postdocs, as well as collaborators from all over the world, including NTT Research, and also the funding from different places, including NTT. So this talk is primarily about networks of resonators, and these networks are everywhere, from nature, for instance the brain, which is a network of oscillators, all the way to optics and photonics. Some of the biggest examples are metamaterials, which are arrays of small resonators, and more recently the field of topological photonics, which is trying to implement a lot of the topological behaviors of condensed-matter models in photonics. And if you want to extend it even further, some of the implementations of quantum computing are technically networks of quantum oscillators. So we started thinking about these things in the context of Ising machines, which are based on the Ising problem, which is based on the Ising model: the simple summation over spins, where the spins can be either up or down and the couplings are given by the J_ij. And the Ising problem is: if you know the J_ij, what is the spin configuration that gives you the ground state? This problem has been shown to be an NP-hard problem, so it's computationally important because it's representative of the NP problems, and NP problems are important because, first, they're hard on standard computers if you use brute-force algorithms, and second, they're everywhere on the application side. That's why there is this demand for making a machine that can target these problems and hopefully provide some meaningful computational benefit compared to standard digital computers. So I've been building these Ising machines based on this building block, which is a degenerate optical parametric oscillator. What it is is a resonator with nonlinearity in it; we pump these resonators and we generate a signal at half the frequency of the pump: one photon of the pump splits into two identical photons of signal, and they have some very interesting phase- and frequency-locking behaviors. And if you look at the phase-locking behavior, you realize that you can actually have two possible phase states as the oscillation result of these OPOs, which are off by pi, and that's one of their important characteristics. So I want to emphasize that a little more, and I have this mechanical analogy, which is basically two simple pendula. But they are parametric oscillators, because I'm going to modulate a parameter of them in this video, namely the length of the string, and that modulation acts as a pump: I make them oscillate, and that makes a signal which is at half the frequency of the pump. And I have two of them, to show you that they can acquire these phase states: they're still phase- and frequency-locked to the pump, but they can end up in either the zero or the pi phase state. The idea is to use this binary phase to represent the binary Ising spin, so each OPO is going to represent a spin, which can be either zero or pi, up or down.
And to implement a network of these resonators, we use the time-multiplexing scheme. The idea is that we put pulses in the cavity, and these pulses are separated by the repetition period that you put in, or T_R; you can think about these pulses in one resonator as temporally separated synthetic resonators. If you want to couple these resonators to each other, you can introduce delays, each of which is a multiple of T_R. If you look at the shortest delay, it couples resonator one to two, two to three, and so on; if you look at the second delay, which is two times the repetition period, it couples one to three, and so on. And if you have N minus one delay lines, then you can have all potential couplings among these synthetic resonators. And if I can introduce modulators in those delay lines, so that I can control the strength and the phase of these couplings at the right times, then I can have a programmable, all-to-all connected network in this time-multiplexed scheme, and the whole physical size of the system scales linearly with the number of pulses; the bookkeeping of which delay couples which pulses is spelled out in the sketch after this paragraph. So the idea of the OPO-based Ising machine is this: having these OPOs, each of which can be either zero or pi, I can arbitrarily connect them to each other, and then I start by programming this machine to a given Ising problem, just by setting the couplings through the controllers in each of those delay lines. So now I have a network which represents an Ising problem, and the Ising problem maps to finding the phase state that satisfies the maximum number of coupling constraints. The way it happens is that the Ising Hamiltonian maps to the linear loss of the network, and if I start adding gain, by just putting pump into the network, then the OPOs are expected to oscillate in the lowest-loss state. We have been doing this for the past six or seven years, and I'm just going to quickly show you the transitions: especially what happened from the first implementation, which was using a free-space optical system, to the guided-wave implementation in 2016, and then the measurement-feedback idea, which led to increasing the size and doing actual computation with these machines. I just want to make the distinction here that the first implementation was an all-optical interaction; we also had an n = 16 implementation; and then we transitioned to this measurement-feedback idea, which I'll tell you quickly what it is. There's still a lot of ongoing work, especially on the NTT side, to make larger machines using measurement feedback, but I'm going to mostly focus on the all-optical networks: how we're using all-optical networks to go beyond simulation of the Ising Hamiltonian, both on the linear and the nonlinear side, and also how we're working on miniaturization of these OPO networks. So the first experiment, which was the four-OPO machine, was a free-space implementation, and this is the actual picture of the machine; we implemented a small n = 4 Max-Cut problem on the machine. So, one problem for one experiment: we ran the machine 1000 times, we looked at the states, and we always saw it oscillate in one of the ground states of the Ising Hamiltonian.
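For the time-multiplexing bookkeeping mentioned above, here is a small sketch (indices only, ignoring roundtrip wrap-around and the modulator details) of which synthetic-resonator pairs each delay line couples:

```python
def delay_couplings(n_pulses, delays):
    # In the time-multiplexed ring, a delay of d repetition periods T_R
    # couples pulse i to pulse i + d, so delays 1 .. N-1 (with modulators
    # setting strength and phase) suffice for all-to-all connectivity.
    return {d: [(i, i + d) for i in range(n_pulses - d)] for d in delays}

for d, pairs in delay_couplings(5, [1, 2, 3, 4]).items():
    print(f"delay {d}*T_R couples: {pairs}")
```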
Then the measurement-feedback idea was to replace those couplings and the controllers with a simulator: we basically simulated all those coherent interactions on an FPGA, we replicated the coherent pulse in accordance with all those measurements, and then we injected it back into the cavity. The nonlinearity still remains, so it still is a nonlinear dynamical system, but the linear side is all simulated. So there are lots of questions about whether this system preserves the important information or not, or whether it's going to behave better computation-wise, and that's still a lot of ongoing study. But nevertheless, the reason this implementation is very interesting is that you don't need the N minus one delay lines; you can just use one. Then you can implement a large machine, run several thousands of problems on the machine, and compare the performance from the computational perspective. So I'm going to split this idea of the OPO-based Ising machine into two parts. One is the linear part: if you take the nonlinearity out of the resonator and just think about the connections, you can think about this as a simple matrix-multiplication scheme, and that's basically what gives you the Ising Hamiltonian mapping; the optical loss of this network corresponds to the Ising Hamiltonian. And to show you the example of the n = 4 experiment: for all those phase states in the histogram that we saw, you can actually calculate the loss of each of those states, because all those interferences in the beam splitters and the delay lines are going to give you different losses, and then you will see that the ground states correspond to the lowest loss of the actual optical network. If you add the nonlinearity, the simple way of thinking about what the nonlinearity does is that it provides the gain; then you start bringing up the gain so that it hits the loss, and you go through the gain saturation, or the threshold, which is going to give you this phase bifurcation, so you go either to the zero or to the pi phase state. And the expectation is that the network oscillates in the lowest possible loss state. There are some challenges associated with this intensity-driven phase transition, which I'm going to briefly talk about, and I'm also going to tell you about other types of nonlinear dynamics that we're looking at on the nonlinear side of these networks. So, if you just think about the linear network, we're actually interested in looking at some topological behaviors in these networks. The difference between looking at topological behaviors and at the Ising machine is that now, first of all, we're looking at types of Hamiltonians that are a little different from the Ising Hamiltonian; one of the biggest differences is that most of these topological Hamiltonians require breaking time-reversal symmetry, meaning that if you go from one spin on one side to the other side you get one phase, and if you go back you get a different phase. And the other thing is that we're not just interested in finding the ground state; we're actually now interested in looking at all sorts of states, and at the dynamics and the behaviors of all those states in the network. So we started with the simplest implementation, of course, which is a 1-D chain of these resonators, which corresponds to the so-called SSH model in the topological world. We get the similar energy-to-loss mapping, and now we can actually look at the band structure; this is an actual measurement that we get with this SSH model, and you see how well it actually follows the prediction and the theory.
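For reference, the textbook SSH dispersion that this kind of measurement is compared against can be generated in a few lines (a standard two-band formula, with intracell and intercell couplings v and w as assumed toy parameters, not the experiment's values):

```python
import numpy as np

def ssh_bands(v, w, num_k=256):
    # Two-band SSH dispersion E(k) = +/- |v + w * exp(i k)|; the gap at
    # the band edge is 2|v - w| and closes at the transition v = w.
    k = np.linspace(-np.pi, np.pi, num_k)
    e = np.abs(v + w * np.exp(1j * k))
    return k, -e, e

k, lower, upper = ssh_bands(v=0.5, w=1.0)   # v < w: topological phase
print("gap at band edge:", upper.min() - lower.max())
```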
One of the interesting things about the time-multiplexing implementation is that now you have the flexibility of changing the network as you are running the machine, and that's something unique about this time-multiplexed implementation, so that we can actually look at the dynamics. One example that we have looked at is that we can actually go through the transition from the topological to the trivial behavior of the network. You can then look at the edge states, and you can see both the trivial end states and the topological edge states actually showing up in this network. We have also just recently implemented a 2-D network with the Harper-Hofstadter model; I don't have the results here, but one of the other important characteristics of time multiplexing is that you can go to higher and higher dimensions while keeping that flexibility and those dynamics. And we can also think about adding nonlinearity, both in the classical and quantum regimes, which is going to give us a lot of exotic nonclassical and quantum nonlinear behaviors in these networks. So I've told you mostly about the linear side; let me switch gears and talk about the nonlinear side of the network. The biggest thing that I've talked about so far in the Ising machine is the phase transition at threshold: below threshold we have squeezed states in these OPOs; if you increase the pump, we go through this intensity-driven phase transition, and then we get the phase states above threshold. This is basically the mechanism of the computation in these OPOs, this phase transition from below to above threshold. One of the characteristics of this phase transition is that below threshold you expect to see quantum states, and above threshold you expect to see more classical, or coherent, states, and that basically corresponds to the intensity of the driving pump; so it's really hard to imagine that you can have this phase transition happen all in the quantum regime. There are also some challenges associated with the intensity homogeneity of the network: for example, if one OPO starts oscillating and its intensity goes really high, then it's going to ruin the collective decision-making of the network, because of the intensity-driven nature of the phase transition. So the question is: can we look at other phase transitions, can we utilize them for computing, and can we bring them to the quantum regime? I'm going to specifically talk about a phase transition in the spectral domain, which is the transition from the so-called degenerate regime, which is what I've mostly talked about, to the non-degenerate regime, which happens by just tuning the phase of the cavity. What is interesting is that this phase transition corresponds to a distinct phase-noise behavior. In the degenerate regime, which we call the ordered state, the phase is locked to the phase of the pump; in the non-degenerate regime, however, the phase is mostly dominated by quantum diffusion of the phase, which is limited by the so-called Schawlow-Townes limit. And you can see that transition from the degenerate to the non-degenerate regime, which also has distinct symmetry differences: this transition corresponds to a symmetry breaking. In the non-degenerate case, the signal can acquire any of those phases on the circle, so it has a U(1) symmetry.
And if you go to the degenerate case, then that symmetry is broken, and you only have the zero and pi phase states. So now the question is: can we utilize this phase transition, which is a phase-driven phase transition, and can we use it for a similar computational scheme? That's one of the questions that we're also thinking about. And this phase transition is not just important for computing; it's also interesting from the sensing point of view, and you can easily bring this phase transition below threshold and just operate in the quantum regime, either Gaussian or non-Gaussian. If you make a network of OPOs now, we can see all sorts of more complicated and more interesting phase transitions in the spectral domain. One of them is a first-order phase transition, which you get by just coupling two OPOs, and that's a very abrupt phase transition compared to the single-OPO phase transition. And if you do the couplings right, you can actually get a lot of non-Hermitian dynamics and exceptional points, which are actually very interesting to explore, both in the classical and quantum regimes. I should also mention that you can think about the couplings being nonlinear couplings as well, and that's another behavior that you can see, especially in the non-degenerate regime. So with that, I've basically told you about these OPO networks, how we can think about the linear scheme and the linear behaviors, and how we can think about the rich nonlinear dynamics and nonlinear behaviors, both in the classical and quantum regimes. Now I want to switch gears and tell you a little bit about the miniaturization of these OPO networks. Of course, the motivation is that if you look at electronics, and what we had 60 or 70 years ago with vacuum tubes, we transitioned from relatively small-scale computers on the order of thousands of nonlinear elements to the billions of nonlinear elements we have now; where optics is today is probably very similar to 70 years ago, a tabletop implementation. And the question is: how can we utilize nanophotonics? I'm going to just briefly show you the two directions we're working on: one is based on lithium niobate, and the other is based on even smaller resonators. So, the work on nanophotonic lithium niobate was started in collaboration with Marko Loncar at Harvard, and also Marty Fejer at Stanford, and we could show that you can do the periodic poling in thin-film lithium niobate and get all sorts of highly efficient nonlinear processes happening in this nanophotonic, periodically poled lithium niobate. And now we're working on building OPOs based on that kind of thin-film lithium niobate photonics; these are some examples of the devices that we have been building in the past few months, which I'm not going to tell you more about, but the OPOs and the OPO networks are in the works. And that's not the only way of making large networks, but I also want to point out that the reason these nanophotonic platforms are actually exciting is not just that you can make large networks and make them compact in a small footprint; they also provide some opportunities in terms of the operation regime. One of them is about making cat states in an OPO: can we have the quantum superposition of the zero and pi phase states that I talked about? And the nanophotonic thin-film lithium niobate
platform provides some opportunities to actually get closer to that regime, because of the spatio-temporal confinement that you can get in these waveguides. So we're doing some theory on that, and we're confident that the ratio of nonlinearity to losses that you can get with these platforms is actually much higher than what you can get with the existing platforms. And to go even smaller, we have been asking the question of what is the smallest possible OPO that you can make: you can think about really wavelength-scale resonators, add the chi-two nonlinearity, and see how and when you can get the OPO to operate. And recently, in collaboration with USC and CREOL, we have demonstrated that you can use nanolasers and get some spin-Hamiltonian implementations on those networks. So if you can build the OPOs, we know that there is a path for implementing OPO networks on such a nanoscale. We have looked at these calculations and tried to estimate the threshold of OPOs, say for a wavelength-scale resonator, and it turns out that it can actually be even lower than for the type of bulk PPLN OPOs that we have been building in the past 50 years or so. So we're working on the experiments, and we're hoping that we can actually make larger and larger scale OPO networks. So let me summarize the talk. I told you about the OPO networks and our work that has been going on on Ising machines and measurement feedback; I told you about the ongoing work on the all-optical implementations, both on the linear side and on the nonlinear behaviors; and I also told you a little bit about the efforts on miniaturization and going to the nanoscale. So with that, I would like to thank you. >> I'm from the University of Tokyo. Before I start, I would like to thank Yoshi and all the staff of NTT for the invitation and the organization of this online meeting, and I would also like to say that it has been very exciting to see the growth of this new PHI lab. I'm happy to share with you today some of the recent works that have been done either by me or by Kazuyuki Aihara's group. The title of my talk is "A neuromorphic in silico simulator for the coherent Ising machine," and here is the outline. I would like to make the case that simulation in digital electronics of the CIM can be useful for better understanding or improving its function principles, by introducing some ideas from neural networks; this is what I will discuss in the first part. Then I will show some proof of concept of the gain in performance that can be obtained using this simulation in the second part, and projections of the performance that can be achieved using a very large-scale simulator in the third part, and finally I will talk about future plans. So first, let me start by comparing recently proposed Ising machines using this table, which is adapted from a recent Nature Electronics paper. This comparison shows that there is always a trade-off between energy efficiency, speed, and scalability that depends on the physical implementation. In red here are the limitations of each of these hardware platforms, and, interestingly, the FPGA-based systems, such as the Fujitsu Digital Annealer, the Toshiba bifurcation machine, or the recently proposed restricted Boltzmann machine on FPGA by a group in Berkeley, offer a good compromise between speed and scalability.
And this is why, despite the unique advantages that some of the other hardware platforms have, such as the coherent superposition in optical CIMs or the energy efficiency of memristors, FPGAs are still an attractive platform for building large Ising machines in the near future. The reason for the good performance of FPGAs is not so much that they operate at high frequency, nor that they are particularly energy efficient, but rather that the physical wiring of their elements can be reconfigured in a way that limits the von Neumann bottleneck: the large fan-in and fan-out, and the long propagation paths for information within the system. In this respect, FPGAs are interesting from the perspective of the physics of complex systems, but within the physics of electrons rather than the physics of photons. To put the performance of these various hardware platforms in perspective, we can look at the computation performed by the brain: the brain computes using billions of neurons, using only about 20 watts of power, and operates at very low frequency. These impressive characteristics motivate us to investigate what kind of neuro-inspired principles could be useful for designing better Ising machines. The idea of this research project, and of the future collaboration, is to try to alleviate the limitations that are intrinsic to the realization of an optical coherent Ising machine, shown in the top panel here, by designing a large-scale simulator in silicon, shown in the bottom panel here, that can be used for testing better organization principles for the CIM. In this talk, I will discuss three neuro-inspired principles: first, the asymmetry of connections, and the neural dynamics that are often chaotic because of that asymmetry; second, the local structure of connectivity, since cortical networks are not composed of the repetition of always the same types of neurons in the same arrangement, but rather of a local structure that is repeated, and here is the schematic of the microcolumn in the cortex; and lastly, the hierarchical organization of connectivity, which is organized in a tree structure in the brain, and here you see a representation of the hierarchical organization of the monkey cerebral cortex. So how can these principles be used to improve the performance of Ising machines? This is tested using in silico simulation. First, about the two principles of asymmetry and chaotic dynamics. We know the classical approximation of the coherent Ising machine, which is analogous to rate-based neural networks; in the case of the Ising machine, this classical approximation can be obtained using, for example, the truncated Wigner approximation. The dynamics of both systems can then be described by the following ordinary differential equations, in which, in the case of the CIM, the x_i represent the in-phase component of one DOPO, the function f represents the nonlinear optical part, the degenerate optical parametric amplification, and the sum over J_ij x_j represents the coupling, which is done, in the case of the measurement-feedback CIM, using homodyne detection and an FPGA and then injection of the computed coupling term. And these dynamics, in both the case of the CIM and that of neural networks, can be written as gradient descent of a potential function V, written here, and this potential function includes the Ising Hamiltonian.
So this is why it's natural to use this type of dynamics to solve the Ising problem, in which the omega_ij are the Ising couplings and h is the external field of the Ising Hamiltonian. Note that this potential function can only be defined if the omega_ij are symmetric. The well-known problem of this approach is that the potential function V that we obtain is very non-convex at low temperature, and one strategy is to gradually deform this landscape using an annealing process; but there is, unfortunately, no theorem that guarantees convergence to the global minimum of the Ising Hamiltonian using this approach. And so this is why we propose to introduce a modified structure of the system, in which one analog spin, or one DOPO, is replaced by a pair of one analog spin and one error-correction variable. The addition of this structure introduces asymmetry into the system, which in turn induces chaotic dynamics: a chaotic search, rather than an annealing process, for the ground state of the Ising Hamiltonian. Within this modified structure, the role of the error variable is to control the amplitude of the analog spins, to force the amplitude of the spins to become equal to a certain target amplitude a. This is done by modulating the strength of the Ising couplings: you see the error variable e_i multiplying the Ising coupling here in the dynamics of each DOPO, and then the whole dynamics is described by these coupled equations. Because the e_i do not necessarily take the same value for the different i, this introduces asymmetry into the system, which in turn creates chaotic dynamics, which I show here for solving a certain size of SK problem, in which the x_i are shown here, the e_i are shown here, and the value of the Ising energy is shown in the bottom plots. You see this chaotic search that visits various local minima of the Hamiltonian and eventually finds the global minimum. It can be shown that this modulation of the target amplitude can be used to destabilize all the local minima of the Ising Hamiltonian, so that the dynamics do not get stuck in any of them; moreover, the other types of attractors that can eventually appear, such as limit-cycle attractors or chaotic attractors, can also be destabilized using the modulation of the target amplitude. And so we have proposed in the past two different modulations of the target amplitude: the first one is a modulation that ensures that a certain entropy-production rate of the system becomes positive, and this forbids the creation of any nontrivial attractors; but in this work I will talk about another, restricted modulation, which is given here, that works as well as the first modulation but is easier to implement on an FPGA.
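My reading of the coupled amplitude-and-error-variable dynamics just described can be sketched as follows (a hedged toy integration: the functional form follows the published error-correction scheme as I understand it, and p, beta, the target amplitude a, and the step count are illustrative values of my choosing, not the talk's):

```python
import numpy as np

def cim_with_error_correction(J, p=1.2, a=1.0, beta=0.3,
                              dt=0.01, steps=20000, seed=0):
    # dx_i/dt = (p - 1 - x_i^2) x_i + e_i * sum_j J_ij x_j
    # de_i/dt = -beta * e_i * (x_i^2 - a)
    # The e_i modulate the coupling so every |x_i| is pushed toward the
    # target amplitude sqrt(a); the induced asymmetry destabilizes local
    # minima and turns plain gradient descent into a chaotic search.
    rng = np.random.default_rng(seed)
    n = J.shape[0]
    x = 1e-3 * rng.standard_normal(n)
    e = np.ones(n)
    best_s, best_E = None, np.inf
    for _ in range(steps):
        x += dt * ((p - 1.0 - x**2) * x + e * (J @ x))
        e += dt * (-beta * e * (x**2 - a))
        s = np.sign(x)
        E = -0.5 * s @ J @ s
        if E < best_E:                 # track the best Ising state visited
            best_s, best_E = s, E
    return best_s, best_E
```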
So these coupled equations, representing the simulation of the coherent Ising machine with error correction, can be implemented especially efficiently on an FPGA, and here I show the time that it takes to simulate the system: in red you see the time that it takes to simulate the x_i term, the e_i term, the dot product, and the Ising Hamiltonian, for a system with 500 spins and error variables, equivalent to 500 DOPOs. On the FPGA, the nonlinear dynamics, corresponding to the degenerate optical parametric amplification, the OPA of the CIM, can be computed in only 13 clock cycles at 300 MHz, which corresponds to about 0.1 microseconds. And this is to be compared to what can be achieved in the measurement-feedback CIM, in which, if we want to get 500 time-multiplexed DOPOs with a 1 GHz repetition rate in the optical cavity, then we would require 0.5 microseconds to do this; so the simulation on the FPGA can be at least as fast as a 1-GHz-repetition-rate pulsed-laser CIM. Then the dot product that appears in this differential equation can be computed in 43 clock cycles, that's to say, in about 0.15 microseconds. So, at least for problem sizes that are larger than 500 spins, the dot product clearly becomes the bottleneck, and this can be seen by looking at the scaling of the number of clock cycles it takes to compute either the nonlinear optical part or the dot product with respect to the problem size. If we had an infinite amount of resources on the FPGA to simulate the dynamics, then the nonlinear part could be computed in O(1), and the matrix-vector product could be done in O(log N), because computing the dot product involves summing all the terms in the product, which is done on an FPGA by an adder tree, whose height scales logarithmically with the size of the system. But that is in the case where we have an infinite amount of resources on the FPGA; when dealing with larger problems of more than about 100 spins, usually we need to decompose the matrix into smaller blocks, with a block size nu, and then the scaling becomes linear in N over nu for the nonlinear part, and (N over nu) squared for the dot product. Typically, for a low-end FPGA chip, the block size nu of this matrix is about 100. So clearly we want to make nu as large as possible, in order to maintain the log(nu) scaling of the number of clock cycles needed to compute the product, rather than the (N over nu) squared scaling that occurs when we decompose the matrix into smaller blocks. But the difficulty with having larger blocks is that a very large adder tree introduces large fan-in and fan-out and long-distance data paths within the FPGA. So the solution, to get higher performance for a simulator of the coherent Ising machine, is to get rid of this bottleneck for the dot product by increasing the size of the adder tree, and this can be done by organizing the electrical components within the FPGA hierarchically, as shown here in the right panel, in order to minimize the fan-in and fan-out of the system and to minimize the long-distance data paths in the FPGA. I'm not going into the details of how this is implemented on the FPGA, but this should give you an idea of why the hierarchical organization of the system becomes extremely important for getting good performance when simulating Ising machines; the trade-off is roughed out in the sketch below.
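The block-size trade-off can be roughed out with a toy cycle-count model (my own back-of-the-envelope formula, ignoring pipelining and I/O, so the absolute numbers mean nothing; only the trend matters):

```python
import math

def mvm_cycles(n, block):
    # An N x N matrix-vector product processed as block x block tiles:
    # (N / block)^2 tiles, each reduced by an adder tree of depth
    # log2(block); bigger blocks mean quadratically fewer tiles at the
    # price of a deeper tree with harder fan-in/fan-out and routing.
    tiles = math.ceil(n / block) ** 2
    return tiles * (1 + math.ceil(math.log2(block)))

for block in (64, 128, 512, 2048):
    print(f"N=4096, block={block:5d}: ~{mvm_cycles(4096, block)} cycles")
```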
So, instead of getting into the details of the FPGA implementation, I would like to give some benchmark results for this simulator, which was used as a proof of concept for this idea and which can be found in this arXiv paper here. Here I show results for solving SK problems: fully connected, random, plus-or-minus-one spin-glass problems. We use as a metric the number of matrix-vector products, since that is the bottleneck of the computation, needed to get the optimal solution of these SK problems with 99% success probability, plotted against the problem size. In red here is the proposed FPGA implementation; in blue is the number of matrix-vector products that are necessary for the CIM without error correction to solve these SK problems; and in green, noisy mean-field annealing, which is an algorithm whose behavior is similar to the coherent Ising machine. And clearly you see that the number of matrix-vector products necessary to solve these problems scales with a better exponent for the proposed scheme than for the other approaches. So that's an interesting feature of the system, and next we can look at the real time-to-solution to solve these SK instances. On this axis is the time-to-solution in seconds to find the ground state of SK instances with 99% success probability, for different state-of-the-art hardware. In red is the FPGA implementation proposed in this paper, and the other curves represent breakout local search, in orange, and simulated annealing, in purple, for example. And you see that the scaling of this proposed simulator is rather good, and that for larger problem sizes we can get orders of magnitude faster than the state-of-the-art approaches. Moreover, the relatively good scaling of the time-to-solution with respect to problem size indicates that the FPGA implementation would be faster than other recently proposed Ising machines, such as the Hopfield neural network implemented on memristor crossbars, which is very fast for small problem sizes, in blue here, but whose scaling is not good, and the same for the restricted Boltzmann machine implemented on FPGA proposed by a group in Berkeley recently, which again is very fast for small problem sizes but whose scaling is bad, so that it is worse than the proposed approach. So we can expect that for problem sizes larger than 1000 spins, the proposed approach would be the faster one. Let me jump to this other slide, and another confirmation that the scheme scales well: we can find maximum-cut values on the benchmark G-set that are better than the values that have been previously found by any other algorithms, so they are the best-known cut values, to the best of our knowledge, as shown in the table of this paper here. In particular, for instances 14 and 15 of this G-set, we can find better cut values than previously known, and we can find these cut values 100 times faster than the state-of-the-art algorithm used to do this. Note that getting these good results on the G-set does not require any particularly hard tuning of the parameters: the tuning used here is very simple, and just depends on the degree of connectivity within each graph. So these good results on the G-set indicate that the proposed approach would be good not only at solving SK problems, but at all types of graph Ising problems, such as the Max-Cut problems encountered in many communities.
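For anyone wanting to reproduce such G-set style scores, the standard correspondence between cut values and Ising energies is the following identity (a well-known reformulation, not code from the paper):

```python
import numpy as np

def cut_value(W, s):
    # For spins s_i in {-1, +1} and a symmetric weight matrix W with zero
    # diagonal, cut(s) = (1/4) * sum_ij W_ij * (1 - s_i s_j), so maximizing
    # the cut is the same as minimizing the Ising energy sum_ij W_ij s_i s_j.
    return 0.25 * (W.sum() - s @ W @ s)

# usage: W parsed from a G-set file, s a +/-1 assignment returned by any
# Ising heuristic; compare cut_value(W, s) against the best-known value.
```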
So, given that the performance of the design depends on the height of this adder tree, we can try to maximize the height of this adder tree on a large FPGA, carefully routing the components within the FPGA, and we can draw some projections of what type of performance we can achieve in the near future based on the implementation that we are currently working on. Here you see the projection for the time-to-solution, with 99% success probability, for solving these SK problems with respect to the problem size, compared to different published Ising machines, in particular the Digital Annealer, which is shown by the green line without dots. And we show two different hypotheses for these projections: either that the time-to-solution scales as an exponential of N, or that the time-to-solution scales as an exponential of the square root of N. It seems, according to the data, that the time-to-solution scales more like an exponential of the square root of N, and these projections show that we could probably solve SK problems of size 2000 spins, finding the real ground state of the problem with 99% success probability, in about 10 seconds, which is much faster than all the other proposed approaches. So, some of the future plans for this coherent Ising machine simulator. The first thing is that we would like to make the simulation closer to the real DOPO optical system, in particular, as a first step, to get closer to the system of the measurement-feedback CIM. To do this, what is simulatable on the FPGA is the quantum truncated-Gaussian model that is described in this paper, proposed by people in the NTT group. The idea of this model is that, instead of having the very simple ODEs I have shown previously, it includes paired ODEs that take into account not only the mean of the in-phase component but also its variance, so that we can take into account more quantum effects of the DOPO, such as the squeezing. And then we plan to make the simulator open access, for the members to run their instances on the system. There will be a first version in September that will be just based on simple command-line access to the simulator, and which will have just the classical approximation of the system, with binary weights and no Zeeman term. But then we will propose a second version that will extend the current Ising machine to a rack of FPGAs, in which we will add the more refined models, the truncated Wigner and truncated Gaussian models I just talked about, and in which real-valued weights for the Ising problems and the Zeeman term will be supported. We will announce later when this becomes available. >> I come from the University of Notre Dame, from the physics department, and I'd like to thank the organizers for their kind invitation to participate in this very interesting and promising workshop. I'd also like to say that I look forward to collaborations with the PHI lab, and with Yoshi and collaborators, on the topics of this workshop. So today I'll briefly talk about our attempt to understand the fundamental limits of analog continuous-time computing, at least from the point of view of Boolean satisfiability problem solving using ordinary differential equations. But I think the issues that we raise on this occasion actually apply to other analog approaches as well, and to other problems as well.
I think everyone here knows what Boolean satisfiability problems are: you have N Boolean variables and M clauses, each a disjunction of literals, where a literal is a variable or its negation, and the goal is to find an assignment to the variables such that all clauses are true. This is a decision-type problem from the NP class, which means you can check in polynomial time the satisfiability of any assignment, and 3-SAT, that is, K-SAT with K of three or larger, is NP-complete, which means that an efficient 3-SAT solver implies an efficient solver for all the problems in the NP class, because all the problems in the NP class can be reduced in polynomial time to 3-SAT. As a matter of fact, you can reduce the NP-complete problems into each other: you can go from 3-SAT to set packing, or to maximum independent set, which is the same as set packing in graph-theoretic terms, or to the decision version of the Ising spin-glass problem. This is useful when you're comparing different approaches that work on different kinds of problems. When not all the clauses can be satisfied, you're looking at the optimization version of SAT, called MAX-SAT, and the goal here is to find the assignment that satisfies the maximum number of clauses; this is from the NP-hard class. In terms of applications: if we had an efficient SAT solver, or NP-complete-problem solver, it would literally, positively influence thousands of problems and applications in industry and in science. I'm not going to read this list, but it of course gives a strong motivation to work on this kind of problem. Now, our approach to SAT solving involves embedding the problem in a continuous space, and we use ODEs to do that. So instead of working with zeros and ones, we work with minus ones and plus ones, and we allow the corresponding variables to change continuously between these two bounds. We formulate the problem with the help of a clause matrix: if a clause does not contain a variable or its negation, the corresponding matrix element is zero; if it contains the variable in unnegated form, it is plus one; and if it contains the variable in negated form, it is minus one. And then we use this to formulate these products, called clause violation functions, one for every clause, which vary continuously between zero and one, and which are zero if and only if the clause itself is true. Then, in order to define the dynamics, a dynamics in this N-dimensional hypercube where the search happens, and where the solutions, if they exist, are sitting in some of the corners of this hypercube, we define this energy potential, or landscape function, shown here, in such a way that it is zero if and only if all the K_m are zero, that is, all the clauses are satisfied, keeping these auxiliary variables a_m always positive. And therefore, what we do here is a dynamics that is essentially a gradient descent on this potential energy landscape. If you were to keep all the a_m constant, it would get stuck in some local minimum; however, what we do here is couple it with a dynamics for the a_m, coupling them to the clause violation functions, as shown here. And if you didn't have this a_m here, if you had just the K_m, for example, then you'd essentially have positive feedback, you'd have an increasing variable, but in that case it would still get stuck.
That is better than the constant version, but it would still get stuck; only when you put in this a_m, which makes the dynamics in this variable exponential-like, does it keep searching until it finds a solution. And there is a reason for that which I'm not going to talk about here, but it essentially boils down to performing a gradient descent on a globally time-varying landscape, and this is what works. Now I'm going to talk about the good, the bad, and maybe the ugly. What's good is that it's a hyperbolic dynamical system, which means that if you take any domain in the search space that doesn't have a solution in it, then the number of trajectories in it decays exponentially quickly, and the decay rate is a characteristic invariant of the dynamics itself; in dynamical systems this is called the escape rate. The inverse of that is the timescale on which you find solutions by this dynamical system, and you can see here some sample trajectories that are chaotic, because the system is nonlinear, but it's transient chaos: they're chaotic trajectories, of course, but eventually they converge to the solution. Now, in terms of performance: what we show here, for a bunch of constraint densities, defined by M over N, the ratio between clauses and variables, for random 3-SAT problems, is the monitored wall-clock time as a function of N, and it behaves quite nicely; it behaves polynomially, until you actually reach the SAT-UNSAT transition, where the hardest problems are found. But what's more interesting is if you monitor the continuous time t, the performance in terms of the analog continuous time t, because that seems to be polynomial. And the way we show that is: we consider random 3-SAT for a fixed constraint density, to the right of the threshold, where it's really hard, and we monitor the fraction of problems that we have not been able to solve. We select thousands of problems at that constraint ratio and we solve them with our algorithm, and we monitor the fraction of problems that have not yet been solved by continuous time t. And this, as you see, decays exponentially, with different decay rates for different system sizes, and this plot shows that the decay rate behaves polynomially, actually as a power law, with N. So if you combine these two, you find that the time needed to solve all problems, except maybe a tiny fraction of them, scales polynomially with the problem size: you have polynomial continuous-time complexity. And this is also true for other types of very hard constraint-satisfaction problems, such as exact cover or Ramsey coloring, because you can always transform them into 3-SAT, as we discussed before; and on these problems even algorithms like survey propagation will fail. But this doesn't mean that P equals NP, because, first of all, if you were to implement these equations in a device whose behavior is described by these ODEs, then of course t, the continuous-time variable, becomes a physical wall-clock time, and that would be polynomial scaling; but you have these other variables, the auxiliary variables, which grow in an exponential manner, so if they represent currents or voltages in your realization, there would be an exponential cost altogether.
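A minimal sketch of these continuous-time SAT dynamics, as I read them from the talk (clause matrix C, clause violation functions K_m, gradient descent on V = sum_m a_m K_m^2 with exponentially growing a_m; the instance generator, step size, and stopping rule are my own illustrative choices):

```python
import numpy as np

def ctds_sat_step(C, s, a, dt):
    # C: M x N clause matrix in {-1, 0, +1}; s: N spins in [-1, 1];
    # a: M positive auxiliary variables.
    # K_m = 2^-k_m * prod_i (1 - C_mi s_i) over the literals of clause m.
    k = (C != 0).sum(axis=1)
    terms = np.where(C != 0, 1.0 - C * s, 1.0)
    K = (2.0 ** -k) * terms.prod(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        K_mi = np.where(C != 0, K[:, None] / terms, 0.0)  # product excluding i
    K_mi = np.nan_to_num(K_mi)        # crude guard when a factor is exactly 0
    grad = -2.0 * (a[:, None] * K[:, None] * C * K_mi).sum(axis=0)
    s = np.clip(s - dt * grad, -1.0, 1.0)   # ds/dt = -grad_s V
    a = a * np.exp(dt * K)                  # da_m/dt = a_m K_m, exponential
    return s, a

# usage sketch: a random 3-SAT instance with N variables and M clauses
rng = np.random.default_rng(0)
N, M = 20, 80
C = np.zeros((M, N))
for m in range(M):
    C[m, rng.choice(N, size=3, replace=False)] = rng.choice([-1, 1], size=3)
s, a = rng.uniform(-1, 1, N), np.ones(M)
for _ in range(20000):
    s, a = ctds_sat_step(C, s, a, dt=0.01)
    if ((C * np.sign(s)) > 0).any(axis=1).all():   # every clause satisfied
        break
```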
But this is some kind of trade-off between time and energy: I don't know how to generate time, but I do know how to generate energy, so one could use that trade. But there are other issues as well, especially if you're trying to do this on a digital machine, and other problems appear in physical devices too, as we discuss later. If you implement this on a GPU, you can get an order or two of magnitude of speedup, and you can also modify this to solve MAX-SAT problems quite efficiently; you are competitive with the best heuristic solvers, on problems from the 2016 MAX-SAT competition. So this definitely seems like a good approach, but there are of course interesting limitations. I would say interesting, because they kind of make you think about what it all means, and about how you can exploit these observations to better understand analog continuous-time complexity. If you monitor the number of discrete steps taken by the Runge-Kutta integrator when you solve this on a digital machine (you're using some kind of integrator, and you're using the same approach, but now you measure the number of problems you haven't solved within a given number of discrete steps taken by the integrator), you find that you have exponential discrete-time complexity, and of course this is a problem. And if you look closely at what happens, even though the analog mathematical trajectory, that's the curve here, is followed very closely by the integrator in discrete time, to something like third- or fourth-digit precision, the step size fluctuates like crazy. So it really is as if the integration freezes out, and this is because of the phenomenon of stiffness, which I'll talk a little bit more about later. You know, it might look like an integration issue on digital machines, one that you could improve, and you could definitely improve it, but actually the issue is bigger than that; it's deeper than that, because on a digital machine there is no time-energy conversion. The auxiliary variables are efficiently represented on a digital machine, so there's no exponentially fluctuating current or voltage in your computer when you do this. So if P is not equal to NP, then the exponential time complexity, or exponential cost complexity, has to hit you somewhere. One would be tempted to think that maybe this wouldn't be an issue in an analog device, and to some extent that's true: analog devices can be orders of magnitude faster, but they also suffer from their own problems, and this affects those solvers as well. Indeed, if you look at other systems, like the measurement-feedback Ising machine or oscillator networks, they all hinge on some kind of ability to control your variables with arbitrarily high precision. In oscillator networks you want to read out phases or frequencies precisely; in the case of CIMs you require identical pulses, which are hard to keep identical, since they fluctuate and shift away from one another, and if you could control that, of course, you could control the performance. So one can actually ask whether or not this is a universal bottleneck, and it seems so, as I will argue next. We can recall a fundamental result by Schönhage from 1978,
who showed, with a purely computer-science proof, that if you are able to compute the addition, multiplication, and division of real variables with infinite precision, then you could solve NP-complete problems in polynomial time. He doesn't actually propose a solver; he just showed mathematically that this would be the case. Now, of course, in the real world you have only finite precision. So the next question is: how does that affect the computation of these problems? This is what we're after. Loss of precision means information loss, or entropy production, so what we're really looking at is the relationship between the hardness and the cost of computing a problem. And according to Schönhage, there is this left branch, which in principle could be polynomial time; but the question is whether or not this is achievable. It is probably not achievable; something more likely is what's on the right-hand side: there is always going to be some information loss, some entropy generation, that could keep you away from polynomial time. So this is what we would like to understand, and this information loss, I will argue, originates not just in noise, as in any physical system, but is also of algorithmic nature. Now, Schönhage's result is purely theoretical; no actual solver is proposed there. So we can ask, just theoretically, out of curiosity: could there in principle be such solvers? If you look mathematically and precisely at what our solver does, would it have the right properties? I argue yes: I don't have a mathematical proof, but I have some arguments that that would be the case. And this is the case for our deterministic continuous-time solver: if you could calculate its trajectory losslessly, then it would solve NP-complete problems in polynomial continuous time. Now, as a matter of fact, this is a bit more subtle, because time in ODEs can be rescaled however you want; so what one really has to do is measure the length of the trajectory, which is an invariant of the dynamical system, a property of the dynamical system and not of its parametrization. And we did that. My student did that first, improving on the stiffness of the integration, using implicit solvers and some smart tricks, such that you actually stay closer to the true trajectory. Using the same approach, monitoring what fraction of problems you can solve, but now against the length of the trajectory, you find that the length scales polynomially with the problem size: we have polynomial-length complexity. That means that our solver is both polynomial-length and, as defined, also polynomial-time as an analog solver. But if you look at it as a discrete algorithm, if you measure the discrete steps on a digital machine, it is an exponential solver, and the reason is stiffness: every integrator has to truncate, digitizing truncates the equations, and what it has to do is keep the integration within the so-called stability region for that scheme; you have to keep the product of the eigenvalues of the Jacobian and the step size within this bounded region. If you use explicit methods, you want to stay within this region.
But what happens is that some of the eigenvalues grow fast for stiff problems, and then you are forced to reduce the step size so the product stays in this bounded domain, which means you are forced to take smaller and smaller time steps. You are freezing out the integration, and I showed you that this is the case. Now, you can move to implicit solvers, where the stability domain is actually on the outside. But what happens in this case is that some of the eigenvalues of the Jacobian, also for stiff systems, start to move toward zero. As they move toward zero they enter the instability region, so your solver tries to keep them out and increases the step size; but if you increase the step size you increase the truncation errors, so you get randomized in this large search space. So it's really not going to work out either. Now, one can introduce a theory, or a language, to discuss computational complexity using the language of dynamical systems theory. I don't have time to go into this, but for hard problems you have this object, a chaotic saddle, sitting in the middle of the search space somewhere, and that dictates how the dynamics happens; invariant properties of the dynamics of that saddle are what dictate performance and many other things. A new, important measure that we find helpful in describing this complexity is the so-called Kolmogorov, or metric, entropy. Intuitively, what this describes is the rate at which the uncertainty contained in the insignificant digits of a trajectory flows towards the significant ones, as you lose information because errors grow into larger errors at an exponential rate, since you have positive Lyapunov exponents. But this is an invariant property: it is a property of the system, not of how you compute it, and it is really the intrinsic rate at which a chaotic dynamical system loses accuracy. As I said, in such a high-dimensional system you have both positive and negative Lyapunov exponents, in total as many as the dimension of the space, with u the number of unstable-manifold directions and s the number of stable-manifold directions. And there is an interesting and, I think, important equality, called the Pesin equality, that connects the information-theoretic aspect, the rate of information loss, with the geometric rate at which trajectories separate, minus kappa, the escape rate that I already talked about. Now one can actually prove simple theorems, like a back-of-the-envelope calculation. The idea here is that you know the rate at which closely started trajectories separate from one another, so you can say: that is fine, as long as my trajectory finds the solution before the trajectories separate too quickly. In that case I can hope that if I start several closely started trajectories from some region of the phase space, they often all go to the same solution; and this upper bound, this limit, really shows that it has to be an exponentially small number. It depends on the N-dependence of the exponents here, which combine the information-loss rate and the solution-time performance.
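In symbols, and with the caveat that the notation below is my reconstruction of what the slides presumably showed rather than a transcription of them, the two relations invoked here are a Pesin-type identity for open chaotic systems and the back-of-the-envelope bound on how closely trajectories must start:

```latex
% Pesin-type identity for an open (leaky) chaotic system: the metric
% (Kolmogorov-Sinai) entropy equals the summed positive Lyapunov
% exponents minus the escape rate \kappa.
h_{\mathrm{KS}} \;=\; \sum_{i:\,\lambda_i > 0} \lambda_i \;-\; \kappa

% Back-of-the-envelope bound: two trajectories a distance \delta_0 apart
% stay within a tolerance \delta_{\mathrm{tol}} up to the solution time
% t_{\mathrm{sol}} only if they start exponentially close:
\delta_0 \;\lesssim\; \delta_{\mathrm{tol}}\, e^{-\lambda_{\max}\, t_{\mathrm{sol}}}
```

If the exponents grow with the number of variables N, the admissible starting separation shrinks exponentially in N, which is the point made next.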
So if this exponent has a large N-dependence, or even a linear N-dependence, then you really have to start trajectories exponentially close to one another in order to end up at the same solution. This is the direction we are going in, and this formulation is applicable to all deterministic dynamical systems. And I think we can expand this further, because there is a way of getting the expression for the escape rate in terms of N, the number of variables, from cycle expansions, which I don't have time to talk about; it's a program one can try to pursue. And that's it. The conclusions, I think, are self-explanatory. I think there is a lot of future in analog, continuous-time computing. These systems can be more efficient, by orders of magnitude, than digital ones in solving NP-hard problems because, first of all, many of these systems avoid the von Neumann bottleneck, there is parallelism involved, and you also have a larger spectrum of continuous-time dynamical algorithms than of discrete ones. But we also have to be mindful of what the possibilities are and what the limits are, and one very important open question is: what are these limits? Is there some kind of no-go theorem that tells you that you can never perform better than this limit or that limit? I think that's the exciting part: to derive these limits.

Published Date : Sep 27 2020


Breaking Down Your Data: Grant Gibson and Janet George


 

From the Cube Studios in Palo Alto and Boston, it's theCUBE, covering Empowering the Autonomous Enterprise, brought to you by Oracle Consulting. Welcome back, everybody, to this special digital event coverage: theCUBE is looking into the rebirth of Oracle Consulting. Janet George is here; she's Group VP, Autonomous, for Advanced Analytics with Machine Learning and Artificial Intelligence at Oracle, and she's joined by Grant Gibson, a Group VP of Growth and Strategy at Oracle. Folks, welcome to theCUBE, thanks so much for coming on. Thank you. Thank you.

Grant, I want to start with you, because you've got strategy in your title. Let's start big picture: what is the strategy with Oracle, specifically as it relates to autonomous, and also to consulting? Sure. Oracle has a deep legacy of strength in data, and over the company's successful history it has evolved what that is, with steps along the way. If you look at the modern enterprise, the Oracle client, I think there's no denying that we've entered the age of AI; everyone knows that artificial intelligence and machine learning are a key to their success in the business marketplace going forward. And while it's generally acknowledged that it's a transformative technology, and people know that they need to take advantage of it, it's the how that's really tricky, in that most enterprises, in order to really get an enterprise-level ROI on an AI investment, need to engage in projects of significant scope. Going from realizing there's an opportunity, or realizing there's a threat, to mobilizing yourself to capitalize on it is a daunting task for an enterprise, certainly for anybody that's got any sort of legacy of success, with built-in processes, built-in systems, built-in skill sets; making that leap to be an autonomous enterprise is challenging for companies to wrap their heads around. So as part of the rebirth of Oracle Consulting we've developed a practice around how to manage both the technology needs for that transformation and the human needs, as well as the data science needs of it.

There are about five or six things in there that I want to follow up on, so this is going to be a good conversation. Janet, ever since I've been in the industry we've been talking about AI in sort of start-stop, start-stop; we had the AI winter, and now it seems to be here. It almost feels like the technology never lived up to its promise: you didn't have the horsepower, the compute power, enough data, maybe. So here we are today, and it feels like we are entering a new era. Why is that, and how will the technology perform this time? For AI to perform, it's very reliant on the data, and we entered the age of AI without having the right data for AI. You can imagine: we just launched into AI without our data being ready to be training sets for AI. We started with BI data, or data that was already historically transformed, formatted, with logical structures and physical structures. This data was sort of trapped in many different tools, and then suddenly AI comes along and we say: take this data, our historical data, which we haven't tested for labels or learning capability; we just thrust the data at AI. And that's why we saw the initial wave of AI sort of failing: the data was not AI-ready for that generation of AI. Part of the leap that clients are finding success with now is getting at the right data types: you're moving from the zeros and ones of structured data to
image and language, written language, spoken language. You're capturing different data sets in ways that prior tools never could, and so the classifications that come out of it, the insights that come out of it, the business process transformation that comes out of it, are different than what we would have understood under the structured data format. So I think it's that combination of really being able to push massive amounts of data through a cloud product, to process it at scale, that takes it to the next plateau.

For sure, the language that we use today, I feel, is going to change, and you just started to touch on some of it: sensing, our senses, the visualization and the auditory. It's sort of this new experience that customers are seeing, with a lot of this machine intelligence behind it. I call it the autonomous enterprise, the journey to be the autonomous enterprise. When you're on this journey to be the autonomous enterprise, you really need the platform that can help you. Cloud is that platform which can help you get onto the autonomous journey, but the autonomous journey does not end with the cloud, nor does it end with the data lake; these are just infrastructures, basic necessities for being on that journey. At the end it's about how you train and scale: at a very large scale, training needs to happen on this platform for AI to be successful. And if you are an autonomous enterprise, then you have really figured out how to tap into AI and machine learning in a way that nobody else has, to derive business value, if you will. So you've got the platform, you've got the data, and now you're actually tapping into the autonomous components, AI and machine learning, to derive business intelligence and business value.

So I want to get into a little bit of Oracle's role, but to do that I want to talk a little more about the industry. If you think about the way the industry seems to be restructuring around data: historically, industries had their own stack or value chain, and if you were in the finance industry you were there for life. When you think about banking, for example, a highly regulated industry, or think about agriculture, these are highly regulated industries, and it was very difficult to disrupt them. But now you look at an Amazon, right? What does an Amazon, or any other tech giant like Apple, have? They have incredible amounts of data. They understand how people use, or want to do, banking, and so they've come out with Apple Cash, or Amazon Pay, and these things are starting to eat into the market. You would never have thought an Amazon could be competition to your banking industry, just because of the regulations, but they are not hindered by the regulations, because they're starting at a different level, and so they become an instant threat and an instant disruptor to these highly regulated industries. That's what data does. When you use data as the DNA for your business, and you are sort of born in data, or you've figured out how to be autonomous, if you will, and capture value from that data in a very significant manner, then you can get into industries that are not traditionally your own: the food industry, the cloud industry, the book industry, different industries. That's what I see happening with the tech giants.

So, Grant, this is a really interesting
point that Janet is making. You mentioned, you started off with, a couple of industries that are highly regulated and harder to disrupt. Music got disrupted, publishing got disrupted, but you've got these regulated businesses, defense, automotive, that actually haven't been truly disrupted yet; Tesla maybe is a harbinger. So you've got this spectrum of disruption, but is anybody safe from disruption? I don't think anyone's ever safe from it. It's change and evolution, whether it's swapping horseshoes for cars, or TV for movies, or Netflix, or any sort of evolution of a business; I wouldn't coast on any of them. And to your earlier question around the value that we can help bring to Oracle customers: we have a rich stack of applications, and I find that the space between the applications, the data that spans more than one of them, is a ripe playground for innovation, where the data already exists inside a company but is trapped, from both a technology and a business perspective. That's where I think really any company can take advantage of knowing its data better, and changing itself to take advantage of what's already there.

Powerful point. People always throw out the bromide that data is the new oil, and we've said no, data is far more valuable, because you can use it in a lot of different places, whereas oil can be used once and has to follow the laws of scarcity; data, you can unlock. A lot of the incumbents have built a business around, whatever, a factory, or process and people, while the trillion-dollar companies, and you know who I'm talking about, have become trillion-dollar companies because data is at their core; they're data companies. So it seems like a big challenge for your incumbent customers and clients: to put data at the core, and to be able to break down those silos. How do they do that? Breaking down silos is really super critical for any business. It used to be okay to operate in a silo. For example, you would think that, oh, I could just do payroll and expense reports, and it wouldn't matter if I got into vendor performance management or purchasing; that could operate as a silo. But any more, we are finding that there are tremendous insights between vendor performance management and expenses; all these things are connected, so you can't afford to have your data sit in silos. Breaking down a silo actually gives the business very good performance insights that it didn't have before. That's one way to go, but another phenomenon happens when you start to break down the silos: you start to recognize what data you don't have to take your business to the next level. That awareness will not happen when you're working with existing data; it comes when you break the silos and you start to figure out that you need to go after a different set of data to get you to new product creation, or new test insights, or new capex avoidance. Then you have to go through the iteration to figure that out.

Which takes us, as you're saying, to this notion of the autonomous enterprise. Help me here, because I get autonomous and automation coming into IT, IT ops; I'm interested in how you see customers taking that beyond the technology organization into the enterprise. I think when AI is a technology problem, the company is at a loss. AI has to be a business problem. AI has to inform the business strategy. AI has to be mainstream in companies. The successful
companies have done so: ninety percent of their investments are going towards data, we know that, and most of it is going towards AI; there's data out there about this. And so we looked at where these ninety percent of companies' investments are going, and who is doing this right and who's not. One of the things we are seeing in the results is that the companies that are doing it right have brought data into their business strategy; they've changed their business model. It's not making a better taxi, but coming up with Uber. It's not saying, okay, I'm going to be the drug manufacturing company, putting drugs out there in the market, versus: I'm going to do connected health. How does data serve the business model of being connected health, rather than being a drug company selling drugs to my customers? It's a completely different way of looking at it. And so now AI is informing drug discovery; AI is not just helping you put more drugs on the market, rather it's helping you come up with new drugs that serve the process of connected health.

There's a lot of discussion in the press about the ethics of AI, and how far we should take AI; how far can we take it from a technology standpoint, there's a long roadmap there, but how far should we take it? Do you feel as though public policy will take care of that? A lot of that narrative is just journalists looking for the negative story; will that sort itself out? How much time do you spend with your customers talking about that? At Oracle, we're building our data science platform with an explicit feature called explainability of the model: how the model came up with the features, what features it picked; we can rearrange the features that the model picked. I think explainability is very important for ordinary people to trust AI, because we can't trust AI; even data scientists can't trust AI, to a large extent. For us to get to the level where we can really trust what AI is picking, in terms of a model, we need to have explainability, and I think a lot of companies right now are starting to make that part of their platform.

Well, we're definitely entering a new era, the age of AI, of the autonomous enterprise. Folks, thanks very much for a great segment, really appreciate it. Yeah, our pleasure, thank you for having us. Thank you. All right, thank you, and keep it right there; we'll be right back with our next guest after this short break. You're watching theCUBE's coverage of the rebirth of Oracle Consulting. Right back. [Music]
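As an aside for technically minded readers: the kind of model explainability Janet describes, surfacing which features a model actually relied on, can be sketched with off-the-shelf tools. The example below uses scikit-learn's permutation importance on a toy model; it is a generic illustration, not Oracle's data science platform or its API.

```python
# Minimal model-explainability sketch: rank features by how much
# shuffling each one degrades the model's score (permutation importance).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Importance = mean drop in accuracy when a feature's column is permuted.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[idx]:<25} {result.importances_mean[idx]:.4f}")
```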

Published Date : May 8 2020


test 4/17/2020


 

I'm going live. I'm live right now. Let me send you this link and see if you can get on here. So this is private... see if I can break this out... this is... [Music] [Music] [Music] [Music] Hello there, coming to you live from the Chuck alley studio here in Mountain View, California, and I'm on YouTube Live. I hope I'm not obscuring anything; I've been out here for two minutes now. Let's see if we're able to do a live private stream and be able to give that link to people. Yeah, okay, yes, you see me? Voice: what's up, what's up, what's up. So this is a private link. I don't know if you can hear me. That's a private link, and you can give the link to whoever you want to see it. Oh, you can't hear me? Hmm. One two, one two, one two three four. Stop that.

Published Date : Apr 17 2020


Amir and Atif, 4/9/2020


 

From the Cube Studios in Palo Alto and Boston, connecting with thought leaders all around the world, this is a Cube Conversation. I'm Stu Miniman, and this is a special Cube Conversation. We've been talking, of course, for many years about the ascent of cloud, and today, in 2020, multi-cloud is a big piece of the discussion. We're really happy to help unveil, coming out of stealth, Alkira, which is helping with the networking challenges when it comes to multi-cloud. I have the two co-founders, who are brothers: Amir, who is the CEO, and Atif, who is the CTO, the Khan brothers. Thank you so much for joining us, and congratulations on the launch of the company. Thank you, Stu, for having us on the show; it's a pleasure to see you again.

All right, so Amir, we've had you on the program at your previous company, which was of course Viptela. The two of you have worked together at, I believe, five companies, successful companies, acquired, the most recent one into Cisco. So Amir, obviously a strong networking theme; your brother the CTO is going to talk to us about the engineering. But give us the story of Alkira: what you've been building and are now ready to unveil to the world. Certainly, Stu. In around the 2018 timeframe we started looking into the next big problem to solve in the industry, one which was not only substantial from the market-size perspective but which, from the customers' perspective, solved a major pain point. When we started looking into cloud customers and talking to our customers, they were struggling with cloud networking even in a single cloud: it was a new environment for them, and they had to understand all the nitty-gritty details of each one of these clouds. When you go to a multi-cloud environment it becomes exponentially complicated, to address not only connectivity but how to deploy services like firewalls and other services, including load balancers, IP address management, et cetera, and remote access. So we started digging deeper into this problem, started working with the customers, took a clean sheet of paper, and came up with a very comprehensive approach to offering a solution delivered as a service. This time we are not shipping any hardware or software; it is just like any other SaaS application: you come to our portal, drag and drop, literally draw out your network, click on provision, and come back after forty minutes or so, and your full global cloud infrastructure is up and running.

Atif, your brother laid out a pretty broad vision there. Any of us from the networking world know there's a lot of complexity there, and therefore it takes a lot of work; making things simple, as a service, is a huge growth area. Bring us inside the engineering challenges that you and the team have been working on to build this solution. Sure, Stu. Both Amir and myself have been working in the networking industry for more than 25 years now, and the way we have worked, and what we have believed in, is that we need to solve customer problems; we never believed in doing a science project. So here also we started working with customers, as we have always done in the past. We understood the customers' pain points, the challenges they were facing, especially in this case in the cloud networking space, the multi-cloud networking space. Based on the user requirements and the customers' use cases, we started building a service, and what we have built here is a
complete network as a service. It's a multi-cloud network as a service, which not only provides connectivity to multiple clouds but also addresses the need to bring in networking services as well as security services, making sure that you have a full policy-based infrastructure on top of it, deep visibility into the clouds as well as into on-premises, end-to-end visibility, monitoring, and troubleshooting, all of it delivered to you as a service. That's what we have been doing here at Alkira.

Excellent. So when we look at multi-cloud: every cloud has some similar things and some different things; they all tend to do things a little bit differently. One of the secret sauces that has been talked about for the last few years is the SD-WAN space, like what you built with Viptela, to help really enable those environments. We've got a diagram here, which I think will help explain a little bit how Alkira plugs into these different environments. Walk us through a little bit of what we're seeing here and what you're actually doing at Alkira. So here we are building a global, unified multi-cloud network. It's consumed as a service; think of consuming it just like you would consume any other SaaS. You come to Alkira's portal, you register, and from there you start building your global multi-cloud unified network with integrated services. What you see here is Alkira's Cloud Services Exchange, which comprises cloud exchange points (CXPs). You can bring these cloud exchange points up anywhere on the globe; you can decide what networking services and security services you need in these cloud exchange points; you can connect to multiple clouds from there; and you can bring your existing on-premises connectivity into the CXPs. All these CXPs have a full mesh of overlay, high-speed, low-latency connectivity among each other, so a full network comes up between these CXPs, and the whole infrastructure scales as a customer scales. It's a horizontally scalable, very highly redundant and resilient infrastructure which we have built.
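As a toy illustration of what "a full mesh of overlay connectivity among exchange points" implies operationally, the number of point-to-point links grows as n(n-1)/2; here is a small sketch. The CXP names and the tunnel abstraction are hypothetical, not Alkira's actual objects or API.

```python
# Toy illustration of full-mesh overlay connectivity between cloud
# exchange points (CXPs): every pair of CXPs gets a tunnel, so the
# link count grows quadratically, n * (n - 1) / 2.
from itertools import combinations

cxps = ["us-west", "us-east", "eu-central", "ap-south"]  # hypothetical CXPs

tunnels = [(a, b) for a, b in combinations(cxps, 2)]
for a, b in tunnels:
    print(f"overlay tunnel: {a} <-> {b}")
print(f"{len(cxps)} CXPs -> {len(tunnels)} tunnels")
```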
All right, so Amir, now that we understand the basics of the technology: you've got some strong investors, including Sequoia and Kleiner Perkins. Give us what is being announced; you're coming out of stealth; where are you with the product, how many employees do you have, and where are you with the discussion of customer adoption? So, Stu, we're obviously bringing this to the market, and we will be announcing it on April 15th. It's available for customers to consume our solution as a service on that day, so they are welcome to reach out to us and we'll be happy to help them; as a matter of fact, just come to our website and register for the service. And yeah, you rightly said that we have a superstar team, not only from the venture capital companies but also the board members representing those companies: Bill Coughran and Mamoon Hamid, from the leading VCs, are on the board of our company, along with myself and Atif.

All right. I'd love to bring up the second slide that we have here. You said it's a service; how do people get started, how do they understand it? Walk us through what they do. The biggest challenge when we started looking into these problems, Stu, was that it was very complicated: you had to piecemeal bring up instances in the cloud and stitch them together, and when you tried to integrate the services, that was a different challenge for the customers. So we wanted to make sure that it was so simple and clean that the customer didn't even have to think about any underlying construct of any of the clouds; they should not have to worry about learning each individual cloud from the networking perspective. So here's our portal: step one is, come to the portal and register; step two is, you start drawing your network based on your intent: what on-prem connectivity you want to bring into this service, what type of services you need, like firewalls, and then what clouds you need to connect. Everything happens seamlessly, from on-prem through services into the cloud and across multiple clouds. It's a seamless service that we have created, with full analytics capabilities and full governance built in.

All right, so Atif, bring us into what this means for customers: how do they manage it? Is it the networking team, is it the cloud architects? What APIs are there? How does this fit into what customers are doing today, and how does it solve some of the challenges that we laid out earlier in the discussion? Yes, Stu. From the customer's perspective, as I said, it's completely delivered as a service. Customers come to our portal, they draw out the network, they select the services, they click on provision, and the whole network comes up within minutes. The main thing here is that, from a customer's point of view, if they are connecting to different clouds, they don't need to understand any of the underlying specifics or underlying constructs of any of the clouds in order to bring up connectivity. What we are doing here is abstracting the clouds: we are building a virtual cloud network. If you compare it with what we did in our previous life, we virtualized the WAN; here, what we are doing is virtualizing the cloud network. Underneath, it doesn't matter which cloud you sit on, which cloud you need to connect to, or which networking services you use, whether cloud-native services, Alkira services, or, since we also support it, third-party services that the customer brings in. It's all offered from our platform, all offered as a service to the customer; again, no expertise is required in any of the underlying networking constructs of any of these clouds.
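To make the "draw your network, click provision" idea concrete, here is a deliberately hypothetical sketch of what an intent-driven request might look like. The structure, field names, and provision() helper are inventions for illustration; Alkira's real portal and APIs are not documented in this conversation.

```python
# Hypothetical intent document for a network-as-a-service request.
# None of these field names come from Alkira; they only illustrate the
# "declare intent, then provision" workflow described above.
network_intent = {
    "cxps": ["us-west", "eu-central"],          # where exchange points come up
    "clouds": ["aws", "azure", "gcp"],          # clouds to connect
    "on_prem": ["datacenter-sjc"],              # existing sites to bring in
    "services": ["firewall", "load-balancer"],  # services inserted in the path
    "policy": {"segment": "prod", "inspect_east_west": True},
}

def provision(intent: dict) -> None:
    """Stand-in for the one-click provision step described in the talk."""
    print(f"provisioning {len(intent['cxps'])} CXPs "
          f"connecting {', '.join(intent['clouds'])} ...")

provision(network_intent)
```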
Give us what we should be looking at from a technology roadmap for Alkira through the rest of 2020. Good question, Stu. As I mentioned earlier, our roadmap is dictated by customer requirements; we prioritize what customers need from us. We have come out with a scalable platform, and we have come out with a marketplace for networking services. In the near term we'll be expanding our marketplace with more services and addressing more use cases. When I talk about use cases, I can give you some examples: you don't just need connectivity into the cloud; you might have different requirements from a throughput or bandwidth perspective, or different services you need to front-end your cloud with. You may have certain applications, such as internet-facing applications with traffic coming in from the internet, inbound to those applications: you might need services like an external load balancer in our services exchange; you might also need a firewall; you might need traffic engineering, or service-chaining capabilities, where you chain traffic through multiple of these services, like a firewall and then the load balancer. So we have built a platform which gives you all those capabilities; going forward we will be adding more services and more use cases to it. We have a long way ahead of us, and we will be putting all our effort into delivering the roadmap as we go.

All right, so Amir, your technical team definitely has their hands full and a robust roadmap to work on. Give us the high level of what we should be looking for from Alkira. For the people out there: multi-cloud and networking tend to get talked about a lot; there are many big companies and some small ones. What will separate Alkira from the rest of the market today, and what should we be looking for in the company's progression through 2020? Yeah, thanks for asking that. From the solution perspective, Atif said that it's so fundamentally important to have a very strong basis, and that's what we have done. We are bringing out a certain number of services, and we will continue to grow that; we'll create a big marketplace; we will continue to improve on which clouds we connect to and how; and we will be building our own services in certain cases as well. Now, building the technology is just one piece of it: we have to go to market as a company that customers can trust, every single department in that company, whether it's sales or how they do business with us; all the business back-end pieces have to be sorted out, and that's what we've been working on. Then the go-to-market partners, that is very, very important, and support is very important. So let me spend a little bit of time on go-to-market strategy. We have been working with the service providers so that we can extend our reach not only to the large customers but also to mid-size customers across the globe, so you will see us in the future announcing major service provider partnerships, as well as working with large systems integrators and integration partners. We are also taking a slightly different approach this time, because it's a service: we are going with telecom master agents, which have been working with the service providers, the cloud providers, and the cable providers as a channel, and they have a huge reach into the customer base. So we have a very comprehensive strategy, not only from the go-to-market and technology perspective but also in how we are going to support our customers and continue to build relationships, to build a lasting company.

Yeah, Amir, a super important point there. Absolutely, we've seen the maturation and change in the service providers, as today they are working with many of the public cloud providers, and they're, as you said, a close touch point and a trusted partner of our customers. All right, so before I let you go: you two are brothers, and everybody in today's day and age is spending even more time with family, but in your situation you've worked together for a long time. What keeps bringing the two of you together, working together? Talk about that bond. So, we're a very close-knit family. We have four brothers and one sister, and obviously Atif and I have been the closest, because we have been working together for the longest. We have worked at at least five different companies together; our families travel together; we have three daughters each;
we live about a five minutes' walk from each other, and we just have this bond where we not only keep the family close but also share a very close-knit circle of friends, which we both hang out with. And we obviously have common interests in sports as well: we play squash and tennis and work out. Atif, do you want to take a stab at it? Yeah, so we've always been very close. In fact, we've been together ever since I can remember; even in college days we were roommates for some time, and our circle of friends is the same as well. So again, we're very close, and we work well together. We complement each other's skills, and it's worked out in the past; hopefully it will work out again, and I look forward to working with him for many, many more years to come.

Yeah, well, Amir and Atif, thank you so much for sharing the coming out of stealth of Alkira. We definitely look forward to watching your progress and seeing how you're helping customers in this multi-cloud world. Thank you for joining us. Thank you so much; thank you for having us. All right, I'm Stu Miniman, and thank you so much for watching this special Cube Conversation on theCUBE. [Music]

Published Date : Apr 9 2020


Breaking Analysis: CIOs Plan on 4% Budget Declines for 2020


 

From the Cube Studios in Palo Alto and Boston, connecting with thought leaders all around the world, this is a Cube Conversation. Hello everybody, and welcome to this week's Wikibon CUBE Insights, powered by ETR. In this Breaking Analysis we want to update you on the latest spending data from ETR. As you know, we've been tracking this weekly. Sagar Kadakia is here; he's the Director of Research at ETR. Sagar, thanks for coming on. Thanks for having me again, Dave; I really appreciate it.

Yes, so let me remind everybody: we entered this year, 2020, with a consensus IT spend forecast of plus 4%. Once coronavirus hit, ETR launched its latest survey in March, and we saw those numbers come down. The first report we made was that it looked flat; last week we reported a slight negative; and today we want to update you on those numbers. So Sagar, before we get into the data, just give us the high level on where you are in terms of your survey. Yeah, no problem. Currently we are forecasting a decline in global IT budgets of about negative 4%. What's happened over the last 10 or 15 days is that you've seen more and more information released that's given organizations more of an understanding of just how severe this epidemic is, and so what we've been able to do on our end is an event-study, or simulation, analysis, kind of what you're seeing here, to really pinpoint the time period where organizations understood the severity of the epidemic, and then try to measure the declines in IT budgets from there.

Great. So guys, bring that slide back up; I want to share with our audience what's happening here. What ETR has done is an event-based analysis, and what you can see is where the survey launched on 3/11. You can see how sentiment declined, literally daily, as the data rolled in. Then you see the US declare a national emergency, you see the federal pandemic-projection plan leak, and obviously New York became a hot spot; and then you can see the stimulus package hit. Sagar, it looks like there's a slight uptick here, but generally speaking it's down. Now, it could be worse, but you were the first to report the offset from work-from-home infrastructure; we'll talk about that a little bit. Talk about this event analysis, what you're seeing here, and how you keyed the analysis to these events. No problem. Let's start with the blue line here, and just so the audience knows, the x-axis is date, and the y-axis is annual growth or decline in IT budgets. We started polling on 3/11, and on that date we started asking the Fortune 100 and Fortune 500 how their budget was going to change based on the impacts of COVID-19, versus their original expectations coming into the year; again, consensus estimates coming into the year were positive four percent. If you track that line all the way through, you get to a decline of about one percent. Now, what's the issue with starting polling on 3/11, or using that blue line? One of the big issues is that a few days later the US declared a national emergency, so more information was released; I think organizations that took the survey in the first two days didn't have a complete picture of what was going on. Then, effectively a week later, you saw federal documents get leaked stating how bad this epidemic was
in terms of the next 18-plus months. So what we did was effectively an event-based analysis, or different simulations. If you take a look at the yellow and red lines to start, what we're doing is effectively saying: okay, let's ignore everyone that took the survey prior to that event; let's take their budgets, in terms of how they indicated change versus their original expectations for 2020, and map that. If you look at the yellow line as an example, that goes to a decline of 2%. Then, once the next shoe dropped, in terms of organizations understanding this is not going to be a few weeks, this is not the common cold or flu, once organizations knew this was going to be an 18-plus-month epidemic, you can see, if we start counting respondents from there, how much more negative it gets. And of course, once NYC became the epicenter, you saw another shoe drop. So those scenarios, or simulations, take us to between a decline of three and four percent. Then, of course, when the stimulus got announced, if we look at that last purple line, what we are seeing is that it looks like it may have bottomed out. We have to continue tracking it, because it's only been a few days since the stimulus was passed, so let's see if the data starts to improve a little bit, or at least stabilize. But I think, from the last three events, the federal plan being leaked, NYC becoming the epicenter, and the stimulus, the market now is fully aware of what's going on, and we're seeing some stabilization in the data in terms of the declines for 2020.

So between the Fed's action and the fiscal stimulus we've seen some optimism, although people are really cautious. Of course, remember, folks, this would be worse were it not for the shift in spend to work-from-home infrastructure: not just collaboration and visualization tools, but other infrastructure around that, network bandwidth, security, desktop virtualization, et cetera. So guys, if you bring up the next chart, I want to set this up. We've been reporting this framework for a while now. What this shows is the sentiment in terms of budget change, and you can see the gray bar is now 35%; it started at 40%, so that's dropped; that's the percentage of CIOs saying no change. The green has held pretty steady at around 20 to 22%, and the red has been shifting; you can see most of the green, i.e., spending more in 2020, is focused in that one-to-ten-percent band. But Sagar, bring us up to date; we're settling in right now at about negative three and a half to four percent. Give us some color on this chart, please. Yeah, no problem. The best way to connect this chart with what we saw earlier is that this is a snapshot, a single day: this is the data that feeds the time-series chart, to help the audience understand what's going on. If we were to look at this exact chart each day since March 11, you would see that midpoint average effectively coming down every day, and that's what's making up the time series.
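For readers who want to see the mechanics, here is a minimal sketch of both calculations discussed: the daily "midpoint average" of indicated budget change, and the event-window re-cuts that produced the yellow, red, and purple lines. The column names and the toy numbers are mine, not ETR's actual schema or data.

```python
# Sketch of the two survey calculations described: a daily running
# average of indicated budget change, and event-based re-cuts that
# keep only responses received on or after each news event.
import pandas as pd

# Toy survey responses: date answered and indicated budget change (%).
responses = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-11", "2020-03-12", "2020-03-16",
                            "2020-03-20", "2020-03-25", "2020-03-30"]),
    "budget_change_pct": [2.0, 0.0, -2.0, -5.0, -6.0, -4.0],
})

# Midpoint-average time series: mean of all responses received so far.
daily = responses.set_index("date")["budget_change_pct"]
print(daily.expanding().mean())

# Event-based re-cuts: recompute the mean using only responses on or
# after each event date (national emergency, NYC epicenter, stimulus).
events = {"emergency": "2020-03-13", "epicenter": "2020-03-20",
          "stimulus": "2020-03-27"}
for name, day in events.items():
    cut = responses[responses["date"] >= day]["budget_change_pct"].mean()
    print(f"{name:>10}: {cut:+.1f}%")
```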
In terms of this chart, Dave, you hit it right on the nail: you're seeing the positivity remain, or stay stable, and again that's the work-from-home infrastructure you mentioned: the collaboration tools, virtualization, support services, networking bandwidth, all that stuff, and more and more security. On the negative side, I think what you're seeing is that, as organizations now understand the severity of the epidemic, and as we discussed a few weeks ago, organizations were anticipating less demand and an uptick in broken supply chains, and now you're starting to see some of that play out. As a result, organizations are getting more and more negative, and that's why the midpoint average keeps declining and why those red bars keep going up: the impacts, based on the data, are now starting to be seen. So let's see if the stimulus stabilizes this data, and we'll continue tracking that over the next few weeks and the next few months.

Okay, so basically we're coming in at negative three and a half to four percent; that's where we are today. We're not going to get detailed into some of the vendors today; we talked a little bit about that last week, so go back to last week's Breaking Analysis for some of that vendor commentary. I want to talk about what happens next. ETR will now go into a two-week, self-imposed quiet period and really start crunching the data. At the end of that quiet period they will release their latest thinking to their private clients in a webcast; after that, we at theCUBE are allowed to share public information, and we're going to drill down into some of the segments that our community is most interested in. But ETR is going quiet now, so Sagar, maybe you can explain that sequence and fill in any holes that I missed. Yeah, no problem. Over the next two weeks: we've collected a tremendous amount of data; we're at over a hundred Fortune 100 organizations and almost three to four hundred Global 2000 organizations, so we're at a point now where it's time to start aggregating the data and really analyzing it, going through this COVID drill-down that we conducted. We also conducted a tremendous study on technology spending intentions, crossing over 350 vendors and dozens of technology sectors, so now it's really time to drill in. One of the biggest takeaways from this COVID drill-down is: if you started polling before 3/23, chances are your forecast is going to come in light. That's one of the things we've learned going into this: we really want to measure the impact starting right around that 3/23 timeframe, because it looks, based on the time-series chart we showed earlier, that that's when the market fully understood the impact of this epidemic. So over the next two weeks, even though we started polling a little early, we really want to focus on the second half of responses, because that's probably going to be more indicative of what's going on. The second thing is: look, if conditions continue to deteriorate, things can get worse, and we may come out of the next two weeks with this data and again have to indicate that the environment has come down, and we may have to make adjustments as we see fit. This whole situation is still so dynamic, so we're going to do our best over the next week and a half to get this data to market, to at least give everyone an idea of how everything stands right now, so that people have a good
benchmark and can then move forward. Yeah, so this is about as close to real time as you can get in this IT spending world. Sagar mentioned some of the numbers, the Global 2000, the Fortune 100, the Fortune 1000, and the N now, just as a reminder, is up over 1,200, I believe, right Sagar, the total that you've collected this month? That's correct, exactly. Every time we've done one of these it's gone up another couple hundred respondents, so yeah, we're at a very comfortable level now. Our sample right now represents five hundred and fifty-five billion dollars in annual IT spend, and global IT spend every year is a little over three trillion, so this is a significant portion of global IT spend, and we feel comfortable at this point going into that quiet period, as you mentioned, and really starting to dig through the results. Now that we've covered the ten-thousand-foot, or macro, layer, so to speak, in terms of where budgets are going, it's really time to start drilling down into the sectors and vendors, because this is not going to be "every vendor is going down" across the board; there are so many different dynamics here. Some vendors are going to do very well because of the work-from-home infrastructure, and some vendors are going to do very poorly, because not only are they on the legacy side, but they're not aligned with this whole work-from-home infrastructure movement. So you're going to see a lot of bifurcation as we get further into this.

That's right, and we're going to dig into all those segments. We're going to look at work-from-home, we're going to look at the traditional stuff, we're going to look at cloud; we're going to drill into specific segments that are of interest to our community. It's a pleasure to have you on here, Sagar; thank you for sharing and giving us access to this data, and stay safe. We will be watching. Go to etr.plus and check out what's happening there; SiliconANGLE.com will obviously cover this, and I publish weekly on Wikibon.com. Again, Sagar, thanks so much for coming on theCUBE. Yeah, no problem; thank you so much, and I'm looking forward to catching up in a few weeks. All right then, thank you for watching, everybody. This is Dave Vellante for theCUBE, with Wikibon CUBE Insights powered by ETR. We'll see you next time. [Music]

Published Date : Apr 2 2020


The Road to Autonomous Database Management: How Domo is Delivering SLAs for Less


 

Hello everybody, and thank you for joining us today at the virtual Vertica Big Data Conference 2020. Today's breakout session is entitled "The Road to Autonomous Database Management: How Domo is Delivering SLAs for Less." My name is Sue LeClair; I'm the Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Ben White, Senior Database Engineer at Domo. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait: just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time; any questions that we aren't able to address, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums to post your questions there after the session; our engineering team is planning to join the forum to keep the conversation going. Also, as a reminder, you can maximize your screen by clicking the double-arrow button in the lower-right corner of the slide. And yes, this virtual session is being recorded and will be available to view on demand this week; we'll send you a notification as soon as it's ready. Now let's get started. Ben, over to you.

Greetings everyone, and welcome to our virtual Vertica Big Data Conference 2020. Had we been in Boston, the song you would have heard playing in the intro would have been "Boogie Nights" by Heatwave. If you've never heard of it, it's a great song. To fully appreciate that song the way I do, you have to believe that I am a genuine database whisperer; then you have to picture me at 3 a.m., on my laptop, tailing a Vertica log, getting myself all psyched up. Now, as cool as they may sound, 3 a.m. boogie nights are not sustainable; they don't scale. In fact, today's discussion is really all about how Domo engineered the end of 3 a.m.
boogie nights. Again, I am Ben White, Senior Database Engineer at Domo, and as we heard, the topic today is "The Road to Autonomous Database Management: How Domo is Delivering SLAs for Less." The title is a mouthful. In retrospect, I probably could have come up with something snazzier, but it is, I think, honest. For me, the most honest word in that title is "road." When I hear that word, it evokes thoughts of the journey and how important it is to just enjoy it. When you truly embrace the journey, often you look up and wonder: how did we get here? Where are we? And of course, what's next? Now, I don't intend to come across as too deep, so I'll submit there's nothing particularly prescient in simply noticing the elephant in the room when it comes to database autonomy. My opinion is, then, merely, and perhaps more accurately, my observation. To offer context: imagine a place where thousands and thousands of users submit millions of ad-hoc queries every hour. Now imagine someone promised all these users that we could deliver BI leverage at cloud scale in record time. I know what many of you must be thinking: who in the world would do such a thing? Of course, that news was well received, and after the cheers from executives and business analysts everywhere, and the chants of "Keep Calm and Query On," finally started to subside, someone eventually turns and asks: that's possible, we can do that, right? Except this is no imaginary place. This is a very real challenge we face at Domo, and through imaginative engineering, Domo continues to redefine what's possible. The beautiful minds at Domo truly embrace the database engineering paradigm that one size does not fit all. That little philosophical nugget is one I picked up while reading the white papers and books of some guy named Stonebraker. So, to understand how I, and by extension Domo, came to truly value analytic database administration, look no further than that philosophy and what embracing it would mean. It meant, really, that while others were engineering skyscrapers, we would endeavor to build data neighborhoods with a diverse topology of database configurations. This is where our journey at Domo really gets under way. Without any purposeful intent to define our destination, not necessarily thinking about database-as-a-service or anything like that, we had planned this ecosystem of clusters capable of efficiently performing varied workloads. We achieved this with custom configurations for node count, resource pools, configuration parameters, et cetera. But it also meant concerning ourselves with the unintended consequences of our ambition: the impact of increased DDL activity on the catalog, system overhead in general, what the management requirements of an ever-evolving infrastructure would be, the multiple points of failure we would be introducing, the advantages, the disadvantages. Those types of discussions and considerations really helped define the basic characteristics of our system. The database itself needed to be trivially redundant, potentially ephemeral, customizable, and above all, scalable; we'll get more into that later. With this knowledge of what we were getting into, automation would have to be an integral part of development. One might even say automation became the first point of interest on our journey. Now, using popular DevOps tools like SaltStack, Terraform and ServiceNow, everything would be automated. It included everything from larger multi-step tasks, like database designs, database cluster creation and reboots, to smaller routine tasks like license updates, moveouts and projection refreshes.
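To make those routine tasks concrete, here is a minimal sketch of what two of them look like as plain Vertica SQL, the kind of calls an automation tool would wrap; the table name sales.fact_orders is hypothetical, not from the talk:

    -- Ask the Tuple Mover to run a moveout, flushing in-memory data to disk storage.
    SELECT DO_TM_TASK('moveout');

    -- Refresh the projections of one table, for example after a design change.
    SELECT REFRESH('sales.fact_orders');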
All of this cool automation certainly made it easier for us to respond to problems within the ecosystem, but these methods alone still left our database administration reactionary, and reacting to an unpredictable stream of slow-query complaints is not a good way to manage a database. In fact, that's exactly how 3 a.m. boogie nights happen, and again, I understand there was a certain appeal to them, but ultimately managing that level of instability is not sustainable. Earlier I mentioned an elephant in the room, which brings us to the second point of interest on our road to autonomy: analytics, more specifically, analytic database administration. Why are analytics so important, not just in this case, but generally speaking? I mean, we have a whole conference set up to discuss it, and Domo itself is self-service analytics. The answer is curiosity. Analytics is the method by which we feed insatiable human curiosity, and that really is the impetus for analytic database administration. Analytics is also the part of the road I like to think of as a bridge, the bridge, if you will, from automation to autonomy. And with that in mind, I say to you, my fellow engineers, developers and administrators, that as conductors of the symphony of data we call analytics, we have proven to be capable producers of analytic capacity. Take pride in that, and rightfully so; the challenge now is to become more conscientious consumers. In some way, shape or form, many of you already employ some level of analytics to inform your decisions. Far too often, though, we are using data that would be categorized as lagging. Perhaps you're monitoring slow queries in the Management Console. Better still, maybe you consult the Workload Analyzer. How about a logging and alerting system like Sumo Logic? If you're lucky, you have Domo, where you monitor and alert on query metrics like this. These are all examples of analytics that help inform our decisions. At Domo, the incorporation of analytics into database administration is very organic; in other words, pretty much company-mandated. As a company that provides BI leverage at cloud scale, it makes sense that we would want to use our own product to be better at the business of Domo. Adoption stretches across the entire company, and everyone uses Domo to deliver insights into the hands of the people that need them, when they need them most. So it should come as no surprise that we have, from the very beginning, used our own product to make informed decisions as they relate to the application back end. In engineering, we call it our internal system, Domo for Domo. Domo for Domo, in its current iteration, uses a rules-based engine with elements of machine learning to identify and eliminate conditions that cause slow query performance. Pulling data from a number of sources, including our own, we could identify all sorts of issues, like global query performance, actual query count, success rate, for instance, as a function of query count, and of course environment timeout errors. This was a foundation, right, this recognition that we should be using analytics to be better conductors of curiosity. These types of real-time alerts were a legitimate step in the right direction for the engineering team, though we saw ourselves in an interesting position. As far as Domo for Domo, we started exploring the dynamics of using the platform to not only monitor and alert, of course, but to also triage and remediate. Just how much autonomy could we give the application? What were the pros and cons of that? Trust is a big part of that equation: trust in the decision-making process, trust that we can mitigate any negative impacts, and trust in the very data itself.
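As a hedged illustration of that kind of lagging, after-the-fact data, a query along these lines against Vertica's query_requests system table surfaces yesterday's slowest statements; the 60-second threshold is an arbitrary example:

    -- Classic reactive monitoring: find yesterday's slowest queries.
    SELECT request, request_duration_ms, start_timestamp
    FROM v_monitor.query_requests
    WHERE request_duration_ms > 60000   -- slower than 60 seconds
      AND start_timestamp > CURRENT_TIMESTAMP - INTERVAL '1 day'
    ORDER BY request_duration_ms DESC
    LIMIT 20;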
Still, much of the data came from systems that interacted directly, and in some cases indirectly, with the database. By its very nature, much of the data was past tense and limited, you know, things that had already happened, without any reference or correlation to the conditions that led to those events. Fortunately, the Vertica platform holds a tremendous amount of information about the transactions it has performed, its configurations, and the characteristics of its objects, like tables, projections, containers, resource pools, et cetera. This treasure trove of metadata is collected in the Vertica system tables and the appropriately named data collector tables. As of version 9.3, there are over 190 tables that make up the system tables, while the data collector is a collection of 215 components. A rich collection can be found in the Vertica system tables. These tables provide a robust, stable set of views that let you monitor information about your system resources, background processes, workload and performance, allowing you to more efficiently profile, diagnose and correlate historical data such as load streams, query profiles, Tuple Mover operations and more. Here you see a simple query to retrieve the names and descriptions of the system tables, and an example of some of the tables you'll find. The system tables are divided into two schemas: the catalog schema contains information about persistent objects, and the monitor schema tracks transient system states. Most of the tables you find there can be grouped into the following areas: system information, system resources, background processes, and workload and performance. The Vertica data collector extends system table functionality by gathering, retaining and aggregating information about your database; the data collector then makes that information available in system tables. A moment ago I showed you how to get a list of the system tables and their descriptions, and here we see how to get that information for the data collector tables. With data from the data collector tables and the system tables, we now have enough to analyze what we would describe as conditional, or leading, data that allows us to be proactive in our system management. This is a big deal for Domo, and particularly Domo for Domo, because from here we took the critical next step, where we analyze this data for conditions we know or suspect lead to poor performance, and then we can suggest the recommended remediation. Really, for the first time, we were using conditional data to be proactive in our database management. We track many of the same conditions that Vertica Support analyzes via scrutinize, like tables with too many projections, or non-partitioned fact tables, which can negatively affect query performance; like Vertica, if the table has a date or timestamp column, we recommend partitioning by the month. We can also track catalog sizes and percentage of total memory, with alert thresholds that trigger remediations. Requests per hour is a very important metric in determining when to trigger our scaling solution, and tracking memory usage over time allows us to adjust resource pool parameters to achieve optimal performance for the workload. Of course, the Workload Analyzer is a great example of analytic database administration. From here, one can easily see the logical next step, where we were able to execute these recommendations manually, or automatically via some configuration parameter.
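The exact queries from the slides aren't reproduced here, but hedged approximations look like the following, together with a partition-by-month remediation of the kind just described; sales.fact_orders and order_ts are hypothetical names:

    -- List the system tables and their descriptions.
    SELECT table_schema, table_name, table_description
    FROM v_catalog.system_tables
    ORDER BY table_schema, table_name;

    -- The equivalent listing for the data collector components.
    SELECT DISTINCT component, description
    FROM v_monitor.data_collector
    ORDER BY component;

    -- A suggested remediation: partition a non-partitioned fact table by month.
    ALTER TABLE sales.fact_orders
    PARTITION BY EXTRACT(YEAR FROM order_ts) * 100 + EXTRACT(MONTH FROM order_ts);

    -- And the Workload Analyzer itself, which returns tuning recommendations as rows.
    SELECT ANALYZE_WORKLOAD('sales.fact_orders');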
Now, when I started preparing for this discussion, this slide made a lot of sense as the logical next iteration for the Workload Analyzer. I left it in because, together with the next slide, it really illustrates how firmly Vertica has its finger on the pulse of the database engineering community. In the 10.0 Management Console, ta-da, we have the updated Workload Analyzer: a column has been added to show tuning commands, and the Management Console allows the user to select and run certain recommendations, currently tuning commands such as analyze statistics. But you can see where this is going. For us, using Domo with our Vertica connector, we were able to pull the metadata from all of our clusters. We constantly analyze that data for any number of known conditions, and we build the recommendations into scripts that we can either execute immediately or save for manual execution at a later time, and as you would expect, those actions are triggered by thresholds that we can set. From the moment Eon Mode was released to beta, our team began working on a serviceable auto-scaling solution. The elastic nature of Eon Mode's separation of storage and compute clearly lent itself to our ecosystem's requirement for scalability. In building our system, we worked hard to overcome many of the obstacles that came with the more rigid architecture of Enterprise Mode, but with the introduction of Eon Mode, we now have a practical way of giving our ecosystem at Domo the architectural elasticity our model requires. Using analytics, we can now scale our environment to match demand. What we've built is a system that scales without adding management overhead or unnecessary cost, all the while maintaining optimal performance. Well, really, this is just our journey up to now, which begs the question: what's next for us? We'll expand the use of Domo for Domo within our own application stack, and maybe more importantly, we'll continue to build logic into the tools we have by bringing machine learning and artificial intelligence to our analysis and decision making. To further illustrate those priorities, we announced support for Amazon SageMaker Autopilot at our Domopalooza conference just a couple of weeks ago. For Vertica, the future must include in-database autonomy, and the enhanced capabilities in the new Management Console are, to me, a clear nod to that future. In fact, with a streamlined and lightweight database design process, all the pieces should be in place for Vertica to deliver autonomous database management itself. We'll see. Well, I would like to thank you for listening, and now, of course, we will have a Q&A session, hopefully a very robust one. Thank you. [Applause]

Published Date : Mar 31 2020

SUMMARY :

conductors of the symphony of data we

SENTIMENT ANALYSIS :

ENTITIES

EntityCategoryConfidence
BostonLOCATION

0.99+

VerticaORGANIZATION

0.99+

thousandsQUANTITY

0.99+

DomoORGANIZATION

0.99+

3 a.m.DATE

0.99+

AmazonORGANIZATION

0.99+

todayDATE

0.99+

first timeQUANTITY

0.98+

this weekDATE

0.97+

over 190 tablesQUANTITY

0.97+

two schemasQUANTITY

0.96+

second pointQUANTITY

0.96+

215 componentsQUANTITY

0.96+

first pointQUANTITY

0.96+

three a.m.DATE

0.96+

Boogie NightsTITLE

0.96+

millions of ad-hoc queriesQUANTITY

0.94+

DomoTITLE

0.93+

Vertica Big Data conference 2020EVENT

0.93+

Ben whitePERSON

0.93+

10QUANTITY

0.91+

thousands of usersQUANTITY

0.9+

one sizeQUANTITY

0.89+

saltstackTITLE

0.88+

4/2DATE

0.86+

a couple of weeks agoDATE

0.84+

DattaORGANIZATION

0.82+

end of 3 a.m.DATE

0.8+

Boogie NightsEVENT

0.78+

double arrowQUANTITY

0.78+

every hourQUANTITY

0.74+

ServiceNowTITLE

0.72+

DevOpsTITLE

0.72+

Database ManagementTITLE

0.69+

su LeClairPERSON

0.68+

many questionsQUANTITY

0.63+

SLATITLE

0.62+

The RoadTITLE

0.58+

Vertica BBCORGANIZATION

0.56+

2020EVENT

0.55+

database managementTITLE

0.52+

Domo DomoTITLE

0.46+

version 9 3OTHER

0.44+

Vertica Database Designer - Today and Tomorrow


 

>> Jeff: Hello everybody and thank you for joining us today for the Virtual VERTICA BDC 2020. Today's breakout session has been titled, "VERTICA Database Designer Today and Tomorrow." I'm Jeff Healey, VERTICA Product Marketing, and I'll be your host for this breakout session. Joining me today is Yuanzhe Bei, Senior Technical Manager from VERTICA Engineering. But before we begin, (clearing throat) I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation. We'll answer as many questions as we're able to during that time; any questions we don't address, we'll do our best to answer offline. Alternatively, visit the VERTICA forums at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button at the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We will send you a notification as soon as it's ready. Now let's get started. Over to you, Yuanzhe. >> Yuanzhe: Thanks Jeff. Hi everyone, my name is Yuanzhe Bei, I'm a Senior Technical Manager at the VERTICA Server RND Group. I run the query optimizer, catalog and disaggregated engine teams. Very glad to be here today, to talk about "VERTICA Database Designer Today and Tomorrow". This presentation will be organized as follows: I will first refresh some knowledge about VERTICA fundamentals, such as tables and projections, which will bring us to the questions, "What is Database Designer?" and "Why do we need this tool?". Then I will take you through a deep dive into Database Designer, or as we call it, DBD, and see how DBD's internals work. After that, I'll show you some exciting DBD improvements we have planned for the 10.0 release, and lastly, I will share with you some of the DBD future roadmap we have planned. As most of you should already know, VERTICA is built on a columnar architecture. That means data is stored column-wise. Here we can see a very simple example of a table with four columns, and as many of you may also know, a table in VERTICA is a virtual concept. It's just a logical representation of data, which means users can write SQL queries that reference the table names and columns, just like in other relational database management systems, but the actual physical storage of data is called a projection. A projection can reference a subset, or all, of the columns of its anchor table, and must be sorted by at least one column. Each table needs at least one superprojection, which references all the columns of the table. If you load data into a table with no projection, an automatic superprojection will be created, which will be arbitrarily sorted by the first couple of columns in the table. As you can imagine, even though such an auto projection can be used to answer any query, the performance is not optimized in most cases. A common practice in VERTICA is to create multiple projections on the same table, containing different sets of columns and sorted in different ways. When a query is sent to the server, the optimizer will pick the projection that can answer the query in the most efficient way.
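As a concrete sketch of that idea (the table and projection names are illustrative, not from the slides), a table and one of its projections might be declared like this:

    -- Anchor table with four columns.
    CREATE TABLE t (a INT, b INT, c INT, d INT);

    -- A projection containing b, d and c, sorted by b then d:
    -- ideal for queries that filter, group or order on b and d.
    CREATE PROJECTION t_b_d AS
    SELECT b, d, c
    FROM t
    ORDER BY b, d;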
For example, here you can see, let's say you have a query that selects columns B, D and C, sorted by B and D. The third projection will be ideal, because the data is already sorted, so you can save the sorting cost while executing the query. Basically, when you choose the design of a projection, you need to consider four things. First and foremost, of course, the sort order. Data already sorted in the right way can benefit quite a lot of queries: Order By, Group By, analytics, Merge Join, predicates and so on. The selected column group is also important, because the projection must contain all the columns referenced by your workload queries. If even one column is missing from the projection, that projection cannot be used for the particular query. In addition, VERTICA is a distributed database, and allows a projection to be segmented based on the hash of a set of columns, which is beneficial if the segmentation matches the join keys or group keys. And finally, the encoding of each column is also part of the design, because data sorted in a different way may completely change the optimal encoding for each column. This example only shows the benefit of the first two, but you can imagine the rest are also important. But even with that, it doesn't sound that hard, right? Well, I hope you've changed your mind already when you see this; at least I did. These machine-generated queries really beat me. It would probably take an experienced DBA hours to figure out which projection can benefit these queries, not even mentioning there could be hundreds of such queries in the regular workload logs in the real world. So what can we do? That's why we need DBD. DBD is a tool integrated in the VERTICA server that can help the DBA perform an analysis of their workload queries, table schema and data, and then automatically figure out the most optimized projection design for their workload. In addition, DBD is also a sophisticated tool that can be customized by the user, by setting a lot of parameters, objectives and so on. And lastly, DBD has access to the optimizer, so DBD knows what kind of attributes the projection needs to have in order for the optimizer to benefit from them. DBD has been there for years, and I'm sure there are plenty of materials available online to show you how DBD can be used in different scenarios, whether to achieve query optimization or load optimization, whether it's a comprehensive design or an incremental design, whether it's dumping a deployment script for manual deployment later, or letting the DBD do the auto deployment for you, and many other options. I'm not planning to talk about those today; instead, I will take the opportunity today to open this black box called DBD, and show you what exactly hides inside. DBD is a complex tool, and I have tried my best to summarize the DBD design process into seven steps: Extract, Permute, Prune, Build, Score, Identify and Encode. What do they mean? Don't worry, I will show you step by step. The first step is Extract: extract interesting columns. In this step, DBD parses the design queries, figures out the operations that can be benefited by the potential projection design, and extracts the corresponding columns as interesting columns. So predicates, Group By, Order By, join conditions and analytics are all interesting columns to the DBD. As you can see with these three simple sample queries, DBD extracts the interesting column sets on the right. Some of these column sets are unordered.
For example, the green one, for Group By a1 and b1: the DBD extracts the interesting column set and puts it in the unordered set, because data sorted either by a1 first or b1 first can benefit this Group By operation. Some of the other sets are ordered, and the best example is here, the Order By clause on a2 and b2: obviously you cannot sort by b2 and then a2. These interesting column sets will then be used as seeds to extend into actual projection sort order candidates. The next step is Permute. Once DBD extracts all the seeds, it will enumerate sort orders using them. And how does DBD do that? Let me start with a very simple example. Here you can see DBD can enumerate two sort orders by extending d1 with the unordered set {a1, b1}, deriving two sort order candidates: (d1, a1, b1) and (d1, b1, a1). These sort orders can benefit queries with a predicate on d1, and also benefit queries with Group By a1, b1 when d1 is constant. With the same idea, DBD will try to extend other seeds with each other and populate more sort order permutations. You can imagine how many there could be: based on how many queries you have in the design, there can be hundreds of sort order candidates. That brings us to the third step, which is Prune. This step limits the candidate sort orders so that the design won't run forever. DBD uses a very simple capping mechanism: it ranks all the candidates by length, and only a certain number of the sort orders with the longest length will be moved forward to the next step. Now we have all the sort order candidates that we want to try, but to know whether a sort order candidate will actually be of benefit, DBD needs to ask the optimizer. So before that happens, this step has to build those projection candidates in the catalog: it generates the projection DDLs around the sort orders and creates the projections in the catalog. These projections won't be loaded with real data, because that takes a lot of time; instead, DBD will copy over the statistics from existing projections to these projection candidates, so that the optimizer can use them. The next step is Score: scoring with the optimizer. Now that the projection candidates are built in the catalog, DBD can send the workload queries to the optimizer to generate a query plan. The optimizer will return the query plan, and DBD will go through it and investigate whether certain benefits are being achieved. The benefits list has been growing over time as the optimizer adds more optimizations. Let's say in this case, because the projection candidate can be sorted by b1 and a1, it is eligible for the Group By Pipelined benefit. Each benefit has a preset score. The overall benefit score across all design queries will be aggregated and then recorded for each projection candidate. We are almost there. Now that we have the total benefit scores for the projection candidates we derived from the workload queries, the job is easy: just pick the sort order with the highest score as the winner. Here we have the winner (d1, b1, a1). Sometimes you need to find more winners, because the chosen winner may only benefit a subset of the workload queries you provided to the DBD. So in order for the rest of the queries to also benefit, you need more projections.
So in this case, DBD will go to the next iteration and, let's say, find another winner, (d1, c1), to benefit the workload queries that cannot be benefited by (d1, b1, a1). The number of iterations, and thus the winners produced, really depends on the design objective the user sets. It can be load-optimized, which means only one superprojection winner will be selected; or query-optimized, where DBD tries to create as many projections as needed to cover most of the workload queries; or a somewhat balanced objective in the middle. The last step is to decide the encoding for each projection column for the projection winners. Because the data is sorted differently, the encoding benefits can be very different from the existing projection, so choosing the right projection encoding design can shrink the disk footprint by a significant factor. It's worth the effort to find the best encoding. DBD picks the encoding based on actually sampling the data and measuring the storage footprint. For example, in this case, the projection winner has three columns, and each column has a few encoding options. DBD will write the sample data in the way this projection is sorted, and then you can see that with different encodings, the disk footprint is different. DBD will then compare the disk footprint of the different options for each column, and pick the best encoding option based on which one has the smallest storage footprint. Nothing magical here, but it just works pretty well. And that's basically how DBD works internally. Of course, I've simplified quite a lot; for example, I didn't mention how DBD handles segmentation, but the idea is similar to analyzing the sort order. I hope this section gave you some basic idea about DBD as of today. So now let's talk about tomorrow, and here comes the exciting part. In version 10.0, we significantly improved the DBD in many ways. In this talk I will highlight four issues in the old DBD and describe how the new 10.0 DBD addresses those issues. The first issue is that the DBD API is too complex. In most situations, what the user really wants is very simple: my queries were slow yesterday; would a new or different projection help speed them up? However, to answer a simple question like this using DBD, the user will very likely have the documentation open on the side, because they have to go through the whole complex flow, from creating a design, to running the design, getting the output, and then deploying the design at the end. And that's not all; for each step, there are several functions the user needs to call in order. Adding these up, the user needs to write quite a long script with dozens of function calls. It's just too complicated, and most of you may find it annoying. So users either manually tune the projections themselves, or simply live with the performance and come back when it gets really slow again, and of course, in most situations, they never come back to use the DBD. In 10.0, VERTICA supports a new simplified API to run DBD easily. There will be just one function, designer_single_run, with one argument: the interval during which you think your queries were slow. In this case, the user complained about it yesterday, so all the user needs to do is specify one day as the argument and run it. The user doesn't need to provide anything else, because the DBD will look up the query history within that time window and automatically populate the design, run the design, export the projection design, and clean up. No user intervention needed.
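To make the contrast concrete, here is a hedged sketch of the old multi-step flow next to the new call; the design name, file paths and table name are illustrative, not from the slides:

    -- The old flow: a long script of function calls (heavily abridged).
    SELECT DESIGNER_CREATE_DESIGN('my_design');
    SELECT DESIGNER_ADD_DESIGN_TABLES('my_design', 'public.t');
    SELECT DESIGNER_ADD_DESIGN_QUERIES('my_design', '/tmp/queries.sql', TRUE);
    SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('my_design',
        '/tmp/design.sql', '/tmp/deploy.sql');
    SELECT DESIGNER_DROP_DESIGN('my_design');

    -- The 10.0 way: one call, one argument, covering the last day of query history.
    SELECT DESIGNER_SINGLE_RUN('1 day');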
No need to have the documentation on the side and carefully write and debug a script; just one function call. That's it. Very simple. So that must be pretty impressive, right? Now here comes another issue. To fully utilize this single-run function, users are encouraged to run DBD on the production cluster. However, VERTICA used to recommend against running a design on a production cluster. One of the reasons is that DBD takes massive locks, both table locks and catalog locks, which will badly interfere with the running workload on a production cluster. As of 10.0, we eliminated all the table and catalog locks from DBD. Yes, we eliminated 100% of them. Simple improvement, clear win. The third issue, which users may not be aware of, is that DBD writes intermediate results into real VERTICA tables. The reason DBD has to do that is that DBD is a background task, so the intermediate results let a user monitor the progress of the DBD in a concurrent session. For a complex design, the intermediate results can be quite massive, and as a result, many ROS files will be created and written to disk, which stresses both the catalog and the disk, and can slow down the design. For Eon Mode, it's even worse, because the tables are shared on communal storage. Writing to a regular table means it has to upload the data to communal storage, which is even more expensive and disruptive. In 10.0, we significantly restructured the intermediate results buffer and made it a shared in-memory data structure. Monitoring queries will directly look up the in-memory data structure through a system table and return the results. No intermediate result files will be written anymore. Another expensive usage of local disk for DBD is encoding design. As I mentioned earlier in the deep dive, to determine which encoding works best for the new projection design, there's no magic way: the DBD needs to actually write the sample data to disk using the different encoding options, and find out which ones have the smallest footprint to pick as the best choice. This written sample data is useless afterward, and it is wiped out right away, so you can imagine this is a huge waste of system resources. In 10.0 we improved this process: instead of writing the differently encoded data to disk and then reading the file size, DBD aggregates the data block sizes on the fly. The data blocks will not be written to disk, so the overall encoding design is more efficient and non-disruptive. Of course, this is just the start. The reason we put a significant amount of resources into improving the DBD in 10.0 is that the VERTICA DBD is an essential component of the out-of-box performance design campaign. To simply illustrate the timeline, we are now on the second step, where we significantly reduced the running overhead of the DBD, so that users will no longer fear running DBD on their production cluster. Please note that as of 10.0, we haven't really started changing how the DBD design algorithm works, so what we discussed in the deep dive today still holds. For the next phase of DBD, we will make the design process smarter, and this will include a better enumeration mechanism, so that the pruning is more intelligent rather than brute force, which will result in better design quality and also faster design. The longer-term goal is for DBD to achieve automation.
What does automation entail? What I really mean is that, instead of having the user decide when to use DBD, once their query is slow, VERTICA ought to detect this event and have DBD run automatically for users, and suggest a better projection design if the existing projection is not good enough. Of course, there is a lot of work that needs to be done before we can fully achieve that automation, but we are working on it. At the end of the day, what the user really wants is a fast database, right? Thank you for listening to my presentation; I hope you found it useful. Now let's get ready for the Q&A.

Published Date : Mar 30 2020

SUMMARY :

at the end of the presentation. and the many of you may also know,

SENTIMENT ANALYSIS :

ENTITIES

EntityCategoryConfidence
JeffPERSON

0.99+

Yuanzhe BeiPERSON

0.99+

Jeff HealeyPERSON

0.99+

100%QUANTITY

0.99+

forum.vertica.comOTHER

0.99+

one dayQUANTITY

0.99+

second stepQUANTITY

0.99+

third stepQUANTITY

0.99+

tomorrowDATE

0.99+

third issueQUANTITY

0.99+

todayDATE

0.99+

FirstQUANTITY

0.99+

yesterdayDATE

0.99+

Each benefitQUANTITY

0.99+

TodayDATE

0.99+

third projectionQUANTITY

0.99+

OneQUANTITY

0.99+

b2OTHER

0.99+

each columnQUANTITY

0.99+

first issueQUANTITY

0.99+

one columnQUANTITY

0.99+

three columnsQUANTITY

0.99+

VERTICA EngineeringORGANIZATION

0.99+

YuanzhePERSON

0.99+

each stepQUANTITY

0.98+

Each tableQUANTITY

0.98+

first stepQUANTITY

0.98+

DBDTITLE

0.98+

DBDORGANIZATION

0.98+

seven stepsQUANTITY

0.98+

DBLORGANIZATION

0.98+

eachQUANTITY

0.98+

one argumentQUANTITY

0.98+

VERTICATITLE

0.98+

each projectionQUANTITY

0.97+

first twoQUANTITY

0.97+

firstQUANTITY

0.97+

this weekDATE

0.97+

hundredsQUANTITY

0.97+

one functionQUANTITY

0.97+

clause a2OTHER

0.97+

oneQUANTITY

0.97+

each per columnsQUANTITY

0.96+

TomorrowDATE

0.96+

bothQUANTITY

0.96+

four issuesQUANTITY

0.95+

VERTICAORGANIZATION

0.95+

b1OTHER

0.95+

single roundQUANTITY

0.94+

4/2DATE

0.94+

first couple of columnsQUANTITY

0.92+

VERTICA Database Designer Today and TomorrowTITLE

0.91+

VerticaORGANIZATION

0.91+

10.0QUANTITY

0.89+

one function callQUANTITY

0.89+

a1OTHER

0.89+

four thingsQUANTITY

0.88+

c1OTHER

0.87+

two sort orderQUANTITY

0.85+

Sizing and Configuring Vertica in Eon Mode for Different Use Cases


 

>> Jeff: Hello everybody, and thank you for joining us today, in the virtual Vertica BDC 2020. Today's breakout session is entitled, "Sizing and Configuring Vertica in Eon Mode for Different Use Cases". I'm Jeff Healey, and I lead Vertica Marketing. I'll be your host for this breakout session. Joining me are Sumeet Keswani and Shirang Kamat, Vertica Product Technology Engineers and key leads on Vertica customer success. But before we begin, I encourage you to submit questions or comments during the virtual session; you don't have to wait, just type your question or comment in the question box below the slides, and click submit. There will be a Q&A session at the end of the presentation, and we will answer as many questions as we're able to during that time; any questions we don't address, we'll do our best to answer off-line. Alternatively, visit the Vertica forums, at forum.vertica.com, and post your question there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double arrow button in the lower-right corner of the slides, and yes, this virtual session is being recorded and will be available to view on-demand this week. We'll send you a notification as soon as it's ready. Now let's get started! Over to you, Shirang. >> Shirang: Thanks Jeff. For today's presentation, we have picked Eon Mode concepts: we are going to go over sizing guidelines for Eon Mode, and some of the use cases that can benefit from using Eon Mode, and at last, we are going to talk about some tips and tricks that can help you configure and manage your cluster. Okay. So, as you know, Vertica has two modes of operation, Eon Mode and Enterprise Mode, and the question you may have is, which mode should I implement? Let's look at what's there in Enterprise Mode. In Enterprise Mode, you have a cluster with general-purpose compute nodes that have locally attached storage. Because of this tight integration of compute and storage, you get fast and reliable performance all the time. Now, the amount of data that you can store in an Enterprise Mode cluster depends on the total disk capacity of the cluster. Enterprise Mode is more suitable for on-premise and cloud deployments. Now, let's look at Eon Mode. To take advantage of cloud economics, Vertica implemented Eon Mode, which is getting very popular among our customers. In Eon Mode, we have compute and storage that are separated by introducing an S3 bucket, or S3-compliant storage. Because of this separation of compute and storage, you get advantages like dynamic scale-out and scale-in, isolation of your workloads, and the ability to load data into your cluster without having to worry about the total disk capacity of your local nodes. It's obvious from this that Eon Mode is suitable for cloud deployments, but some of our customers who want the features of Eon Mode are also deploying it on premise, by introducing S3-compliant storage. Okay? So, let's look at some of the terminology used in Eon Mode. The four things that I want to talk about are: communal storage, a shared, S3-compliant storage, a bucket that is accessible from all the nodes in your cluster; shard, a segment of data stored on the communal storage; subscription, the binding between nodes and shards; and last, the depot.
The depot is a local copy, or local cache, that helps improve query performance. So, a shard is a segment of data stored in communal storage. When you create an Eon Mode cluster, you have to specify the shard count. The shard count decides the maximum number of nodes that will participate in your query. Vertica will also introduce a shard called the replica shard, which holds the data for replicated projections. Subscriptions, as I said before, are the binding between nodes and shards. Each node subscribes to one or more shards, and a shard has at least two nodes that subscribe to it, for K-safety. Subscribing nodes are responsible for writing to and reading from the shard data. A subscriber node also holds up-to-date metadata for the catalog of files that are present in the shard. So, when you connect to a Vertica node, Vertica will automatically assign you a set of nodes and subscriptions that will process your query. There are two important system tables, node_subscriptions and session_subscriptions, that can help you understand this a little bit more. So let's look at what's on the local disk of your Eon Mode cluster. On the local disk, you have the depot. The depot is a local file system cache that can hold a subset, or a copy, of the data in communal storage. Other things that are there are temp storage, which is used for storing data belonging to temporary tables and the data that spills to disk when you are processing queries, and last, the catalog. The catalog here is a persistent copy of the Vertica catalog that is written to disk; the writes happen at every commit, and you only need the persistent copy at node startup. There is also a copy of the Vertica catalog stored in communal storage, for durability. The local copy is synced to the copy in communal storage via a service, at an interval of five minutes. So, let's look at the depot. As I said before, the depot is your file system cache. It helps reduce network traffic and improve the performance of your queries. We make the assumption that when you load data into Vertica, that's the data you may most frequently query, so all data loaded into Vertica first enters the depot, and then, as part of the same transaction, is also synced to communal storage for durability. When you run a query against Vertica, your query is also going to look for the files in the depot first, and if the files are not found, the query will access the files from communal storage. Now, the behavior of whether new files should first enter the depot or skip the depot can be changed by configuration parameters that let you skip the depot when writing. When the files are not found in the depot, we make the assumption that you may need those files for future runs of your query, which means we will fetch them asynchronously into the depot, so that you have those files for future runs. If that's not the behavior you intend, you can change the configuration to tell Vertica not to fetch them when you run your query, and this configuration parameter can be set at the database level, session level and query level, and we are also introducing a user-level parameter where you can change this behavior. Because the depot is going to be limited in size compared to the amount of data that you may store in your Eon cluster, at some point in time your depot will be full, or hit capacity. To make space for new data that is coming in, Vertica will evict some of the files that are least frequently used.
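As a hedged sketch of how you might inspect the subscription bindings just mentioned (column sets vary by version, so this simply selects everything):

    -- Which nodes subscribe to which shards, cluster-wide.
    SELECT * FROM node_subscriptions;

    -- Which shard subscriptions the current session will use for its queries.
    SELECT * FROM session_subscriptions;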
The depot, then, is going to be your query performance enhancer, and you want to shape the contents of your depot; that is, you want to decide what shall be in your depot. Vertica provides policies, called pinning policies, that can help you pin a specific table, or a partition of a table, into the depot, at the subcluster level or at the database level, and Sumeet will talk about this a bit more in his later slides. Now look at some of the system tables that can help you understand the size of the depot, what's in your depot, what files were evicted, and what files were recently fetched into the depot. One of the important system tables I have listed here is DC_FILE_READS. DC_FILE_READS can be used to figure out whether your transaction or query fetched its data from the depot, from communal storage, or both. One of the important features of Eon Mode is the subcluster. Vertica lets you divide your cluster into smaller execution groups. Each of the execution groups has a set of nodes that together subscribe to all the shards and can process your query independently. So when you connect to one node in a subcluster, that node, along with the other nodes in the subcluster, will process your query, and because of that, we can achieve isolation, as well as fast scale-out and scale-in, without impacting what's happening on the rest of the cluster. The good thing about subclusters is that all the subclusters have access to the communal storage, and because of this, if you load data in one subcluster, it's accessible to queries that are running in other subclusters. When we introduced subclusters, we knew that our customers would really love this feature, and some of the things we were considering were these: we knew that our customers would dynamically scale out and in, adding and removing lots of subclusters on demand, so we had to provide the ability to add and remove subclusters in a fast and reliable way. We knew that during off-peak hours, our customers would shut down many of their subclusters, meaning more than half of the nodes could be down, so we had to make adjustments to our quorum policy, which requires at least half of the nodes to be up for the database to stay up. We were also aware that customers would add hundreds of nodes to the cluster, which meant we had to make adjustments to the catalog and commit policy. To take care of all three requirements, we introduced two types of subclusters: primary subclusters and secondary subclusters. The primary subcluster is the one that you get by default when you create your first Eon cluster. The nodes in the primary subcluster are always up; they stay up and participate in the quorum. The nodes in the primary subcluster are responsible for processing commits, and also maintain a persistent copy of the catalog on disk. This is the subcluster you would use to process all your ETL jobs, because the Tuple Mover also runs on the nodes in the primary subcluster. If at this point you want another subcluster where you would like to run queries, and scale it up and down depending on the demand or the workload, you would create a new subcluster, and this subcluster would be secondary in nature. Secondary subclusters have nodes that don't participate in the quorum, so if these nodes are down, there is no impact on Vertica.
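Two quick, hedged checks that follow from this, assuming your version exposes these tables as described:

    -- See how nodes are grouped, and which subclusters are primary.
    SELECT subcluster_name, node_name, is_primary
    FROM subclusters
    ORDER BY subcluster_name, node_name;

    -- Did recent file reads hit the depot or communal storage?
    SELECT * FROM dc_file_reads
    ORDER BY time DESC
    LIMIT 50;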
Secondary subcluster nodes are also not responsible for processing commits, though they maintain up-to-date copies of the catalog in memory; they don't store the catalog on disk. These are subclusters that you can add and remove very quickly, without impacting what is running on the other subclusters. We have customers running subclusters with hundreds of nodes, and subclusters of sizes like 64 nodes, and they can bring a subcluster up and down, or add and remove one, within a few minutes. Before I go into the sizing of Eon Mode, I just want to say one more thing here. We are working very closely with some of our customers who are running Eon Mode, getting feedback from them on a regular basis, and based on that feedback, we are making lots of improvements and fixes in every hot-fix that we put out. So if you are running Eon Mode and want to be part of this group, I suggest that you keep your cluster current with the latest hot-fixes, and work with us to give us feedback and get the improvements that you need to be successful. So let's look at what we need to size Eon clusters. Sizing Eon clusters is very different from sizing Enterprise Mode clusters. When you are sizing a Vertica cluster running Enterprise Mode, you take into account the amount of data that you want to store and the configuration of your nodes; based on that, you decide how many nodes you will need, and then start the cluster. In Eon Mode, to size a cluster, you need to decide a few things: what your shard count should be (the shard count decides the maximum number of nodes that will participate in your query, and we'll talk about this a little more on the next slide), the number of nodes you will need within a subcluster, the instance type you will pick for each subcluster, how many subclusters you will need, how many of them should be running all the time, and how many should run in a dynamic mode. When it comes to shard count, you have to pick the shard count up front, and you can't change it once your database is up and running. So you need to pick the shard count based on the number of nodes that you will need to process a query. One thing to remember here is that this is not the amount of data that you have in the database, but the amount of data your queries will process. You may have data for six years, but if your queries process the last month of data on most occasions, or if your dashboards process up to six weeks, or ten minutes, based on whatever your needs are, you will pick the shard count and node count based on how much data your queries process. Looking at most of our customers, we think that 12 is a good number that should work for most of them; that means the maximum number of nodes in a subcluster that will process queries is going to be 12. If you feel that you need more than 12 nodes to process your query, you can pick other numbers, like 24 or 48. But if you pick a higher number, like 48, and you go with three nodes in your subcluster, each node subscribes to 16 primary and 16 secondary shard subscriptions, which totals 32 subscriptions per node. That will leave your catalog in a broken state.
So, pick your shard count appropriately, and don't pick prime numbers. We suggest 12 should work for most of our customers; if you think your queries process more than the regular amount, say terabytes of data, then pick a number like 24. Don't pick a prime number. Okay? We are also coming up with features in Vertica, like elastic crunch scaling, that will help you run queries on more nodes than the number of shards that you pick, and that feature will be coming out soon. So if you have picked a smaller shard count, it's not the end of the story. Now, the next thing is, you need to pick how many nodes you need within your subclusters to process your queries. The ideal number would be a node count equal to the shard count, or, if you want to pick a number that is less, pick a node count such that each of the nodes has a balanced distribution of subscriptions. So over here, you have the option of 12 nodes and 12 shards, or two subclusters with 6 nodes and 12 shards. Depending on your workload, you can pick either of the two options. The first option, where you have 12 nodes and 12 shards, is more suitable for batch applications, whereas two subclusters with six nodes each are more suitable for dashboard-type applications. Picking subclusters depends on your workload; you can add and remove nodes for workload isolation or elastic throughput scaling. Your subclusters can have nodes of different sizes, but you need to make sure that the nodes within a subcluster are homogeneous. So this is my last slide before I hand over to Sumeet, and this, I think, is a very important slide that I want you to pay attention to. When you pick an instance, you are going to pick it based on workload and query budget. I want to make it clear here that we want you to pay attention to the local disk, because you have the depot on your local disk, which is going to be your query performance enhancer for all kinds of deployments, in the cloud as well as on premise. So in spite of what you may have read or heard, depots still play a very important role in every Eon deployment, and they act like performance enhancers. Most of our customers choose Vertica because they love the performance we offer, and we don't want you to compromise on that performance. So pick nodes with some amount of local disk; at least two terabytes is what we suggest. i3 instances in Amazon come with good local disks that are very helpful, and some of our customers are benefiting from them. With that said, I want to pass it over to Sumeet. >> Sumeet: So, hi everyone, my name is Sumeet Keswani, and I'm a Product Technology Engineer at Vertica. I will be discussing the various use cases that customers deploy in Eon Mode. After that, I will go into some technical details of SQL, and then I'll blend that into the best practices in Eon Mode. And finally, we'll go through some tips and tricks. So let's get started with the use cases. A very basic use case that users will encounter when they start Eon Mode for the first time is to have two subclusters. The first subcluster will be the primary subcluster, used for ETL, like Shirang mentioned, and this subcluster will be mostly on, or always on. And there will be another subcluster used purely for queries; this subcluster is a secondary subcluster, and it will be on sometimes, depending on the use case.
Maybe from nine to five, or Monday to Friday, depending on what application is running on it or what users are doing on it. So this is the most basic use case, something users get started with to get their feet wet. Now, as the deployment of Eon Mode with subclusters increases, users graduate into the second use case, the next level of deployment. In this situation, they still have the primary subcluster used for ETL, typically a larger subcluster where heavier ETL is running pretty much non-stop. Then they have the usual query subcluster, which they use for queries, but they may add another secondary subcluster for ad-hoc workloads. The motivation for this subcluster is to isolate the unpredictable workload from the predictable workload, so as not to impact certain SLAs. So you may keep ad-hoc queries, or users that are running larger queries or bad workloads that occur once in a while, on a different secondary subcluster, so as not to impact the more predictable workload running on the first subcluster. Now, there is no reason why these two subclusters need to have the same instances; they can have different numbers of nodes, different instance types, different depot configurations. Everything can be different. Another benefit is that they can be metered differently, they can be costed differently, so that the appropriate user or tenant can be billed the cost of their compute. Now, as the use increases even further, this is what we see as the final state of a very advanced Eon Mode deployment. As you see, there is the primary subcluster, of course, used for ETL, very heavy ETL, and that's always on. There are numerous secondary subclusters: some for predictable applications that have a very fine-tuned workload that needs definite performance; others for ad-hoc queries; still others for demanding tenants; and there could be still more subclusters for different departments, like Finance, that need them maybe at the end of the quarter. So very, very different applications, and this is the full and final promise of Eon, where there is workload isolation, there is different metering, and each app runs in its own compute space. Okay, so let's talk about a very interesting feature in Eon Mode, which we call Hibernate and Revive. What is Hibernate? Hibernating a Vertica database is the act of dissociating all the compute from the database and shutting it down. At this point, you shut down all compute. You still pay for storage, because your data is in the S3 bucket, but all the compute has been shut down, and you do not pay for compute anymore. If you have reserved instances, or any other instances, you can use them for different applications, and your Vertica database is shut down. So this is very similar to stopping a database; in Eon Mode, you're stopping all compute. The benefit, of course, is that you pay nothing for compute anymore. So what is Revive, then? Revive is the opposite of Hibernate, where you now associate compute with your S3 bucket, or your storage, and start up the database. There is one limitation here that you should be aware of: you must revive the database at the same size it had when you hibernated it. So if you had a 12-node primary subcluster when hibernating, you need to provision 12 nodes in order to revive.
So one best practice comes down to this: you must shrink your database to the smallest size possible before you hibernate, so that you can revive it at the same size, and you don't have to spin up a ton of compute in order to revive. Basically, what this means is, when you have decided to hibernate, we ask you to remove all your secondary subclusters and shrink your primary subcluster down to the bare minimum before you hibernate it. The benefit is that when you do revive, you will be able to do so with the minimum number of nodes. And of course, before you hibernate, you must cleanly shut down the database, so that all the data can be synced to S3. Finally, let's talk about backups and replication. Backups and replication are still supported in Eon Mode. We sometimes get the question, "We're in S3, and S3 has nine nines of reliability; do we need a backup?" Yes, we highly recommend backups. You can back up by using the vbr script, you can back up your database to another bucket, and you can also copy the bucket and revive a different instance of your database. This is very useful, because many times people want staging or development databases, and they need some of the data from production, and this is a nice way to get that. It also makes sure that if you accidentally delete something, you will be able to get your data back. Okay, so let's go into best practices now. I will start with the depot, which is the biggest performance enhancer that we see for queries. I want to state very clearly that reading from S3, or a remote object store like S3, is very slow, because data has to go over the network, and it's very expensive: you will pay for access costs. This is where S3 is not very cheap: every time you access the data, there is an API access cost levied. The depot is a performance-enhancing feature that improves the performance of queries by keeping a local cache of the data that is most frequently used. It also reduces the cost of accessing the data, because you no longer have to go to the remote object store to get the data, since it's available on a local and permanent volume. Hence, depot shaping is a very important aspect of performance tuning in an Eon database. What we ask you to do is, if you are going to use a specific table or partition frequently, you can choose to pin it in the depot, so that if your depot is under pressure or highly utilized, the objects that are most frequently used are kept in the depot. Depot shaping, therefore, is the act of setting eviction policies, whereby you prevent the eviction of files that you believe you need to keep. So, for example, you may keep the most recent year's data, or the most recent partition, in the depot, and thereby all queries running on those partitions will be faster. At this time, we allow you to pin any table or partition in the depot, but it is not subcluster-based. Future versions of Vertica will allow you to fine-tune the depot for each subcluster. So, let's now go and understand a little bit of the internals of how a SQL query works in Eon Mode, and once I explain this, we will blend into best practices, and it will become much clearer why we recommend certain things. Since S3 is our layer of durability, where data is persistent in an Eon database, when you run an insert query, like "INSERT INTO table VALUES (1)" or something similar, data is synchronously written into S3.
So, let's now go and understand a little bit of the internals of how a SQL query works in Eon Mode. Once I explain this, we will blend into best practices, and it will become much clearer why we recommend certain things. Since S3 is our layer of durability, where data is persisted in an Eon database, when you run an insert query — like "insert into table values (1)", or something similar — data is synchronously written into S3. Before control returns to the client, a copy of the data is first stored in the local depot and then uploaded to S3; only then do we hand control back to the client. This ensures that if something bad were to happen, the data will be persistent. The second type of SQL transaction is what we call DDLs, which are catalog operations. So for example, you create a table, or you add a column. These operations actually work with metadata. Now, as you may know, S3 does not offer mutable storage; the storage in S3 is immutable. You can never append to a file in S3. And the way transaction logs work is that they are append operations. So when you modify the metadata, you are actually appending to a transaction log. This poses an interesting challenge, which we resolve by appending to the transaction log locally in the catalog; then there is a service that syncs the catalog to S3 every five minutes. This does create a window of risk, though: if you were to destroy or delete an instance abruptly, you could lose the commits that happened in the last five minutes. I'll speak to this more in the subsequent slides. Now, finally, let's look at drops or truncates in Eon. A drop or a truncate is really a combination of the first two things we spoke about. When you drop a table, you are making a metadata change: you are telling Vertica that this table no longer exists, so we append to the transaction log that this table has been removed. This log, of course, will be synced every five minutes to S3, as we discussed. There is also the secondary operation of deleting all the files that were associated with the data in this table. Now, these files are on S3, and we could go about deleting them synchronously, but that would take a lot of time, and we do not want to hold up the client for that duration. So we do not synchronously delete the files; we put the files that need to be removed in a reaper queue and return control back to the client. This has the performance benefit that drops appear to occur really fast. It also has a cost benefit: batching deletes in big batches is more performant and less costly. For example, on Amazon, you can delete 1,000 files at a time in a single call, so if you batch your deletes, you can delete them very quickly. The disadvantage of this is that if you were to terminate a Vertica cluster abruptly, you could leak files in S3, because the reaper queue would not have had the chance to delete these files. Okay, so let's go into best practices, now that we understand some technical details. As I said, reading from and writing to S3 is slow and costly. So, the first thing you can do is avoid as many round trips to S3 as possible: the bigger the batches of data you load, the better the performance you get per commit. The next thing is, don't read and write from S3 if you can avoid it. A lot of our customers have intermediate data processing in which they transform the data temporarily before finally committing it. There is no reason to use regular tables for this kind of intermediate data. We recommend using local temporary tables, which have the benefit of not having to upload data to S3 (a quick sketch follows below).
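As a concrete illustration of that recommendation, here is a minimal sketch of the local-temporary-table pattern; the table, column, and file names are hypothetical. LOCAL TEMP tables live on node-local storage, so the intermediate rows never make the round trip to S3.

CREATE LOCAL TEMPORARY TABLE staging_events (
    event_id INT,
    event_ts TIMESTAMP,
    payload  VARCHAR(500)
) ON COMMIT PRESERVE ROWS;

-- Load and transform on local storage first...
COPY staging_events FROM '/data/events.csv' DELIMITER ',';

-- ...then write only the final result to a regular, S3-backed table.
INSERT INTO public.events
SELECT event_id, event_ts, UPPER(payload)
FROM staging_events;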
Finally, there is another optimization you can make. Vertica has the concept of active partitions and inactive partitions. Active partitions are the ones where you have recently loaded data, and Vertica is lazy about merging these partitions into a single ROS container. Inactive partitions are historical partitions — consider last year's data, or the year before that; those partitions are aggressively merged into a single container. And how do we know how many partitions are active and inactive? That's based on a configuration parameter. If you load into an inactive partition, Vertica is very aggressive about merging these containers, so we download the entire partition, merge the records that you loaded into it, and upload it back again. This creates a lot of network traffic, and as I said, accessing data from S3 is slow and costly. So we recommend you not load into inactive partitions. You should load into the most recent, or active, partitions, and if you do happen to load into older partitions, set your active partition count accordingly. Okay, let's talk about the reaper queue. Depending on the velocity of your ETL, you can pile up a lot of files that need to be deleted asynchronously. If you were to terminate a Vertica cluster without allowing enough time for these files to get deleted, you could leak files in S3. Of course, if you use local temporary tables this problem does not occur, because the files were never created in S3; but if you are using regular tables, you must allow Vertica enough time to delete these files. You can change the interval at which we delete, and how much time we allow for deletion at shutdown, by editing some configuration parameters that I have mentioned here. Okay, so let's talk a little bit about the catalog at this point. The catalog is synced every five minutes onto S3 for persistence, and the catalog truncation version is the minimal viable version of the catalog to which we can revive. So, for instance, if somebody destroyed the entire Vertica cluster, the catalog truncation version is the minimum viable version that you will be able to revive to. In order to make sure that the catalog truncation version is up to date, you must always shut down your Vertica cluster cleanly; this allows the catalog to be synced to S3. There are some SQL commands that you can use to see what the catalog truncation version is on S3. For the most part, you don't have to worry about this if you're shutting down cleanly; it matters only in cases of disaster, or some event where all nodes were terminated without the user's permission. And finally, let's talk about backups one more time: we highly recommend you take backups. S3 is designed for 99.9% availability, so there could be an occasional downtime, and having backups will also help if you accidentally drop a table — S3 will not protect you against data that was deleted by accident. And why not back up? Storage is cheap. You can replicate the entire bucket and have that as a backup, or have a DR copy running in a different region, which also serves as a backup. So we highly recommend that you make backups. With this I would like to end my presentation, and we're ready for any questions if you have them. Thank you very much.
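To ground the partition and shutdown guidance above in SQL, here is a minimal hedged sketch; the table name is hypothetical, and exact syntax may vary by Vertica version.

-- Declare that only the two most recent partitions receive new loads,
-- so Vertica stays lazy about merging their ROS containers:
ALTER TABLE public.sales_fact SET ACTIVEPARTITIONCOUNT 2;

-- Shut down cleanly so the catalog (and its truncation version) is
-- synced to communal storage before compute goes away:
SELECT SHUTDOWN();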

Published Date : Mar 30 2020


UNLIST TILL 4/2 - Tapping Vertica's Integration with TensorFlow for Advanced Machine Learning


 

>> Paige: Hello, everybody, and thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled "Tapping Vertica's Integration with TensorFlow for Advanced Machine Learning." I'm Paige Roberts, Open Source Relations Manager at Vertica, and I'll be your host for this session. Joining me is Vertica Software Engineer, George Larionov. >> George: Hi. >> Paige: (chuckles) That's George. So, before we begin, I encourage you guys to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. So, as soon as a question occurs to you, go ahead and type it in, and there will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to get to during that time. Any questions we don't get to, we'll do our best to answer offline. Now, alternatively, you can visit the Vertica forum to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going, so you can ask an engineer afterwards, just as if it were a regular conference in person. Also, a reminder: you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides. And, before you ask, yes, this virtual session is being recorded, and it will be available to view by the end of this week. We'll send you a notification as soon as it's ready. Now, let's get started. Over to you, George. >> George: Thank you, Paige. So, I've been introduced. I'm a Software Engineer at Vertica, and today I'm going to be talking about a new feature, Vertica's integration with TensorFlow. First, I'm going to go over what TensorFlow and neural networks are. Then, I'm going to talk about why integrating with TensorFlow is a useful feature, and, finally, I am going to talk about the integration itself and give an example. So, as we get started here, what is TensorFlow? TensorFlow is an open-source machine learning library, developed by Google, and it's actually one of many such libraries. The whole point of libraries like TensorFlow is to simplify the whole process of working with neural networks — creating, training, and using them — so that it's available to everyone, as opposed to just a small subset of researchers. So, neural networks are computing systems that allow us to solve various tasks. Traditionally, computing algorithms were designed completely from the ground up by engineers like me, and we had to manually sift through the data and decide which parts were important for the task and which were not. Neural networks aim to solve this problem, a little bit, by sifting through the data themselves, automatically finding traits and features which correlate to the right results. So, you can think of it as neural networks learning to solve a specific task by looking through the data, without having human beings sit and sift through the data themselves. So, there are a couple of necessary parts to getting a trained neural model, which is the final goal. By the way, a neural model is the same as a neural network; those terms are synonymous. So, first, you need this light blue circle, an untrained neural model, which is pretty easy to get in TensorFlow, and, in addition to that, you need your training data. Now, this involves both training inputs and training labels, and I'll talk about exactly what those two things are on the next slide.
But, basically, you need to train your model with the training data, and, once it is trained, you can use your trained model to predict on just the purple circle — new inputs. And it will predict the labels for you; you don't have to label the data anymore. So, training a neural network can be thought of as teaching a person how to do something. For example, if I want to learn to speak a new language, let's say French, I would probably hire some sort of tutor to help me with that task, and I would need a lot of practice constructing and saying sentences in French, with a lot of feedback from my tutor on whether my pronunciation, grammar, et cetera, is correct. So, that would take me some time, but, finally, hopefully, I would be able to learn the language and speak it without any sort of feedback, getting it right. In a very similar manner, a neural network needs to practice on example training data first, and, along with that data, it needs labeled data. In this case, the labeled data is kind of analogous to the tutor: it is the correct answers, so that the network can learn what those look like. But, ultimately, the goal is to predict on unlabeled data, which is analogous to me knowing how to speak French. So, I went over most of the bullets. A neural network needs a lot of practice. To do that, it needs a lot of good labeled data, and, finally, since a neural network needs to iterate over the training data many, many times, it needs a powerful machine which can do that in a reasonable amount of time. So, here's a quick checklist of what you need if you have a specific task that you want to solve with a neural network. The first thing you need is a powerful machine for training; we discussed why this is important. Then, you need TensorFlow installed on the machine, of course, and you need a dataset and labels for your dataset. Now, this dataset can be hundreds of examples, thousands, sometimes even millions. I won't go into that, because the dataset size really depends on the task at hand, but if you have these four things, you can train a good neural network that will predict whatever result you want it to predict at the end. So, we've talked about neural networks and TensorFlow, but the question is, if we already have a lot of built-in machine-learning algorithms in Vertica, then why do we need TensorFlow? To answer that question, let's look at this dataset. This is a pretty simple toy dataset with 20,000 points, but it simulates a more complex dataset with two different classes which are not related in a simple way. The existing machine-learning algorithms that Vertica already has mostly fail on this pretty simple dataset. Linear models can't really draw a good line separating the two types of points. Naïve Bayes also performs pretty badly, and even the Random Forest algorithm, which is a pretty powerful algorithm, with 300 trees gets only 80% accuracy. However, a neural network with only two hidden layers gets 99% accuracy in about ten minutes of training. So, I hope that's a pretty compelling reason to use neural networks, at least sometimes. As an aside, there are plenty of tasks that do fit the existing machine-learning algorithms in Vertica.
That's why they're there, and if one of the tasks you want to solve fits one of the existing algorithms, then I would recommend using that algorithm, not TensorFlow, because, while neural networks have their place and are very powerful, it's often easier to use an existing algorithm, if possible. Okay, so, now that we've talked about why neural networks are needed, let's talk about integrating them with Vertica. So, neural networks are best trained using GPUs — Graphics Processing Units — which are, basically, just a different processing unit than a CPU. GPUs are good for training neural networks because they excel at doing many, many simple operations at the same time, which is what a neural network needs in order to iterate through the training data many times. However, Vertica runs on CPUs and cannot run on GPUs at all, because that's not how it was designed. So, to train our neural networks, we have to go outside of Vertica, and exporting a small batch of training data is pretty simple. So, that's not really a problem, but, given this information, why do we even need Vertica? If we train outside, then why not do everything outside of Vertica? To answer that question, here is a slide that Philips was nice enough to let us use. This is an example of a production system at Philips. It consists of two branches. On the left, we have a branch with historical device log data, and this can kind of be thought of as a bunch of training data. All that data goes through some data integration and data analysis. Basically, this is where you train your models, whether or not they are neural networks, but, for the purpose of this talk, this is where you would train your neural network. And, on the right, we have a branch which has live device log data coming in from various MRI machines, CAT scan machines, et cetera, and this is a ton of data. These machines are constantly running. They're constantly on, and there are a bunch of them. So, data just keeps streaming in, and we don't want this data to have to take any unnecessary detours, because that would greatly slow down the whole system. So, this data in the right branch goes through an already-trained predictive model, which needs to be pretty fast, and, finally, it allows Philips to do some maintenance on these machines before they actually break, which helps Philips, obviously, and definitely the medical industry as well. So, I hope this slide helped explain the complexity of a live production system and why it might not be reasonable to train your neural networks directly in the system with the live device log data. So, a quick summary of just the neural networks section: neural networks are powerful, but they need a lot of processing power to train, which can't really be done well in a production pipeline. However, they are cheap and fast to predict with — prediction with a neural network does not require a GPU anymore — and they can be very useful in production, so we do want them there. We just don't want to train them there. So, the question is, now, how do we get neural networks into production? We have, basically, two options. The first option is to take the data and export it to our machine with TensorFlow, our powerful GPU machine, or we can take our TensorFlow model and put it where the data is. In this case, let's say that that is Vertica. So, I'm going to go through some pros and cons of these two approaches. The first one is bringing the data to the analytics.
The pros of this approach are that TensorFlow is already installed and running on this GPU machine, and we don't have to move the model at all. The cons, however, are that we have to transfer all the data to this machine, and if that data is big — gigabytes, terabytes, et cetera — then that becomes a huge bottleneck, because you can only transfer in small quantities, because GPU machines tend to not be that big. Furthermore, TensorFlow prediction doesn't actually need a GPU, so you would end up paying for an expensive GPU for no reason. It's not parallelized, because you just have one GPU machine. You can't put your production system on this GPU machine, as we discussed. And, so, you're left with good results, but not fast and not where you need them. So, now, let's look at the second option: bringing the analytics to the data. The pros of this approach are that we can integrate with our production system. It's low impact, because prediction is not processor intensive. It's cheap — or, at least, pretty much as cheap as your system was before. It's parallelized, because Vertica was always parallelized, which we'll talk about in the next slide. There's no extra data movement. You get the benefit of model management in Vertica, meaning, if you import multiple TensorFlow models, you can keep track of their various attributes, when they were imported, et cetera. And the results are right where you need them, inside your production pipeline. The two cons are that TensorFlow is limited to just prediction inside Vertica, and, if you want to retrain your model, you need to do that outside of Vertica and then re-import it. So, just as a recap of parallelization: everything in Vertica is parallelized and distributed, and TensorFlow is no exception. When you import your TensorFlow model to your Vertica cluster, it gets copied to all the nodes automatically, and TensorFlow will run in fenced mode, which means that if the TensorFlow process fails for whatever reason — even though it shouldn't, but if it does — Vertica itself will not crash, which is obviously important. And, finally, prediction happens on each node. There are multiple threads of TensorFlow processes running, processing different little bits of data, which is much faster than processing the data line by line, because it all happens in a parallelized fashion. And, so, the result is fast prediction. So, here's an example which I hope is a little closer to what everyone is used to than the usual machine learning TensorFlow example. This is the Boston housing dataset — or, rather, a small subset of it. Now, on the left, we have the input data (to go back to, I think, the first slide), and, on the right, is the training label. The input data consists of lines where each line is a plot of land in Boston, along with various attributes, such as the level of crime in that area, how much industry is in that area, whether it's on the Charles River, et cetera, and, on the right, we have as the labels the median house value in that plot of land. And, so, the goal is to put all this data into the neural network and, finally, get a model which can predict on new incoming data and produce a good housing value for that data. Now, I'm going to go through, step by step, how to actually use TensorFlow models in Vertica.
So, the first step I won't go into in much detail, because there are countless tutorials and resources online on how to use TensorFlow to train a neural network. The second step is to save the model in TensorFlow's 'frozen graph' format; again, this information is available online. The third step is to create a small, simple JSON file describing the inputs and outputs of the model, and what data types they are, et cetera. This is needed for Vertica to be able to translate from TensorFlow land into Vertica SQL land, so that it can use a SQL table instead of the input format TensorFlow usually takes. So, once you have your model file and your JSON file, you put both of those files in a directory on a node — any node in the Vertica cluster — and name that directory whatever you want your model to ultimately be called inside of Vertica. Once you do that, you can go ahead and import that directory into Vertica. This import-models function already exists in Vertica; all we added was a new category that it can import. So, what you need to do is specify the path to your neural network directory and specify that the model's category is TensorFlow. Once you successfully import, in order to predict, you run this brand-new predict-TensorFlow function. In this case, we're predicting on everything from the input table, which is what the star means. The model name is Boston housing net, which is the name of your directory, and then there's a little bit of boilerplate: the two names, ID and value, after the AS are just the names of the columns of your outputs, and, finally, the Boston housing data is whatever SQL table you want to predict on that fits the input type of your network. And this will output a bunch of predictions — in this case, values of houses that the network thinks are appropriate for all the input data. (A short sketch of these two calls follows below.) So, just a quick summary. We talked about what TensorFlow and neural networks are, and then we discussed that TensorFlow works best for training on GPUs, while Vertica is designed to use CPUs: it's really good at storing and accessing a lot of data quickly, but it's not very well designed for having neural networks trained inside of it. Then, we talked about how neural models are powerful, and we want to use them in our production flow. Since prediction is fast, we can go ahead and do that — we just don't want to train there. And, finally, I presented Vertica's TensorFlow integration, which allows importing a trained TensorFlow model into Vertica and predicting on all the data that is inside Vertica with a few simple lines of SQL. So, thank you for listening. I'm going to take some questions now.
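Putting the import and predict steps just described into one place, here is a minimal SQL sketch; the directory path, model name, and output column names are hypothetical, and exact parameter names may differ across Vertica versions.

-- Import a frozen TensorFlow model from a directory on one node;
-- the directory name becomes the model name inside Vertica.
SELECT IMPORT_MODELS('/home/dbadmin/tf_models/boston_housing_net'
                     USING PARAMETERS category='TENSORFLOW');

-- Predict on a SQL table whose columns match the model's declared inputs;
-- ID and value are just chosen names for the output columns.
SELECT PREDICT_TENSORFLOW(* USING PARAMETERS model_name='boston_housing_net')
       OVER() AS (ID, value)
FROM boston_housing_data;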

Published Date : Mar 30 2020


UNLIST TILL 4/2 - Autonomous Log Monitoring


 

>> Sue: Hi everybody, thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Autonomous Monitoring Using Machine Learning". My name is Sue LeClaire, director of marketing at Vertica, and I'll be your host for this session. Joining me is Larry Lancaster, founder and CTO at Zebrium. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait — just type your question or comment in the question box below the slide and click submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can also go and visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, just a reminder that you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available for you to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Larry, over to you. >> Larry: Hey, thanks so much. So hi, my name's Larry Lancaster, and I'm here to talk to you today about something whose time I think has come, and that's autonomous monitoring. So, with that, let's get into it. So, machine data is my life. I know that's a sad life, but it's true. I've spent most of my career taking telemetry data from products — either "in the field," as we used to call it, or, nowadays, "deployed" — and bringing that data back, like log files and stats, and then building stuff on top of it: tools to run the business, or services to sell back to users and customers. And so, after doing that a few times, it got to the point where I was really sort of sick of building the same kind of thing from scratch every time, so I figured, why not go start a company and do it so that we don't have to do it manually ever again. So, it's interesting to note — I've put a little sentence here saying "companies where I got to use Vertica." I've actually been working with Vertica for a long time now, pretty much since they came out of alpha, and I've really been enjoying their technology ever since. So, our vision is basically that I want a system that will characterize incidents before I notice. An incident is what we used to call a support case or a ticket in IT, or a support case in support. Nowadays, you may have a DevOps team, or a set of SREs who are monitoring a production deployment, and they'll call it an incident. So I'm looking for something that will notice and characterize an incident before I notice and have to go digging into log files and stats to figure out what happened. That's a pretty heady goal, and I'm going to talk a little bit today about how we do that. So, let's look at logs in particular. Monitoring is the umbrella term that we use to talk about how we monitor systems in the field that we've shipped, or how we monitor production deployments in a more modern stack. And so, basically, there are log monitoring tools, but they have a number of drawbacks.
For one thing, they're kind of slow, in the sense that if something breaks and I need to go to a log file — and chances are really good that if you have a new issue, an unknown-unknown problem, you're going to end up in a log file — the problem becomes that you're searching around looking for the root cause of the incident, right? And so that's time-consuming. They're also fragile, and this is largely because log data is completely unstructured: there's no formal grammar for a log file. So you have this situation where, if I write a parser today, and that parser is going to execute some automation — it's going to open or update a ticket, it's going to maybe restart a service, or whatever it is that I want to happen — what'll happen is that later, upstream, someone who's writing the code that produces that log message might do something really useful for me, or for users, and go fix a spelling mistake in that log message. And then the next thing you know, all the automation breaks. So it's a very fragile source for automation. And finally, because of that, people will set alerts on, "Oh, well, tell me how many thousands of errors are happening every hour," or some horrible metric like that, and then that becomes the only visibility you have into the data. So because of all this, it's a very human-driven, slow, fragile process. So basically, we've set out to up-level that a bit. I touched on this already, right? The truth is, if you do have an incident, you're going to end up in log files to do root cause. It's almost always the case. And so you have to wonder, if that's the case, why do most people use metrics only for monitoring? And the reason is related to the problems I just described: metrics are already structured, right? With logs, you've got this mess of stuff, so you only want to dig in there when you absolutely have to. But ironically, it's where a lot of the information that you need actually is. So we have a model today, and this model used to work pretty well. That model is called "index and search." It basically means you treat log files like they're text documents: you index them, and when there's some issue you have to drill into, you go searching, right? So let's look at that model. Twenty years ago, we had sort of a shrink-wrap software delivery model. You had an incident. With that incident, maybe you had one customer, and you had a monolithic application and a handful of log files. So it was perfectly natural — in fact, usually you could just open the log file in vi and search that way, or, if there were a lot of them, you could index them and search them that way. And that all worked very well, because the developer or the support engineer had to be an expert in those few things, in those few log files, and understand what they meant. But today, everything has changed completely. We live in a software-as-a-service world. What that means is, for a given incident, first of all, you're going to be affecting thousands of users. You're going to have, potentially, 100 services deployed in your environment. You're going to have 1,000 log streams to sift through. And yet, you're still stuck in the situation where, to go find out what's the matter, you're going to have to search through the log files. So this is the kind of unacceptable position we're in today. So for us, the future will not be index and search, and that's simply because it cannot scale.
And the reason I say that it can't scale is because it's all bottlenecked by a person and their eyeball. So, you continue to drive up the amount of data that has to be sifted through and the complexity of the stack that has to be understood, and you still, at the end of the day, for MTTR purposes, have the same bottleneck, which is the eyeball. So this model, I believe, is fundamentally broken. And that's why I believe that in five years you're going to be in a situation where most monitoring of unknown-unknown problems is going to be done autonomously, and those issues will be characterized autonomously, because there's no other way it can happen. So now I'm going to talk a little bit about autonomous monitoring itself. Autonomous monitoring basically means: if you can imagine a monitoring platform, and you watch the monitoring platform — maybe you watch the alerts coming from it, or, more importantly, you watch the dashboards and try to see if something looks weird — autonomous monitoring is the notion that the platform should do the watching for you, only let you know when something is going wrong, and give you a window into what happened. So if you look at this example I have on screen — just to take it really slow and absorb the concept of autonomous monitoring — here, in this example, we've stopped the database. And as a result, down below, you can see there was a bunch of fallout. This is an Atlassian stack, so you can imagine you've got a Postgres database, and then you've got Bitbucket, and Confluence, and Jira, and these various other components that need the database operating in order to function. So what this is doing is calling out, "Hey, the root cause is the database stopped, and here are the symptoms." Now, you might be wondering, so what — I mean, I could go write a script to do this sort of thing. Here's what's interesting about this particular example, and I'll show a couple more examples that are a little more involved. In the software that came up with this incident, opened it, and put this root cause and these symptoms in there, there's no code that knows anything about timestamp formats, severities, Atlassian, Postgres, databases, Bitbucket, or Confluence; there are no regexes that talk about starting, stopped, RDBMS, swallowed exception, and so on and so forth. So you might wonder how it's possible, then, that something which is completely ignorant of the stack could come up with this description, which is exactly what a human would have had to do to figure out what happened. And I'm going to get into how we do that. But that's what autonomous monitoring is about. It's about getting a set of telemetry from a stack, with no prior information, and understanding when something breaks. And I can give you the punchline right now: there are fundamental ways that software behaves when it's breaking. By looking at hundreds of data sets that people have generously allowed us to use, containing incidents, we've been able to characterize that behavior and generalize it to apply to any new data set and stack. So here's an interesting one right here. There's a fella, David Gill — he's just a genius in the monitoring space — who has been working with us for the last couple of months. He said, "You know what I'm going to do? I'm going to run some chaos experiments." For those of you who don't know what chaos engineering is, here's the idea.
So basically, let's say I'm running a Kubernetes cluster. What I'll do is use a chaos injection test, something like Litmus, and basically it will inject issues — it'll break things in my application randomly — to see if my monitoring picks it up. That's what chaos engineering is built around: generating lots of random problems and seeing how the stack responds. So in this particular case, David went in, and one of the tests that was presented through Litmus did a pod delete. That's going to take out some containers that are part of the service layer, and then you'll see all kinds of things break. And what you're seeing here is interesting — this is why I like to use this example, because it's actually kind of eye-opening. The chaos tool itself generates logs. And of course, through Kubernetes, all the log file locations on the host, and the container logs, are known, and those are all pulled back to us automatically. So one of the log files we have is actually from the chaos tool that's doing the breaking, right? And what the tool said here, when it went to determine what the root cause was, was that it noticed there was this process that had these messages: initializing deletion list, selecting a pod to kill, blah blah blah. It's saying that the root cause is the chaos test. And it's absolutely right — that is the root cause. But usually chaos tests don't get picked up themselves; you're supposed to just pick up the symptoms. This is what happens when you're able to tease out root cause from symptoms autonomously: you end up getting a much more meaningful answer, right? So here's another example. Essentially, we collect the log files, but we also have a Prometheus scraper. So if you export Prometheus metrics, we'll scrape those and collect those as well, and we'll use those for our autonomous monitoring too. So what you're seeing here is an issue where — I believe this is where we ran something out of disk space. So it opened an incident, but what's also interesting here is that it pulled in that metric, to say that the spike in this metric was a symptom of running out of space. So again, there's nothing that knows anything about file system usage, memory, CPU, any of that stuff. There's no hard-coded logic anywhere to explain any of this. And so the concept of autonomous monitoring is looking at a stack the way a human being would. If you can imagine how you would walk in and monitor something, how you would think about it: you'd go looking around for rare things — things that are not normal — and you would look for indicators of breakage, and you would see, do those seem to be correlated in some dimension? That is how the system works. So as I mentioned a moment ago, metrics really do complete the picture for us. We end up in a situation where we have a one-stop shop for incident root cause. So, how does that work? Well, we ingest and we structure the log files. So if we're getting the logs, we'll ingest them and structure them — and I'm going to show a little bit of what that structure looks like and how that goes into the database in a moment — and then, of course, we ingest and structure the Prometheus metrics. But here, "structure" really should have an asterisk next to it, because metrics are mostly structured already. They have names.
If you have your own scraper — as opposed to going into the Prometheus time series database and pulling metrics from there — you can keep a lot more metadata about those metrics from the exporter's perspective. So we keep all of that too. Then we do our anomaly detection on both of those sets of data, and then we cross-correlate metric and log anomalies, and then we create incidents. So this is, at a high level, what's happening, without any stack-specific logic built in. So we had some exciting recent validation. MayaData's a pretty big player in the Kubernetes space. Essentially, they do Kubernetes as a managed service. They have tens of thousands of customers for whom they manage Kubernetes clusters, and they're also involved both in the OpenEBS project and in the Litmus project I mentioned a moment ago — that's their tool for chaos engineering. So they're a pretty big player in the Kubernetes space. Essentially, they said, "Okay, let's see if this is real." So what they did was set up our collectors, which took three minutes in Kubernetes, and then, using Litmus, they reproduced eight incidents that their actual, real-world customers had hit — trying to pick the ones that were the hardest to figure out the root cause of at the time. And we picked up and put up a root cause indicator that was correct in 100% of these incidents, with no training, configuration, or metadata required. So this is what autonomous monitoring is all about. So now I'm going to talk a little bit about how it works. Like I said, there's no information included or required about the stack — so imagine a log file, for example. Commonly, over at the left-hand side of every line, there will be some sort of a prefix, and what I mean by that is you'll see a timestamp, or a severity, and maybe there's a PID, and maybe there's a function name, and maybe there's some other stuff there. Basically, it's a set of common data elements for a large portion of the lines in a given log file, although, of course, the contents change. So today, if you look at a typical log manager, they'll talk about connectors. What "connectors" means is that an application will generate a certain prefix format in a log — what's the format of the timestamp, and what else is in the prefix — and this lets the tool pick it up. And so, if you have an app that doesn't have a connector, you're out of luck. Well, what we do is learn those prefixes dynamically, with machine learning. You do not have to have a connector, right? And what that means is that if you come in with your own application, the system will just work for it from day one. You don't have to have connectors, you don't have to describe the prefix format. That's so yesterday, right? So really, what we want to be doing is up-leveling what the system is doing to the point where it's working like a human would. You look at a log line, you know what's a timestamp. You know what's a PID. You know what's a function name. You know where the prefix ends and where the variable parts begin. You know what's a parameter over there in the variable parts. And sometimes you may need to see a couple of examples to know what was a variable, but you'll figure it out as quickly as possible, and that's exactly how the system goes about it. As a result, we embrace free-text logs, right?
So if you look at a typical stack, most of the logs generated are usually free-text. Even structured logging typically will have a message attribute, which then, inside of it, has the free-text message. For us, that's not a bad thing. That's okay. The purpose of a log is to inform people, and so there's no need to go rewrite the whole logging stack just because you want a machine to handle it. It'll figure it out for itself, right? So, you give us the logs, and we'll figure out the grammar — not only for the prefix, but also for the variable message part. I already went into this, but there's more that's usually required for configuring a log manager with alerts: you have to give it keywords, you have to give it application behaviors, you have to tell it some prior knowledge. And of course, the problem with all of that is that the most important events that you'll ever see in a log file are the rarest. Those are the ones that are one in a billion, and so you may not know what's going to be the right keyword in advance to pick up the next breakage, right? So we don't want that information from you. We'll figure that out for ourselves. As the data comes in, essentially we parse it and we categorize it, as I've mentioned. And when I say categorize, what I mean is: if you look at a given log file, you'll notice that some of the lines are kind of the same thing. So this one will say "X happened five times," and then maybe a few lines below it'll say "X happened six times" — but that's basically the same event type; it's just a different instance of that event type, with a different value for one of the parameters, right? So when I say categorization, I mean figuring out those unique types, and I'll show an example of that next. Anomaly detection, we do on top of that. Anomaly detection on metrics, in a time-series-by-time-series manner with lots of tunables, is a well-understood problem. So we also do this on the event-type occurrences. You can think of each event type occurring in time as a point process, and then you can develop statistics and distributions on that, and do anomaly detection on those. Once we have all of that, we have essentially extracted features from metrics and from logs. We do pattern recognition on the correlations across different channels of information — different event types, different log types, different hosts, different containers — and then, of course, across to the metrics. Based on all of this cross-correlation, we end up with a root cause identification. So that's, at a high level, how it works. What's interesting, from the perspective of this talk particularly, is that incident detection needs relationally structured data. It really does. You need to have all the instances of a certain event type that you've ever seen easily accessible. You need to have the values for a given parameter quickly available, so you can figure out what the distribution of this parameter is over time, and how often this event type happens. You can run analytical queries against that information so that you can quickly, in real time, do anomaly detection against new data. So here's an example of what this looks like, and this is part of the work that we've done. At the top, you see some examples of log lines, right? So that's a snippet — three lines out of a log file — and you see one in the middle there that's highlighted with colors, right?
I mean, it's a little messy, but it's not atypical of the log files that you'll see pretty much anywhere. So there, you've got a timestamp, and a severity, and a function name. Then you've got some other information. And then, finally, you have the variable part, and that's going to have this "checkpoint for memory scrubbers" message — probably something that's written in English just so that the person who's reading the log file can understand it — and then there are some parameters that are put in, right? So now, if you look at how we structure that: there are going to be three tables that correspond to the three event types that we see above, and we're going to look at the one that corresponds to the one in the middle. If we look at that table, you'll see a table with columns: one for severity, for function name, for time zone, and so on — and date, and PID. And then you see, over to the right, in the colored columns, the parameters that were pulled out from the variable part of that message. They're put in, they're typed, and they're in integer columns. So this is the way structuring needs to work with logs to be able to do efficient and effective anomaly detection. And as far as I know, we're the first people to do this inline. All right, so let's talk now about Vertica, and why we take those tables and put them in Vertica. So Vertica really is an MPP column store, but it's more than that, because nowadays when you say "column store," people think of, for example, Cassandra as a column store — but it's not; Cassandra's not a column store in the sense that Vertica is. Vertica was built from the ground up to be the original column store. Back in the C-Store project that Stonebraker was involved in, he said, let's explore what kind of efficiencies we can get out of a real columnar database. And what he and his grad students — who went on to start Vertica — found was that they could build a database that gives orders of magnitude better query performance for the kinds of analytics I'm talking about here today, with orders of magnitude less data storage underneath. So, building on top of machine data, as I mentioned, is hard, because it doesn't have any defined schemas. But we can use an RDBMS like Vertica, once we've structured the data, to do the analytics that we need to do. I talked a little bit about this, but if you think about machine data in general, it's perfectly suited for a columnar store. Because, if you imagine laying out all the attributes of an event type: there may be, say, three or four function names that occur across all the instances of a given event type, and so if you were to sort all of those event instances by function name, you would find that you have long, million-row runs of the same function name over and over. So what you have, in general, in machine data is lots and lots of slowly varying attributes — lots of low-cardinality data — that gets almost completely compressed out when you use a real column store. So you end up with a massive footprint reduction on disk. And that efficiency propagates through the analytical pipeline, because Vertica does late materialization, which means it tries to carry that data through memory with that same efficiency, right?
So the scale-out architecture, of course, is really suitable for petascale workloads. Also, I should point out — I was going to mention it in another slide or two — we use the Vertica Eon architecture, and we have had no problems scaling it in the cloud. It's a beautiful rewrite of the entire data layer of Vertica. The performance and flexibility of Eon is just unbelievable, and so I've really been enjoying using it. I was skeptical that you could get a real column store to run in the cloud effectively, but I was completely wrong. Finally, I should mention that if you look at column stores, to me, Vertica is the one that has the full SQL support, the ODBC drivers, the ACID compliance — which means I don't need to worry about these things as an application developer. So I'm laying out the reasons that I like to use Vertica. I touched on this already, but essentially, what's amazing is that Vertica Eon is basically using S3 as an object store. And of course, there are other offerings, like the one that Vertica does with Pure Storage, that don't use S3. But what I find amazing is how well the system performs using S3 as an object store, and how they manage to keep an actually consistent database. And they do. We've had issues where we've gone and shut down hosts, or hosts have been shut down on us, and we have to restart the database — and we don't have any consistency issues. It's unbelievable, the work that they've done. Another thing that's great about the way it works is that you can use S3 as a shared object store. You can have query nodes querying from that set of files largely independently of the nodes that are writing to them, so you avoid the bottleneck issue where you've got contention over who's writing what, and who's reading what, and so on. I've found the performance using separate subclusters for our UI and for the ingest has been amazing. Another couple of things they have: there are a lot of in-database machine learning libraries — there's actually some cool stuff on their GitHub that we've used — and one thing that we make a lot of use of is the sequence and time series analytics. For example, in our product, even though we do all of this stuff autonomously, you can also go create alerts for yourself. And one of the kinds of alerts you can do, you can say, "Okay, if this kind of event happens within so much time, and then this kind of event happens, but not this one," then you can be alerted. So you can define these kinds of sequences of events that would indicate a problem, and we use their sequence analytics for that. It gives you really good performance on queries where you want to pull sequences of events out of a fact table. And the time series analytics is really useful if you want to do analytics on the metrics and do gap-filling interpolation on them. It's actually really fast, and it's easy to use through SQL. So those are a couple of Vertica extensions that we use. So finally, I would like to encourage everybody: hey, come try us out. It should be up and running in a few minutes if you're using Kubernetes; if not, it's however long it takes you to run an installer. So you can just come to our website, pick it up, and try out autonomous monitoring. And I want to thank everybody for your time. And we can open it up for Q and A.
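For readers curious what the sequence and time series extensions Larry mentions look like, here is a minimal hedged sketch in Vertica SQL; the events and metrics tables and their columns are hypothetical.

-- Sequence analytics: flag a service-stop event followed by an error
-- on the same host, using Vertica's MATCH clause.
SELECT host, event_type, ts, match_id()
FROM events
MATCH (
    PARTITION BY host ORDER BY ts
    DEFINE
        stop_evt  AS event_type = 'service_stopped',
        error_evt AS event_type = 'swallowed_exception'
    PATTERN p AS (stop_evt error_evt)
    ROWS MATCH FIRST EVENT
);

-- Time series analytics: resample a metric into one-minute slices with
-- linear gap-filling interpolation.
SELECT slice_time, host,
       TS_FIRST_VALUE(cpu_pct, 'linear') AS cpu_pct
FROM metrics
TIMESERIES slice_time AS '1 minute' OVER (PARTITION BY host ORDER BY ts);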

Published Date : Mar 30 2020

UNLIST TILL 4/2 - The Shortest Path to Vertica – Best Practices for Data Warehouse Migration and ETL


 

>> Jeff: Hello everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "The Shortest Path to Vertica — Best Practices for Data Warehouse Migration and ETL." I'm Jeff Healey, I lead Vertica marketing, and I'll be your host for this breakout session. Joining me today are Marco Guesser and Mauricio Felicia, Vertica engineers joining us from the EMEA region. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait — just type your question or comment in the question box below the slides and click Submit. As always, there will be a Q&A session at the end of the presentation; we'll answer as many questions as we're able to during that time, and any questions we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica forums at forum.vertica.com to post your questions there after the session; our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week; we'll send you a notification as soon as it's ready. Now let's get started — over to you, Marco. >> Marco: Hello everybody, this is Marco speaking, a sales engineer from EMEA. I'll just get going. This is the agenda: part one will be done by me, part two will be done by Mauricio. As you can see, the agenda covers big bang versus piece by piece; migration of the DDL; migration of the physical data model; migration of ETL, SQL, and BI functionality; what to do with stored procedures; what to do with any possible existing user-defined functions; and migration of the data itself, which will be covered by Mauricio. >> Mauricio: Yeah, hello everybody, my name is Mauricio Felicia, and I'm a Vertica pre-sales engineer. Like Marco said, I'm going to talk about how to optimize the data warehouse using some specific Vertica techniques, like table flattening and live aggregate projections. So let me start with a quick overview of the data warehouse migration process we are going to talk about today. Normally we suggest starting by migrating the current data warehouse as it is, with limited or minimal changes in the overall architecture. Clearly, we will have to port the DDL and redirect the data access tools to the new platform, but we should minimize, in this initial phase, the amount of changes, in order to go live as soon as possible. In the second phase we can start optimizing the data warehouse, again with no or minimal changes in the architecture as such. During this optimization phase we can create, for example, projections for some specific queries, or optimize encoding, or change some of the physical design — this is something that we normally do if and when needed. And finally, again if and when needed, we go through an architectural redesign, using full Vertica techniques in order to take advantage of all the features we have in Vertica. This is normally an iterative approach, so we go back and forth between some of the specific features and the architecture and design. We will go through this process in the next few slides. >> Marco: OK. So, in order to encourage everyone to keep using their common sense when migrating to a new database management system — people are often afraid of it — it's often useful to use the analogy of a house move.
In your old home you might have developed solutions for your everyday life that make perfect sense there. For example, if your old Saint Bernard dog can't walk anymore, you might be using a forklift to heave him in through your window. Well, in the new home, consider the elevator, and don't complain that the window is too small to fit the dog through. It is very much the same when migrating to Vertica. To make the transition gentle, and to remain in my analogy with the house move: picture your new house as your new holiday home. Begin to install everything you miss and everything you like from your old home, and once you have everything you need in your new house, you can shut down the old one. So move piece by piece and go for quick wins to make your audience happy. You do big bang only if they are going to retire the platform you are sitting on, if you're really on a sinking ship. Otherwise, again: identify quick wins, implement and publish them quickly in Vertica, reap the benefits, enjoy the applause, use the gained reputation for further funding, and if you find that nobody's using the old platform anymore, you can shut it down. If you really have to migrate in one go, go big bang only if you absolutely have to; otherwise migrate by subject area, and group all similar or clearly related divisions together. Having said that: you start off by migrating objects, objects in the database; that's one of the very first steps. It consists of migrating, first, the places where you can put the other objects into, that is, owners and locations, which is usually schemas. Then what do you have? You extract tables and views, then you convert the object definitions and deploy them to Vertica. And I think that you shouldn't do it manually: never type what you can generate, automate whatever you can. Use that for roles: usually there is a system table in the old database that contains all the roles. You can export those to a file, reformat them, and then you have CREATE ROLE and CREATE USER scripts that you can apply to Vertica. If LDAP or Active Directory was used for authentication in the old database, Vertica supports anything within the LDAP standard. Catalogs and schemas should be relatively straightforward, with maybe sometimes a difference: Vertica does not restrict you by defining a schema as a collection of all objects owned by a user, but it supports it, it emulates it for old times' sake. Vertica does not need the catalog; or if you absolutely need the catalog for the old tools that you use, it is always set to the name of the database in the case of Vertica. Having now the schemas, the catalogs, the users and the roles in place, move on to the data definition language, the DDL. If you are allowed to, it's best to use a tool that translates the data types in the DDL it generates. You might have seen a mention of odb already; you will hear it mentioned by me several times in this presentation, and we are very happy to have it. It can actually export the old database's table definitions, because it works with ODBC: it takes what the old database's ODBC driver translates to ODBC, and then it has internal translation tables to map to several target DBMS flavors, the most important of which is obviously Vertica. If they force you to use something else, there are always tools like SQL*Plus in Oracle, the SHOW TABLE command in Teradata, et cetera; each DBMS should have a set of tools to extract the object definitions to be deployed in another instance of the same DBMS.
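As a concrete illustration of "never type what you can generate," a minimal sketch run against the old database's catalog. The table and column names here are entirely hypothetical and depend on your source DBMS (Oracle would use DBA_ROLES, Teradata its own DBC tables, and so on):

    -- Run against the OLD database, not Vertica; names are made up.
    SELECT 'CREATE ROLE ' || role_name || ';'
      FROM sys_catalog.roles
     ORDER BY role_name;

    -- Spool the output to a file, review and reformat it as needed,
    -- then run the resulting script against Vertica.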
If I talk about views: usually a view definition is also available in the old database catalog. One thing that needs a bit of special care is synonyms. Synonyms are something that has to get emulated in different ways depending on the specific needs: a view on the table to be referred to, or something that is really neat but other databases don't have, the search path. The search path works very much like the PATH environment variable in Windows or Linux: you specify an object name without the schema name, and Vertica searches for it first in the first entry of the search path, then in the second, then in the third, which makes synonyms completely unneeded. When you generate DDL, to remain in the analogy of moving house, dust and clean your stuff before placing it in the new house. If you see a table like the one here at the bottom, this is usually the corpse of a bad migration in the past: an ID is usually an integer and not a floating-point data type, a first name hardly ever has 256 characters, and if a column is called HIRE_DT it's not necessarily needed to store the second when somebody was hired. So take good care, while you are moving, to dust off your stuff and use better data types. The same applies especially to strings: how many bytes does a string contain? Take a string of four euro signs: it's not four bytes, it's actually twelve in UTF-8, the way that Vertica encodes strings; an ASCII character takes one byte, but the euro sign takes three. That means that when you have a single-byte character set as a source, you have to pay attention: oversize the strings at first, because otherwise data gets rejected or truncated, and then you will have to very carefully check what the best size is. The most promising approach is to initially dimension strings in multiples of their original length; odb, with the command-line option you see on the slide, will double the length of what would otherwise be single-byte characters, and multiply accordingly the length of what were wide characters in traditional databases. Then load a representative sample of your source data and profile it, using tools like the ones we ourselves use, to find the actual longest values, and then make the columns shorter.
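A minimal sketch of that profiling step, assuming a staging table with a deliberately oversized VARCHAR column (table and column names are illustrative):

    -- How long are the loaded values really, in bytes and in characters?
    SELECT MAX(OCTET_LENGTH(first_name))     AS max_bytes,
           MAX(CHARACTER_LENGTH(first_name)) AS max_chars
      FROM staging.customers;

    -- Use the result to size the column sensibly in the final DDL,
    -- for example VARCHAR(40) instead of VARCHAR(256).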
Later you will hear about the issues of having too long and too big data types in projection design. We live and die with our projections; you might remember the rules on how default projections come to exist. The way we do it initially would be, just like for the profiling, to load a representative sample of the data, collect a representative set of already-known queries, and run the Vertica Database Designer. You don't have to decide immediately; you can always amend things. Otherwise, follow the laws of physics: avoid moving data back and forth across nodes and avoid heavy I/O, if you design your projections initially by hand. Encoding matters: you should know that the Database Designer is a very tight-fisted thing, it will optimize to use as little space as possible. You will have to weigh the fact that if you compress very well, you might end up using more time reading it back. This is a test we ran once using several encoding types, and you see that RLE, run-length encoding, if the data is sorted, is barely even visible in cost, while the others are considerably slower. You can get these slides and look at them in detail later; I won't go into detail now.

You will now hear about BI migrations. Usually you can expect 80 percent of everything to be able to be lifted and shifted. You don't need most of the pre-aggregated tables, because we have live aggregate projections. Many BI tools have specialized query objects for the dimensions and the facts, and we have the possibility to use flattened tables, which are going to be talked about later; you might have to rewrite those by hand. You will be able to switch off caching, because Vertica speeds up everything. With live aggregate projections, if you have worked with MOLAP cubes before, you very probably won't need them at all. ETL tools: what you will have to do is, if you do it row by row in the old database, consider changing everything to very big transactions; and if you use INSERT statements with parameter markers, consider writing to named pipes and using Vertica's COPY command instead of mass inserts. Yeah, COPY command, that's what I have here.
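As a small sketch of that pattern (path, delimiter, and table name are examples only): the ETL tool writes rows to a named pipe, and Vertica bulk-loads from it in one pass instead of running many single-row inserts.

    -- The ETL process writes to /data/pipes/orders.fifo, a named pipe
    -- created beforehand with mkfifo; COPY reads it as a stream.
    COPY sales.orders
    FROM '/data/pipes/orders.fifo'
    DELIMITER '|'
    REJECTED DATA '/data/rejects/orders.rej'
    DIRECT;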
As to custom functionality: you can see on this slide that Vertica has the biggest number of built-in functions in the database; we compare them regularly, and it's by far more than any other database. You might find that many of the functions that you have written won't be needed in the new database, so look at the Vertica catalog instead of trying to migrate a function that you don't need. Stored procedures are very often used in the old database to overcome shortcomings that Vertica doesn't have. Very rarely will you have to actually write a procedure that involves a loop; it's really, in our experience, very, very rare. Usually you can just switch to standard scripting, and this is basically repeating what Mauricio said. In the interest of time I will skip this slide, but look at this one here: most of the data warehouse migration tasks should be automated. You can automate DDL migration using odb, which is crucial. Data profiling is not crucial, but game-changing, and encoding is the same thing; you can automate it using the Database Designer. The physical data model optimization in general is game-changing: you have the Database Designer. For user provisioning, use the old platform's tools to generate the SQL; having no objects without their owners is crucial. And as to functions and procedures, they are only crucial if they embody the company's intellectual property; otherwise you can almost always replace them with something else. That's it from me for now. Thank you.

>> Mauricio: Thank you, Marco. So we will now continue our presentation by talking about some of the Vertica optimization techniques that we can implement in order to improve the general efficiency of the data warehouse. Let me start with a few simple messages. The first one is that you are supposed to optimize only if and when it is needed. In most cases, just a lift and shift from the old data warehouse to Vertica will give you exactly the performance you were looking for, or even better, and in that case it's probably not really needed to optimize anything. In case you want to optimize, or you need to optimize, then keep in mind some of the Vertica peculiarities: for example, implement deletes and updates in the Vertica way; use live aggregate projections in order to avoid, or better, in order to limit, the GROUP BY executions at query time; use table flattening in order to avoid or limit joins. And then you can also use some specific Vertica extensions, for example time series analysis or machine learning, on top of your data. We will now start by reviewing the first of these bullets: optimize if and when needed. Well, if, when you migrate from the old data warehouse to Vertica without any optimization, the performance level you get in the first few months is okay, then probably you don't need to touch anything. But if this is not the case, one very easy optimization technique that you can use is to ask Vertica itself to optimize the physical data model, using the Vertica Database Designer. DBD, the Vertica Database Designer, has several interfaces; here I'm going to use what we call the DBD programmatic API, so basically SQL functions. Using other databases you might need to hire experts to look at your data, your data warehouse, your table definitions, creating indexes or whatever; in Vertica, all you need is to run something as simple as six single SQL statements to get a very well optimized physical data model. You see that we start by creating a new design; then we add to the design the tables and the queries that we want to optimize; we set our target (in this case we are tuning the physical data model in order to maximize query performance, which is why we are using the query objective in our statement; another possible goal would be to tune in order to reduce storage, or a mix between tuning storage and tuning queries); and finally we ask Vertica to produce and deploy this optimized design. It's a matter of literally minutes, and in a few minutes what you can get is a fully optimized physical data model. This is something very, very easy to implement.
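A sketch of what those six statements might look like. The design name, table, and file paths are examples, and the exact function signatures can vary by version, so check the documentation:

    SELECT DESIGNER_CREATE_DESIGN('migration_design');
    SELECT DESIGNER_ADD_DESIGN_TABLES('migration_design', 'public.unit_sold');
    SELECT DESIGNER_ADD_DESIGN_QUERIES('migration_design',
           '/home/dbadmin/queries.sql');
    SELECT DESIGNER_SET_OPTIMIZATION_OBJECTIVE('migration_design', 'QUERY');
    SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('migration_design',
           '/home/dbadmin/design.sql', '/home/dbadmin/deploy.sql');
    SELECT DESIGNER_DROP_DESIGN('migration_design');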
Keep in mind some of the Vertica peculiarities. Vertica is very well tuned for load and query operations, and it writes rows into ROS containers. A ROS container is a group of files, and we will never, ever change the content of these files. The fact that the ROS container files are never modified is one of the Vertica peculiarities, and this approach lets us use minimal locks: we can run multiple load operations in parallel against the very same table, assuming we don't have a primary or unique constraint on the target table, because they will end up in two different ROS containers. A SELECT in read committed isolation requires no locks and can run concurrently with an INSERT...SELECT, because the SELECT will work on a snapshot of the catalog taken when the transaction starts; this is what we call snapshot isolation. And recovery, because we never change our ROS files, is very simple and robust. So we have a huge amount of advantages due to the fact that we never change the content of the ROS containers. But on the other side, deletes and updates require a little attention. So, what about delete? First: when you delete in Vertica, you basically create a new object called a delete vector. It appears a bit later, in ROS or in memory, and this vector will point to the data being deleted, so that when a query is executed, Vertica will just ignore the rows listed in the delete vectors. And it's not just about the delete: an update in Vertica consists of two operations, delete and insert, and a merge consists of either an insert or an update, which in turn is made of delete and insert. So basically, if we tune how the delete works, we will also have tuned the update and the merge. So what should we do in order to optimize delete? Well, remember what we said: every time we delete, we actually create a new object, a delete vector. So avoid committing deletes and updates too often; this reduces the work for the mergeout and the other removal activities that run afterwards. And be sure that all the interested projections contain the columns referenced in the delete predicate: this will let Vertica directly access the projection, without having to go through the superprojection in order to create the delete vector, and the delete will be much, much faster. And finally, another very interesting optimization technique is trying to segregate the update and delete operations from the querying workload, in order to reduce lock contention, and this can be done using partition operations. This is exactly what I want to talk about now. Here you have a typical data warehouse architecture: we have data arriving in a landing zone, where the data is loaded as is from the data sources; then we have a transformation layer writing into a staging area, which in turn feeds the partitioned blocks of data in the green data structures we have at the end. Those green data structures at the end are the ones used by the data access tools when they run their queries. Sometimes we might need to change old data, for example because we have late records, or maybe because we want to fix some errors that originated in the feeds. So what we do in this case is just copy the partition we want to change or adjust back from the green area at the end to the staging area; this is a very fast operation, a partition copy. Then we run our updates, or our adjustment procedures, or whatever we need in order to fix the errors, on the data in the staging area, and at the very same time people continue to query the green data structures at the end, so we will never have contention between the two operations. When the updating in the staging area is completed, all we have to do is run a swap partition between the tables, in order to swap the data that we just finished adjusting in the staging zone with the query area, the green one at the end. This swap partition is very fast (it's an atomic operation), and basically what happens is just that we exchange the pointers to the data. This is a very, very effective technique, and lots of customers use it.
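A sketch of that flow with Vertica's partition metafunctions; table names and the partition key range are examples:

    -- 1. Copy the partition to be corrected into a staging table.
    SELECT COPY_PARTITIONS_TO_TABLE('store.sales_fact', '2020-02', '2020-02',
                                    'staging.sales_fact_fix');

    -- 2. Run the updates / corrections against staging.sales_fact_fix
    --    while queries keep running against store.sales_fact.

    -- 3. Atomically swap the corrected partition back in.
    SELECT SWAP_PARTITIONS_BETWEEN_TABLES('staging.sales_fact_fix',
                                          '2020-02', '2020-02',
                                          'store.sales_fact');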
So why flattened tables and live aggregate projections? Well, basically we use flattened tables and live aggregate projections to minimize or avoid joins (that is what flattened tables are used for) and GROUP BYs (that is what live aggregate projections are used for). Now, compared to traditional data warehouses, Vertica can store, process, aggregate and join orders of magnitude more data: it is a true columnar database, and joins and GROUP BYs normally are not a problem at all, they run faster than in any traditional data warehouse. But there are still scenarios where the data sets are so big (we are talking about petabytes of data) and growing so quickly that we need something in order to boost GROUP BY and join performance. And this is why you can use live aggregate projections to perform aggregations at loading time and limit the need for GROUP BYs at query time, and flattened tables to combine information from different entities at loading time and, again, avoid running joins at query time. Okay, so, live aggregate projections. At this point in time we can use live aggregate projections with four built-in aggregate functions, which are SUM, MIN, MAX and COUNT. Let's see how this works. Suppose that you have a normal table, in this case a table unit_sold with three columns, PID, date-time and quantity, which has been segmented in a given way. On top of this base table, which we call the anchor table, we create a projection. You see that we create the projection using a SELECT that aggregates the data: we take the PID, we take the date portion of the date-time, and we take the sum of quantity from the base table, grouping on the first two columns, so PID and the date portion of the date-time. Okay, so what happens when we load data into the base table? All we have to do is load data into the base table. When we do that, we will of course fill the base projections (assuming we are running with K-safety 1, we will have two projections), and we will load into those two projections all the detailed data we are loading into the table, so PID, date-time and quantity. But at the very same time, without having to do any particular operation or run any ETL procedure, we will also get, automatically, in the live aggregate projection, the data pre-aggregated, with the PID, the date portion of the date-time, and the sum of quantity in the column named total_quantity. You see, this is something that we get for free, without having to run any specific procedure, and this is very, very efficient. So the key concept is that the loading operation, from the DML point of view, is executed against the base table; we do not explicitly aggregate data, and we don't have any ETL procedure. The aggregation is automatic, and Vertica brings the data into the live aggregate projection every time we load into the base table. You see the two SELECTs we have on the left side of this slide: those two SELECTs will produce exactly the same result. Running SELECT PID, date, SUM(quantity) from the base table, or running SELECT * from the live aggregate projection, will result in exactly the same data. This is of course very useful, but what is much more useful, and we can observe this if we run an EXPLAIN, is that if we run the SELECT against the base table asking for the grouped data, what happens behind the scenes is that Vertica sees there is a live aggregate projection with the data that has already been aggregated during the loading phase, and rewrites the query to use the live aggregate projection. This happens automatically. You see, this is a query that ran a GROUP BY against unit_sold, and Vertica decided to rewrite the query as something that will be executed against the live aggregate projection, because it knows this will save a huge amount of time and effort at query time. And it is not just limited to the information you chose to aggregate: another query, like a SELECT COUNT(DISTINCT ...), which you might note is not among the four built-in functions, will also take advantage of the live aggregate projection for its GROUP BY part, and again, this is something that happens automatically; you don't have to do anything to get this.
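A simplified sketch of that example. Here the date is stored directly as a DATE column rather than extracted from a timestamp, and all names are illustrative:

    CREATE TABLE public.unit_sold (
        pid       INT,
        sale_date DATE,
        qty       INT
    );

    -- Live aggregate projection: Vertica maintains the SUM at load time.
    CREATE PROJECTION public.unit_sold_agg AS
        SELECT pid, sale_date, SUM(qty) AS total_qty
          FROM public.unit_sold
         GROUP BY pid, sale_date;

    -- These return the same data; the optimizer can rewrite the first
    -- query to read the pre-aggregated projection instead.
    SELECT pid, sale_date, SUM(qty)
      FROM public.unit_sold GROUP BY pid, sale_date;
    SELECT * FROM public.unit_sold_agg;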
One thing that we have to keep very, very clear in mind: what we store in the live aggregate projection is partially aggregated data. So in this example we have two INSERTs: you see that the first INSERT is inserting four rows, and the second INSERT is inserting five rows. Well, for each of these INSERTs we will have a partial aggregation; you never know whether after the first INSERT there will be a second one, so Vertica calculates the aggregation of the data every time we run an INSERT. This is a key concept, and it also means that you can maximize the effectiveness of this technique by inserting large chunks of data. If you insert data row by row, this technique, the live aggregate projection, is not very useful, because for every row that you insert you will have an aggregation, so basically the live aggregate projection will end up containing the same number of rows that you have in the base table. But if you insert large chunks of data every time, the number of aggregations you will have in the live aggregate structure is much smaller than the base data. This is a key concept. You can see how this works by counting the number of rows that you have in the live aggregate projection: if you run the SELECT COUNT(*) from the unit_sold live aggregate projection (the query on the left side), you will get four rows, but if you EXPLAIN this query you will see that it was reading six rows. This is because each of those two INSERTs that we ran inserted three rows into the live aggregate projection. So, again the key concept: live aggregate projections keep partially aggregated data; the final aggregation will always happen at runtime. Okay, another structure which is very similar to the live aggregate projection is what we call a top-K projection. We do not actually aggregate anything in the top-K projection; we just keep the last rows, or limit the amount of rows that we keep, using the LIMIT ... OVER (PARTITION BY ... ORDER BY ...) clause. In this case we create, on top of the base table, two top-K projections: one to keep the last quantity that has been sold, and the other one to keep the max quantity. In both cases it is just a matter of ordering the data, in the first case by the date-time column and in the second case by quantity, and in both cases we fill the projection with just the last row. And again, this is something that happens automatically when we insert data into the base table. If we now run, after the INSERTs, our SELECT against either the max quantity or the last quantity, we will get just the very latest values; you see that we have much fewer rows in the top-K projections.
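A sketch of the two top-K projections described; again, names are illustrative:

    -- Keep, per product, only the most recent sale...
    CREATE PROJECTION public.last_sale AS
        SELECT pid, sale_date, qty
          FROM public.unit_sold
         LIMIT 1 OVER (PARTITION BY pid ORDER BY sale_date DESC);

    -- ...and, per product, the row with the highest quantity.
    CREATE PROJECTION public.max_qty AS
        SELECT pid, sale_date, qty
          FROM public.unit_sold
         LIMIT 1 OVER (PARTITION BY pid ORDER BY qty DESC);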
We said at the beginning that we can use four built-in functions; you might remember, MIN, MAX, SUM and COUNT. What if I want to create my own specific aggregation on top of the live aggregate projections, because our customers have very specific needs? Well, in this case you can code your own live aggregate projection user-defined functions: you can create a user-defined transform function to implement any sort of complex aggregation while loading data. Basically, after you have implemented this UDTF, you can deploy it using a pre-pass approach, which means the data is aggregated at loading time, during the data ingestion, or a batch approach, which means the aggregation runs afterwards, on top of the loaded data. Things to remember on live aggregate projections: they are limited to the built-in functions, again SUM, MAX, MIN and COUNT, but you can code your own UDTFs, so you can do whatever you want. They can reference only one table. And for Vertica versions before 9.3 it was impossible to update or delete on the anchor table; this limit has been removed in 9.3, so you now can update and delete data from the anchor table. Live aggregate projections will follow the segmentation of the GROUP BY expression, and in some cases the optimizer can decide to pick the live aggregate projection or not, depending on whether the aggregation is convenient: remember that if we insert and commit every single row to the anchor table, then we end up with a live aggregate projection that contains exactly the same number of rows, and in that case reading the live aggregate projection or the base table would be the same. Okay, so this is the first of the two fantastic techniques that we can implement in Vertica: the live aggregate projection, basically, to avoid or limit GROUP BYs. The other one, which we are going to talk about now, is the flattened table, which we use in order to avoid the need for joins. Remember that Vertica is very fast at running joins, but when we scale up to petabytes of data we need a boost, and this is what we have in order to fix this problem, regardless of the amount of data we are dealing with. So, what about flattened tables? Let me start with normalized schemas. Everybody knows what a normalized schema is, and the related material on this slide. The main purpose of a normalized schema is to reduce data redundancy, and the fact that we reduce data redundancy is a good thing, because we obtain fast writes: we only have to write small chunks of data into the right tables. The problem with these normalized schemas is that when you run your queries, you have to put together the information that arrives from different tables, and you are required to run joins. Again, Vertica normally is very good at running joins, but sometimes the amount of data makes joins not easy to deal with, and joins are sometimes not easy to tune. What happens in the normal, let's say traditional, data warehouse is that we denormalize the schemas, either manually or using an ETL. So basically we have on one side, on the left side of this slide, the normalized schemas, where we get very fast writes, and on the other side the wide tables, where we have run all the joins and pre-aggregations in order to prepare the data for the queries. So we have fast writes on the left and fast reads on the right, and the problem lies in the middle, because we have pushed all the complexity into the middle, into the ETL that has to transform the normalized schema into the wide table. The way we normally implement this, either manually using procedures or using an ETL tool, is that we have to code an ETL layer that runs the INSERT...SELECT reading from the normalized schema and writing into the wide table at the end, the one that is used by the data access tools to run the queries. This approach is costly, because of course someone has to code the ETL; it is slow, because someone has to execute those batches, normally overnight after loading the data, and maybe someone has to check the following morning that everything was okay with the batch; it is resource-intensive, of course, and also human-intensive, because of the people who have to code and check the results; it is error-prone, because it can fail; and it introduces latency, because there is a gap on the time axis between the time t0, when you load the data into the normalized schema, and the time t1, when the data is finally ready to be queried. So what we do in Vertica to facilitate this process is to create flattened tables. With the flattened table, first, you avoid data redundancy, because you don't need both the wide table and the normalized schema on the left side. Second, it is fully automatic: you don't have to do anything, you just have to insert the data into the anchor table, and the ETL that you would have coded is transformed into an INSERT...SELECT by Vertica automatically; you don't have to do anything.
It's robust, and the latency is zero: as soon as you load the data into the anchor table, you get all the joins executed for you. So let's have a look at how it works. In this case we have the table we are going to flatten, and basically we have to focus on two different clauses. The first one: you see that there is one column here, dimension_value, which can be defined with either DEFAULT and then the SELECT, or SET USING. The difference between DEFAULT and SET USING is when the data is populated: if we use DEFAULT, the data is populated as soon as we load the data into the base table; if we use SET USING, we have to run a refresh. But everything is there. I mean, you don't need an ETL, you don't need to code any transformation, because everything is in the table definition itself, and it's for free, and of course the latency is zero: as soon as you load the other columns, you have the dimension value as well. Let's see an example here. Suppose we have a dimension table, customer_dimension, on the left side, and we have a fact table on the right. You see that the fact table uses columns like o_name or o_city, which are basically the result of a SELECT on top of the customer dimension. This is where the join is executed: as soon as we load data into the fact table (directly into the fact table, without of course loading the data that arrives from the dimension), all the data from the dimension is populated automatically. So let's have an example here: suppose that we are running this INSERT. As you can see, we are running the INSERT directly against the fact table, and we are loading o_id, customer_id and total. We are not loading name or city; those, name and city, will be automatically populated by Vertica for you, because of the definition of the flattened table. You see? That's all you need in order to have your wide table, your flattened table, built for you, and this means that at runtime you won't need any join between the base fact table and the customer dimension that we used in order to calculate name and city, because the data is already there. This was using DEFAULT; the other option is using SET USING. The concept is absolutely the same: you see that in this case, on the right side, we have basically replaced the o_name DEFAULT with o_name SET USING, and the same is true for city. The concept, as I said, is the same, but in this case, with SET USING, we have to refresh: you see that we have to run this SELECT REFRESH_COLUMNS and then the name of the table. In this case all columns are refreshed, or you can specify only certain columns, and this brings in the values for name and city, reading from the customer dimension. So this technique is extremely useful. The difference between DEFAULT and SET USING, just to summarize the most important points: you just have to remember that DEFAULT populates your target when you load, SET USING when you refresh. And in some cases you might need to use them both: in this example here, you see that we define o_name using both DEFAULT and SET USING, and this means that we get the data populated either when we load the data into the base table or when we run the refresh.
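A compact sketch of both variants; table and column names are illustrative:

    CREATE TABLE public.customer_dim (
        cust_id INT PRIMARY KEY,
        name    VARCHAR(80),
        city    VARCHAR(80)
    );

    CREATE TABLE public.order_fact (
        oid     INT,
        cust_id INT,
        total   NUMERIC(12,2),
        -- DEFAULT: populated automatically at load time.
        o_name  VARCHAR(80) DEFAULT (SELECT name FROM public.customer_dim d
                                      WHERE d.cust_id = order_fact.cust_id),
        -- SET USING: populated when you run the refresh.
        o_city  VARCHAR(80) SET USING (SELECT city FROM public.customer_dim d
                                        WHERE d.cust_id = order_fact.cust_id)
    );

    INSERT INTO public.order_fact (oid, cust_id, total) VALUES (1, 42, 99.90);
    -- o_name is already filled in; o_city arrives with:
    SELECT REFRESH_COLUMNS('public.order_fact', 'o_city');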
This is a summary of the techniques that we can implement in Vertica in order to make our data warehouses even more efficient. And, well, basically this is the end of our presentation. Thank you for listening, and now we are ready for the Q&A session.

End-to-End Security in Vertica


 

>> Paige: Hello everybody and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled End-to-End Security in Vertica. I'm Paige Roberts, Open Source Relations Manager at Vertica. I'll be your host for this session. Joining me are Vertica software engineers Fenic Fawkes and Chris Morris. Before we begin, I encourage you to submit your questions or comments during the virtual session. You don't have to wait until the end. Just type your question or comment in the question box below the slide as it occurs to you and click submit. There will be a Q&A session at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Also, you can visit the Vertica forums to post your questions there after the session. Our team is planning to join the forums to keep the conversation going, so it'll be just like being at a conference and talking to the engineers after the presentation. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And before you ask: yes, this whole session is being recorded and it will be available to view on-demand this week. We'll send you a notification as soon as it's ready. I think we're ready to get started. Over to you, Fen.

>> Fenic: Hi, welcome everyone. My name is Fen. My pronouns are fae/faer, and Chris will be presenting the second half; his pronouns are he/him. So to get started, let's go over what the goals of this presentation are. First off, no deployment is the same, so we can't give you an exact "here's the right way to secure Vertica," because how your deployment is set up is a factor. But the biggest one is: what is your threat model? If you don't know what a threat model is, let's take an example. We're all working from home because of the coronavirus, and that introduces certain new risks. Our source code is on our laptops at home, that kind of thing. But really, our threat model isn't that people will read our code and copy it over our shoulders, so we've encrypted our hard disks and that kind of thing to make sure that no one can get them. So basically, what we're going to give you are building blocks, and you can pick and choose the pieces that you need to secure your Vertica deployment. We hope that this gives you a good foundation for how to secure Vertica. And now, what we're going to talk about. We're going to start off by going over encryption: how to secure your data from attackers. Then authentication, which is how you log in. Identity: who are you? Authorization: now that we know who you are, what can you do? Delegation is about how Vertica talks to other systems. And then auditing and monitoring. So, how do you protect your data in transit? Vertica makes a lot of network connections; here are the important ones. Clients talk to the Vertica cluster. The Vertica cluster talks to itself. It can also talk to other Vertica clusters, and it can make connections to a bunch of external services. So first off, let's talk about client-server TLS. This is how you secure data between Vertica and clients. It prevents an attacker from sniffing network traffic and, say, picking out sensitive data. Clients have a way to configure how strict the authentication of the server cert is.
It's called the Client SSLMode, and we'll talk about this more in a bit, but authentication methods can disable non-TLS connections, which is a pretty cool feature. Okay, so Vertica also makes a lot of network connections within itself. If Vertica is running behind a strict firewall and you have really good network security, both physical and software, then it's probably not super important that you encrypt all traffic between nodes. But if you're on a public cloud, you can set up AWS' firewall to prevent connections, but if there's a vulnerability in that, then your data's all totally vulnerable. So it's a good idea to set up inter-node encryption in less secure situations. Next, import/export is a good way to move data between clusters. So for instance, say you have an on-premises cluster and you're looking to move to AWS. Import/export is a great way to move your data from your on-prem cluster to AWS, but that means that the data is going over the open internet. And that is another case where an attacker could try to sniff network traffic and pull out credit card numbers, or whatever you have stored in Vertica that's sensitive. So it's a good idea to secure data in that case. And then we also connect to a lot of external services. Kafka, Hadoop, S3 are three of them. Voltage SecureData, which we'll talk about more in a sec, is another. And because of how each service deals with authentication, how to configure your authentication to them differs. So, see our docs. And then I'd like to talk a little bit about where we're going next. Our main goal at this point is making Vertica easier to use. Our first objective was security, to make sure everything could be secure, so we built relatively low-level building blocks. Now that we've done that, we can identify common use cases and automate them. And that's where our attention is going. Okay, so we've talked about how to secure your data over the network, but what about when it's on disk? There are several different encryption approaches; each depends on what your use case is. RAID controllers and disk encryption are mostly for on-prem clusters, and they protect against media theft. They're invisible to Vertica. S3 and GCP encryption are kind of the equivalent in the cloud. They're also invisible to Vertica. And then there's field-level encryption, which we accomplish using Voltage SecureData, which is format-preserving encryption. So how does Voltage work? Well, it encrypts values to things that look like the same format. So for instance, you can see date of birth encrypted to something that looks like a date of birth but is not in fact the same thing. You can do cool stuff like, with a credit card number, you can encrypt only the first 12 digits, allowing the user to, you know, validate the last four. The benefits of format-preserving encryption are that it doesn't increase database size, and you don't need to alter your schema or anything. And because of referential integrity, it means that you can do analytics without unencrypting the data. So again, a little diagram of how you could work Voltage into your use case. And you could even work with Vertica's row and column access policies, which Chris will talk about a bit later, for even more customized access control, depending on your use case and your Voltage integration. We are enhancing our Voltage integration in several ways in 10.0, and if you're interested in Voltage, you can go see their virtual BDC talk.
And then again, talking about the roadmap a little: we're working on in-database encryption at rest. What this means is a Vertica solution to encryption at rest that doesn't depend on the platform that you're running on. Encryption at rest is hard. (laughs) Encrypting, say, 10 petabytes of data is a lot of work. And once again, the theme of this talk is that everyone has a different key management strategy, a different threat model, so we're working on designing a solution that fits everyone. If you're interested, we'd love to hear from you. Contact us on the Vertica forums. All right, next up we're going to talk a little bit about access control. So first off: how do I prove who I am? How do I log in? Vertica has several authentication methods. Which one is best depends on your deployment size and use case. Again, the theme of this talk is that what you should use depends on your use case. You can order authentication methods by priority and origin. So for instance, you can only allow connections from within your internal network, or you can enforce TLS on connections from external networks but relax that for connections from your internal network, that kind of thing. We have a bunch of built-in authentication methods. They're all password-based. User profiles allow you to set complexity requirements for passwords, and you can even reject non-TLS connections, say, or reject certain kinds of connections. These should only be used by small deployments, because if you're a larger deployment, you probably have an LDAP server where you manage users, and rather than duplicating passwords and users, you should use LDAP Auth. With LDAP Auth, Vertica still has to keep track of users, but each user can then use LDAP authentication, and Vertica doesn't store the password at all. The client gives Vertica a username and password, and Vertica then asks the LDAP server: is this a correct username and password? The benefits of this are manyfold. If, say, you delete a user from LDAP, you don't need to remember to also delete their Vertica credentials. They just won't be able to log in anymore, because they're not in LDAP anymore. If you like LDAP but you want something a little bit more secure, Kerberos is a good idea. Similar to LDAP, Vertica doesn't keep track of who's allowed to log in, it just keeps track of the Kerberos credentials, and Vertica never even touches the user's password. Users log in to Kerberos and then they pass Vertica a ticket that says "I can log in." It is more complex to set up, so if you're just getting started with security, LDAP is probably a better option. But Kerberos is, again, a little bit more secure. If you're looking for something that works well for applications, certificate auth is probably what you want. Rather than hardcoding a password, or storing a password in a script that you use to run an application, you can instead use a certificate. So, if you ever need to change it, you can just replace the certificate on disk, and the next time the application starts, it picks that up and logs in. And then, multi-factor auth is a feature request we've gotten in the past; it's not built in to Vertica, but you can do it using Kerberos. Security is a whole-application concern, and fitting MFA into your workflow is all about fitting it in at the right layer, and we believe that that layer is above Vertica. If you're interested in more about how MFA works and how to set it up, we wrote a blog on how to do it.
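Going back to ordering authentication methods by origin and strictness, a sketch of what that can look like. Method names, network ranges, and the LDAP server are examples, and parameter names vary by version, so check the documentation:

    -- LDAP authentication for connections from the internal network.
    CREATE AUTHENTICATION v_ldap METHOD 'ldap' HOST '10.0.0.0/8';
    ALTER AUTHENTICATION v_ldap SET host = 'ldap://ldap.example.com',
                                    basedn = 'dc=example,dc=com';
    GRANT AUTHENTICATION v_ldap TO PUBLIC;

    -- Password (hash) authentication from anywhere else, but only over TLS.
    CREATE AUTHENTICATION v_hash_tls METHOD 'hash' HOST TLS '0.0.0.0/0';
    GRANT AUTHENTICATION v_hash_tls TO PUBLIC;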
And now, over to Chris, for more on identity and authorization.

>> Chris: Thanks, Fen. Hi everyone, I'm Chris. So, we're a Vertica user and we've connected to Vertica, but once we're in the database, who are we? What are we? In Vertica, the answer to that question is principals: users and roles, which are like groups in other systems. Since roles can be enabled and disabled at will, and multiple roles can be active, they're a flexible way to use only the privileges you need in the moment. For example, here you've got Alice, who has DBADMIN as a role, and those are some elevated privileges. She probably doesn't want them active all the time, so she can set the role and add them to her identity set. All of this information is stored in the catalog, which is basically Vertica's metadata storage. How do we manage these principals? Well, it depends on your use case, right? If you're a small organization, or maybe only some people or services need Vertica access, the solution is just to manage it with Vertica. You can see some commands here that will let you do that. But what if we're a big organization and we want Vertica to reflect what's in our centralized user management system? It's a similar motivating use case to LDAP authentication, right? We want to avoid duplication hassles; we just want to centralize our management. In that case, we can use Vertica's LDAPLink feature. With LDAPLink, principals are mirrored from LDAP. They're synced in a configurable fashion from LDAP into Vertica's catalog. What this does is manage creating and dropping users and roles for you, and then mapping the users to the roles. Once that's done, you can do any Vertica-specific configuration on the Vertica side. It's important to note that principals created in Vertica this way support multiple forms of authentication, not just LDAP. This is a separate feature from LDAP authentication, and if you created a user via LDAPLink, you could have them use a different form of authentication, Kerberos, for example. Up to you. Now of course this kind of system is pretty mission-critical, right? You want to make sure you get the right roles and the right users and the right mappings in Vertica. So you probably want to test it. And for that, we've got new and improved dry-run functionality, from 9.3.1. What this feature offers you is new metafunctions that let you test various parameters without breaking your real LDAPLink configuration. So you can mess around with parameters and the configuration as much as you want, and you can be sure that all of that is strictly isolated from the live system. Everything's separated. And when you use this, you get some really nice output through a Data Collector table. You can see some example output here. It runs the same logic as the real LDAPLink and provides detailed information about what would happen. You can check the documentation for specifics. All right, so we've connected to the database, we know who we are, but now, what can we do? For any given action, you want to control who can do that, right? So what's the question you have to ask? Sometimes the question is just: who are you? It's a simple yes-or-no question. For example, if I want to upgrade a user, the question I have to ask is, am I the superuser? If I'm the superuser, I can do it; if I'm not, I can't. But sometimes the actions are more complex, and the question you have to ask is more complex: does the principal have the required privileges?
If you're familiar with SQL privileges, there are things like SELECT, INSERT, and Vertica has a few of its own, but the key thing here is that an action can require specific, and maybe even multiple, privileges on multiple objects. For example, when selecting from a table, you need USAGE on the schema and SELECT on the table. And there are some other examples here. So where do these privileges come from? Well, if the action requires a privilege, these are the only places privileges can come from. The first source is implicit privileges, which could come from owning the object or from special roles, which we'll talk about in a sec. Explicit privileges are basically the SQL-standard GRANT system: you can grant privileges to users or roles, and optionally, those users and roles could grant them downstream. Discretionary access control. Those are explicit, and they come from the user and the active roles, so the whole identity set. And then we've got Vertica-specific inherited privileges, and those come from the schema; we'll talk about that in a sec as well. So these are the special roles in Vertica. First role: DBADMIN. This isn't the dbadmin user, it's a role, and it has specific elevated privileges. You can check the documentation for the exact privileges, but it's less than the superuser. The PSEUDOSUPERUSER can do anything the real superuser can do, and you can grant this role to whomever. The DBDUSER is also a role; it can run Database Designer functions. SYSMONITOR gives you some elevated auditing permissions, and we'll talk about that later as well. And finally, PUBLIC is a role that everyone has all the time, so anything you want to be allowed for everyone, attach to PUBLIC. Now, imagine this scenario. I've got a really big schema with lots of relations, and those relations might be changing all the time. But for each principal that uses this schema, I want the privileges for all the tables and views there to be roughly the same. Even though the tables and views come and go, an analyst, for example, might need full access to all of them, no matter how many there are or what they are at any given time. So to manage this, the first approach I could use is to remember to run grants every time a new table or view is created. And not just me, but everyone using this schema. Not only is that a pain, it's hard to enforce. The second approach is to use schema-inherited privileges. In Vertica, schema grants can include relational privileges, for example SELECT or INSERT, which normally don't mean anything for a schema, but they do for a table. If a relation is marked as inheriting, then the schema grants to a principal, for example salespeople, also apply to the relation. And you can see on the diagram here how the USAGE applies to the schema, and technically the SELECT as well, but in the Sales.foo table, SELECT also applies. So now, instead of lots of GRANT statements for multiple object owners, we only have to run one ALTER SCHEMA statement and three GRANT statements, and from then on, any time you grant or revoke privileges on the schema, to or from a principal, all your new tables and views get them automatically. It's dynamically calculated.
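A sketch of that setup, one ALTER plus three GRANTs; schema and role names are examples:

    ALTER SCHEMA sales DEFAULT INCLUDE SCHEMA PRIVILEGES;

    GRANT USAGE  ON SCHEMA sales TO salespeople;
    GRANT SELECT ON SCHEMA sales TO salespeople;
    GRANT INSERT ON SCHEMA sales TO loader_svc;

    -- From now on, new tables and views in sales inherit these grants
    -- automatically (inherited privileges must be enabled in your database).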
Now of course, having set this up securely, you want to know what's happening and what's going on. To monitor the privileges, there are three system tables you want to look at. The first is grants, which will show you privileges that are active for you; that is, your user and active roles, and theirs, and so on down the chain. Grants will show you the explicit privileges, and inherited_privileges will show you the inherited ones. And then there's one more, inheriting_objects, which will show all tables and views that inherit privileges; that's useful not so much for seeing the privileges themselves, but for managing inherited privileges in general. And finally, how do you see all privileges from all these sources in one go? You want to see them together, right? Well, there's a metafunction added in 9.3.1, get_privileges_description, which, given an object, will sum up all the privileges for the current user on that object. I'll refer you to the documentation for usage and supported types. Now, the problem with SELECT: SELECT lets you see everything or nothing. You can either read the table or you can't. But what if you want some principals to see a subset, or a transformed version, of the data? For example, I have a table with personnel data, and different principals, as you can see here, need different access levels to sensitive information: social security numbers. One thing I could do is make a view for each principal. But I could also use access policies, and access policies can do this without introducing any new objects or dependencies. It centralizes your restriction logic and makes it easier to manage. So what do access policies do? Well, we've got row and column access policies. Row access policies will hide rows, and column access policies will transform the data in the column, depending on who's doing the SELECTing. So it transforms the data, as we saw on the previous slide, to look as requested. Now, if access policies let you see the raw data, you can still modify the data. And the implication of this is that when you're crafting access policies, you should only use them to refine access for principals that need read-only access. That is, if you want a principal to be able to modify the data, the access policies you craft should let through the raw data for that principal. So in our previous example, the loader service should be able to see every row, and it should be able to see untransformed data in every column, and as long as that's true, it can continue to load into this table. All of this is of course monitorable via a system table, in this case access_policy. Check the docs for more information on how to implement these.
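A sketch of a column access policy for that personnel example; table, column, and role names are examples:

    CREATE ACCESS POLICY ON public.personnel
    FOR COLUMN ssn
    CASE
        WHEN ENABLED_ROLE('hr')      THEN ssn                 -- raw data
        WHEN ENABLED_ROLE('analyst') THEN SUBSTR(ssn, 8, 4)   -- last four only
        ELSE NULL                                             -- everyone else
    END
    ENABLE;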
All right, that's it for access control. Now on to delegation and impersonation. So what's the question here? Well, the question is: who is Vertica? And that might seem like a silly question, but here's what I mean by that. When Vertica's connecting to a downstream service, for example cloud storage, how should Vertica identify itself? Well, most of the time, we do the permissions check ourselves and then we connect as Vertica, like in this diagram here. But sometimes we can do better, and instead of connecting as Vertica, we connect with some kind of upstream user identity. And when we do that, we let the service decide who can do what, so Vertica isn't the only line of defense. And in addition to the defense-in-depth benefit, there are also benefits for auditing, because the external system can see who is really doing something. It's no longer just Vertica showing up in that external service's logs, it's somebody like Alice or Bob trying to do something. One system where this comes into play is Voltage SecureData. So, let's look at a couple of use cases. The first one: I'm just encrypting for compliance or anti-theft reasons. In this case, I'll just use one global identity to encrypt or decrypt with Voltage. But imagine another use case: I want to control which users can decrypt which data. Now I'm using Voltage for access control, so in this case, we want to delegate. The solution here is, on the Voltage side, give Voltage users access to appropriate identities, and these identities control encryption for sets of data. A Voltage user can access multiple identities, like groups. Then on the Vertica side, a Vertica user can set their Voltage username and password in a session, and Vertica will talk to Voltage as that Voltage user. So in the diagram here, you can see an example of how this is leveraged so that Alice can decrypt something but Bob cannot. Another place the delegation paradigm shows up is storage. Vertica can store and interact with data on non-local file systems, for example HDFS or S3. Sometimes Vertica's storing Vertica-managed data there; for example, in Eon mode, you might store your projections in communal storage in S3. But sometimes Vertica is interacting with external data. This usually maps to a user storage location on the Vertica side, and on the external storage side it might be something like Parquet files on Hadoop. In that case, it's not really Vertica's data, and we don't want to give Vertica more power than it needs, so let's request the data on behalf of who needs it. Let's say I'm an analyst and I want to copy from or export to Parquet, using my own bucket. It's not Vertica's bucket, it's my data, but I want Vertica to manipulate data in it. The first option I have is to give Vertica as a whole access to the bucket, and that's problematic, because in that case Vertica becomes kind of an AWS god: it can see any bucket that any Vertica user might want to push or pull data to or from, any time Vertica wants. So it's not good for the principles of least access and zero trust, and we can do better than that. So in the second option, use an ID and secret key pair for an AWS IAM principal (if you're familiar) that does have access to the bucket. I might use my own analyst credentials, or I might use credentials for an AWS role that has even fewer privileges than I do, sort of a restricted subset of my privileges. I set it in Vertica at the session level, and Vertica will use those credentials for the copy and export commands. That gives more isolation. Something that's in the works is support for keyless delegation, using assumable IAM roles; similar benefits to option two here, but also not having to manage keys at the user level. We can do basically the same thing with Hadoop and HDFS, with three different methods. The first option is Kerberos delegation. I think it's the most secure. If access control is your primary concern here, this will give you the tightest access control. The downside is that it requires the most configuration outside of Vertica, with Kerberos and HDFS, but with this, you can really determine which Vertica users can talk to which HDFS locations. Then you've got secure impersonation. If you've got a highly trusted Vertica userbase, or at least some subset of it is, and you're not worried about them doing things wrong, but you want auditing on the HDFS side, that being your primary concern, you can use this option. This diagram here gives you a visual overview of how that works. But I'll refer you to the docs for details.
We can do basically the same thing with Hadoop and HDFS, with three different methods. The first option is Kerberos delegation. I think it's the most secure. If access control is your primary concern here, this will definitely give you the tightest access control. The downside is that it requires the most configuration outside of Vertica, with Kerberos and HDFS, but with this, you can really determine which Vertica users can talk to which HDFS locations. Then you've got secure impersonation. If you've got a highly trusted Vertica user base, or at least some subset of it is, and you're not worried about them doing things wrong, but you want auditing on the HDFS side, and that's your primary concern, you can use this option. This diagram here gives you a visual overview of how that works, but I'll refer you to the docs for details. And then finally, option three is bringing your own delegation token. It's similar to what we do with AWS: we set something at the session level, so it's very flexible. The user can do it on an ad hoc basis, but it is manual. So that's the third option. Now on to auditing and monitoring. So of course, we want to know: what's happening in our database? It's important in general, and important for incident response, of course. So your first stop to answer this question should be system tables. They're a collection of information about events, system state, performance, et cetera. They're SELECT-only tables, but they work in queries as usual; the data is just loaded differently. There are two types, generally. There are the metadata tables, which reflect persistent information stored in the catalog, for example users or schemata. Then there are monitoring tables, which reflect more transient information, like events and system resources. Here you can see example output from the resource pools system table. Despite looking like system statistics, these are actually configurable parameters for resource pools, a way to handle users' and various principals' resource allocation. If you're interested in resource pools, again, check that out in the docs. Then of course there's the follow-up question: who can see all of this? Well, some system information is sensitive, and we should only show it to those who need it. Principle of least privilege, right? So of course the superuser can see everything, but what about non-superusers? How do we give access to people that might need additional information about the system, without giving them too much power? One option is SYSMONITOR. As I mentioned before, it's a special role, and this role can always read system tables, but not change things like a superuser would be able to. Just reading. Another option is the RESTRICT and RELEASE metafunctions. Those grant and revoke access to a certain preset of system tables, to and from the PUBLIC role. But the downside of those approaches is that they're inflexible. They're all or nothing, for a specific preset of tables, and you can't really configure them per table. So if you're willing to do a little more setup, then I'd recommend using your own grants and roles. System tables support GRANT and REVOKE statements just like any regular relations. And in that case, I wouldn't even bother with SYSMONITOR or the metafunctions. So to do this, just grant whatever privileges you see fit to roles that you create, then go ahead and grant those roles to the users that you want, and revoke access to the system tables of your choice from PUBLIC. If you need even finer-grained access than this, you can create views on top of system tables. For example, you can create a view on top of the users system table which only shows the current user's information, using a built-in function as part of the view definition. Then you can actually grant this view to PUBLIC, so that each user in Vertica can see their own user information, without ever giving access to the users system table as a whole, just that view.
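Here's what that looks like as a sketch, using the users system table described above; the view name is invented, and as always, test the grants against your own role setup.

```python
import vertica_python

conn_info = {'host': 'vertica.example.com', 'port': 5433,
             'user': 'dbadmin', 'password': '...', 'database': 'vmart'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Hide the full system table from everyone...
    cur.execute("REVOKE SELECT ON v_catalog.users FROM PUBLIC;")
    # ...then expose a view that shows each user only their own row.
    cur.execute("""
        CREATE VIEW my_user_info AS
        SELECT * FROM v_catalog.users
        WHERE user_name = CURRENT_USER;
    """)
    cur.execute("GRANT SELECT ON my_user_info TO PUBLIC;")
```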
Now, if you're a superuser, or if you have direct access to nodes in the cluster, the filesystem, the OS, et cetera, then you have more ways to see events. Vertica supports various methods of logging. You can see a few methods here which are generally outside of running Vertica; you'd interact with them in a different way, with the exception of active_events, which is a system table. We've also got the Data Collector, which sorts events by subject. What the Data Collector does is extend the logging and system table functionality, organized by component, as it's called in the documentation, and it logs these events and information to rotating files. For example, AnalyzeStatistics is a function that users might call, and as a database administrator, you might want to monitor that, so you can use the Data Collector component for AnalyzeStatistics. The files these create can be exported into a monitoring database. One example of that is the Management Console's Extended Monitoring, so check out the virtual BDC talk on the Management Console.
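As a sketch, this is roughly how you'd check what the Data Collector is tracking for a component like AnalyzeStatistics; the exact column set of v_monitor.data_collector varies by version, so treat the selected columns as illustrative.

```python
import vertica_python

conn_info = {'host': 'vertica.example.com', 'port': 5433,
             'user': 'dbadmin', 'password': '...', 'database': 'vmart'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Which Data Collector components relate to statistics gathering,
    # and which DC tables back them?
    cur.execute("""
        SELECT component, table_name, description
        FROM v_monitor.data_collector
        WHERE component ILIKE '%statistic%';
    """)
    for row in cur.fetchall():
        print(row)
```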
And that's it for the key points of security in Vertica. Many of these slides could spawn a talk of their own, so we encourage you to check out our blog, the documentation, and the forum for further investigation and collaboration. Hopefully the information we provided today will inform your choices in securing your deployment of Vertica. Thanks for your time today. That concludes our presentation. Now, we're ready for Q&A.

Published Date : Mar 30 2020


Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives


 

>> Sue: Hello everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Tom Wall, a member of the Vertica engineering team. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't get to, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So let's get started. Tom, over to you. >> Tom: Hello everyone and thanks for joining us today for this talk. My name is Tom Wall and I am the leader of Vertica's ecosystem engineering team. We are the team that focuses on building out all the developer tools and third party integrations that enable the software ecosystem surrounding Vertica to thrive. So today, we'll be talking about some of our new open source initiatives and how those can be really effective for you and make things easier for you to build and integrate Vertica with the rest of your technology stack. We've got several new libraries, integration projects and examples, all open source, to share, all being built out in the open on our GitHub page. Whether you use these open source projects or not, this is a very exciting new effort that will really help to grow the developer community and enable lots of exciting new use cases. So, every developer out there has probably had to deal with a problem like this. You have some business requirements, to maybe build some new Vertica-powered application. Maybe you have to build some new system to visualize some data that's managed by Vertica. In various circumstances, lots of choices might be made for you that constrain your approach to solving a particular problem. These requirements can come from all different places. Maybe your solution has to work with a specific visualization tool, or web framework, because the business has already invested in the licensing and the tooling to use it. Maybe it has to be implemented in a specific programming language, since that's what all the developers on the team know how to write code with. While Vertica has many different integrations with lots of different programming languages and systems, there are a lot of them out there, and we don't have integrations for all of them. So how do you make ends meet when you don't have all the tools you need? You have to get creative, using tools like PyODBC, for example, to bridge between programming languages and frameworks to solve the problems you need to solve. Most languages do have an ODBC-based database interface. ODBC is a C library, and most programming languages know how to call C code, somehow.
So that's doable, but it often requires lots of configuration and troubleshooting to make all those moving parts work well together. So that's enough to get the job done, but native integrations are usually a lot smoother and easier. So rather than, for example, fighting with PyODBC in Python, configuring things, getting Unicode working, and compiling all the different pieces the right way to make it all work smoothly, it would be much better if you could just pip install a library and get to work. And with Vertica-Python, a new Python client library, you can actually do that. So that story, I assume, probably sounds pretty familiar to a lot of the audience here, because we're all using Vertica. And our challenge as Big Data practitioners is to make sense of all this stuff, despite those technical and non-technical hurdles. Vertica powers lots of different businesses and use cases across all kinds of different industries and verticals. While there's a lot different about us, we're all here together right now for this talk because we do have some things in common. We're all using Vertica, and we're probably also using Vertica with other systems and tools too, because it's important to use the right tool for the right job. That's a founding principle of Vertica and it's true today too. In this constantly changing technology landscape, we need lots of good tools and well established patterns, approaches, and advice on how to combine them so that we can be successful doing our jobs. Luckily for us, Vertica has been designed to be easy to build with and extend in this fashion. Databases as a whole have had this goal from the very beginning. They solve the hard problems of managing data so that you don't have to worry about it. Instead of worrying about those hard problems, you can focus on what matters most to you and your domain: implementing that business logic, solving that problem, without having to worry about the sometimes intense details of what it takes to manage a database at scale. With the declarative syntax of SQL, you tell Vertica what the answer is that you want. You don't tell Vertica how to get it. Vertica will figure out the right way to do it for you, so that you don't have to worry about it. So this SQL abstraction is very nice because it's a well defined boundary, where lots of developers know SQL, and it allows you to express what you need without having to worry about those details. So we can be the experts in data management while you worry about your problems. This goes beyond, though, what's accessible through SQL to Vertica. We've got well defined extension and integration points across the product that allow you to customize this experience even further. So if you want to do things like write your own SQL functions, or extend database operators with UDXs, you can do so. If you have a custom data format, maybe a proprietary format or some source system that Vertica doesn't natively support, we have extension points that allow you to use those, to make it very easy to do massive, parallel data movement, loading into Vertica but also exporting from Vertica to send data to other systems. And with these new features in time, we can also do the same kinds of things with Machine Learning models, importing and exporting to tools like TensorFlow.
And it's these integration points that have enabled Vertica to build out this open architecture and a rich ecosystem of tools, both open source and closed source, of different varieties, that solve all the different problems that are common in this big data processing world. Whether it's open source streaming systems like Kafka or Spark, or more traditional ETL tools on the loading side, but also BI tools and visualizers and things like that to view and use the data that you keep in your database, on the right side. And then of course, Vertica needs to be flexible enough to be able to run anywhere. So you can really take Vertica and use it the way you want, to solve the problems that you need to solve. So Vertica has always employed open standards, and integrated with all kinds of different open source systems. What we're really excited to talk about now is that we are taking our new integration projects and making those open source too. In particular, we've got two new open source client libraries that allow you to build Vertica applications for Python and Go. These libraries act as a foundation for all kinds of interesting applications and tools. Upon those libraries, we've also built some integrations ourselves, and we're using these new libraries to power some new integrations with some third party products. Finally, we've got lots of new examples and reference implementations out on our GitHub page that can show you how to combine all these moving parts in exciting ways to solve new problems. And the code for all these things is available now on our GitHub page. And so you can use it however you like, and even help us make it better too. So the first such project that we have is called Vertica-Python. Vertica-Python began at our customer, Uber. And then in late 2018, we collaborated with them, took it over, and made Vertica-Python the first official open source client for Vertica. You can use this to build your own Python applications, or you can use it via tools that were written in Python. Python has grown a lot in recent years, and it's a very common language to solve lots of different problems and use cases in the Big Data space, from things like DevOps administration and Data Science or Machine Learning, to just homegrown applications. We use Python a lot internally for our own QA testing and automation needs. And with the Python 2 end of life that happened at the end of 2019, it was important that we had a robust Python solution to help migrate our internal stuff off of Python 2, and also to provide a nice migration path for all of you, our users, that might be worried about the same problems with your own Python code. So Vertica-Python is used already for lots of different tools, including Vertica's admintools, starting with 9.3.1. It was also used by DataDog to build a Vertica-DataDog integration that allows you to monitor your Vertica infrastructure within DataDog. So here's a little example of how you might use the Python client to do some work. Here we open a connection, we run a query to find out what node we've connected to, and then we do a little data load by running a COPY statement. And this is designed to have a familiar look and feel if you've ever used a Python database client before. So we implement the DB API 2.0 standard and it feels like a Python package. That includes things like being part of the centralized package manager, so you can just pip install this right now and start using it.
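That slide example, reconstructed as a runnable sketch; the host, credentials, file, and table names are placeholders:

```python
import vertica_python

conn_info = {'host': 'vertica.example.com', 'port': 5433,
             'user': 'dbadmin', 'password': '...', 'database': 'vmart'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Which node did this client land on?
    cur.execute("SELECT node_name FROM v_monitor.current_session;")
    print(cur.fetchone()[0])

    # A small data load via a COPY statement, streamed from a local file.
    with open('events.csv', 'rb') as f:
        cur.copy("COPY events FROM STDIN DELIMITER ','", f)
    conn.commit()
```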
We also have our client for Golang, called vertica-sql-go. And this is a very similar story, just in a different context, for a different programming language. So vertica-sql-go began as a collaboration with the Micro Focus SecOps Group, who builds Micro Focus's security products, some of which use Vertica internally to provide some of those analytics. So you can use this to build your own apps in the Go programming language, but you can also use it via tools that are written in Go. Most notably, we have our Grafana integration, which we'll talk a little bit more about later, that leverages this new client to provide Grafana visualizations for Vertica data. And Go is another programming language rising in popularity, because it offers an interesting balance of different programming design trade-offs. So it's got good performance, good concurrency, and memory safety. And we liked all those things, and we're using it to power some internal monitoring stuff of our own. And here's an example of the code you can write with this client. So this is Go code that does a similar thing. It opens a connection, it runs a little test query, and then it iterates over those rows, processing them using Go data types. You get that native look and feel just like you do in Python, except this time in the Go language. And you can go get it, the way you usually package things with Go, by running the command there to acquire this package. And it's important to note here, for these projects, we're really doing open source development. We're not just putting code out on our GitHub page. So if you go out there and look, you can see that you can ask questions, you can report bugs, you can submit pull requests yourselves, and you can collaborate directly with our engineering team and the other Vertica users out on our GitHub page. Because it's out on our GitHub page, it allows us to be a little bit faster with the way we ship and deliver functionality, compared to the core Vertica release cycle. So in 2019, for example, as we were building features to prepare for the Python 3 migration, we shipped 11 different releases with 40 customer reported issues filed on GitHub. That was done over 78 different pull requests, and with lots of community engagement as we did so. So lots of people are using this already; as our GitHub badges show, there are about 5,000 downloads a day of people using it in their software. And again, we want to make this easy, not just to use, but also to contribute to, understand, and collaborate with us on. So all these projects are built using the Apache 2.0 license. The master branch is always available and stable, with the latest functionality. And you can always build it and test it the way we do, so that it's easy for you to understand how it works and to submit contributions or bug fixes or even features. It uses automated testing, both locally and with pull requests. And for vertica-python, it's fully automated with Travis CI. So we're really excited about doing this and we're really excited about where it can go in the future, because this offers some exciting opportunities for us to collaborate with you more directly than we ever have before. You can contribute improvements and help us guide the direction of these projects, but you can also work with each other to share knowledge and implementation details and various best practices. And so maybe you think, "Well, I don't use Python, I don't use Go, so maybe it doesn't matter to me." But I would argue it really does matter.
Because even if you don't use these tools and languages, there are lots of amazing Vertica developers out there who do. And these clients do act as low level building blocks for all kinds of different interesting tools, both in the Python and Go worlds, but also well beyond that, because these implementations and examples really generalize to lots of different use cases. And we're going to do a deeper dive now into some of these, to understand exactly how that's the case and what you can do with these things. So let's take a deeper look at some of the details of what it takes to build one of these open source client libraries. So these database client interfaces, what are they exactly? Well, we all know SQL, but if you look at what SQL specifies, it really only talks about how to manipulate the data within the database. So once you're connected and in, you can run commands with SQL. But these database client interfaces address the rest of those needs. So what does the programmer need to do to actually process those SQL queries? These interfaces are specific to a particular language or technology stack, but the use cases, architectures, and design patterns are largely the same between different languages. They all have a need to do some networking, and to connect and authenticate and create a session. They all need to be able to run queries, load some data, and deal with problems and errors. And then they also have a lot of metadata and type mapping, because you want to use these clients the way you use those programming languages, which might be different than the way that Vertica's data types and Vertica's semantics work. So some of these client interfaces are truly standards. And they are robust enough, in terms of what they define and call for, to support a truly pluggable driver model, where you might write an application that codes directly against the standard interface, and you can then plug in a different database driver, like a JDBC driver, to have that application work with any database that has a JDBC driver. Most of these interfaces aren't as robust as JDBC or ODBC, but that's okay. Because as good as a standard is, every database is unique for a reason, and so you can't really expose all of those unique properties of a database through these standard interfaces. So Vertica is unique in that it can scale to the petabytes and beyond, and you can run it anywhere, in any environment, whether it's on-prem or on clouds. So surely there's something about Vertica that's unique, and we want to be able to take advantage of that fact in our solutions. So even though these standards might not cover everything, there's often a need, and common patterns arise to solve these problems in similar ways. When there isn't enough of a standard to define those common semantics that different databases might share, what you often see is tools inventing plugin layers or glue code to compensate, by defining application-wide standards to cover some of these same semantics. Later on, we'll get into some of those details and show off what exactly that means. So if you connect to a Vertica database, what's actually happening under the covers? You have an application, you have a need to run some queries, so what does that actually look like? Well, probably as you would imagine, your application is going to invoke some API calls in some client library or tool.
This library takes those API calls and implements them, usually by issuing some networking protocol operations, communicating over the network to ask Vertica to do the heavy lifting required for that particular API call. And these APIs usually do the same kinds of things, although some of the details might differ between these different interfaces. But you do things like establish a connection, run a query, iterate over your rows, manage your transactions, that sort of thing. Here's an example from vertica-python, which just goes into some of the details of what actually happens during the Connect API call. And you can see all these details in our GitHub implementation of this. There are actually a lot of moving parts in what happens during a connection. So let's walk through some of that and see what actually goes on. I might have my API call like this, where I say connect and I give it a DNS name, which is my entire cluster, and I give it my connection details, my username and password. And I tell the Python client to get me a session, give me a connection so I can start doing some work. Well, in order to implement this, what needs to happen? First, we need to do some TCP networking to establish our connection. So we need to understand what the request is, where you're going to connect to and why, by parsing the connection string. And Vertica being a distributed system, we want to provide high availability, so we might need to do some DNS lookups to resolve that DNS name, which might be an entire cluster and not just a single machine, so that you don't have to change your connection string every time you add or remove nodes from the database. So we do some high availability and DNS lookup stuff. And then once we connect, we might do load balancing too, to balance the connections across the different initiator nodes in the cluster, or in a subcluster, as needed. Once we land on the node we want to be at, we might do some TLS to secure our connections. And Vertica supports the industry standard TLS protocols, so this looks pretty familiar for everyone who's used TLS anywhere before. So you're going to do a certificate exchange, and the client might send the server a certificate too, and then you're going to verify that the server is who it says it is, so that you can know that you trust it. Once you've established that connection and secured it, then you can actually begin to request a session within Vertica. So you're going to send over your user information, like, "Here's my username, here's the database I want to connect to." You might send some information about your application, like a session label, so that you can differentiate on the database, with monitoring queries, what the different connections are and what their purpose is. And then you might also send over some session settings, to do things like autocommit, to change the state of your session for the duration of this connection, so that you don't have to remember to do that with every query that you run. Once you've asked Vertica for a session, before Vertica will give you one, it has to authenticate you. And Vertica has lots of different authentication mechanisms, so there's a negotiation that happens there to decide how to authenticate you. Vertica decides based on who you are and where you're coming from on the network, and then you'll do an auth-specific exchange, depending on what the auth mechanism calls for, until you are authenticated. Finally, Vertica trusts you and lets you in, so you're going to establish a session in Vertica, and you might do some note keeping on the client side just to know what happened. So you might log some information, you might record what the version of the database is, you might do some protocol feature negotiation. So if you connect to a version of the database that doesn't support all these protocols, you might decide to turn some functionality off, and that sort of thing. But finally, after all that, you can return from this API call and then your connection is good to go.
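Most of those handshake steps surface as simple connection options in vertica-python. This is a sketch; the option names follow vertica-python's connect() API as I understand it, and the values are placeholders:

```python
import vertica_python

conn_info = {
    'host': 'vertica-cluster.example.com',  # DNS name that may resolve to many nodes
    'port': 5433,
    'user': 'alice',
    'password': '...',
    'database': 'vmart',
    'connection_load_balance': True,  # let the server pick an initiator node
    'ssl': True,                      # TLS on the client-server channel
    'session_label': 'nightly-etl',   # visible in session monitoring tables
    'autocommit': True,               # a session setting sent at startup
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # The client records the server version as part of its note keeping.
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
```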
So that connection is just one example of many different APIs. And we're excited here, because with vertica-python we're really opening up the Vertica client wire protocol for the first time. And so if you're a low level Vertica developer and you've used Postgres before, you might know that some of Vertica's client protocol is derived from Postgres. But they do differ in many significant ways. And this is the first time we've ever revealed those details about how it works and why. So not all Postgres protocol features work with Vertica, because Vertica doesn't support all the features that Postgres does. Postgres, for example, has a large object interface that allows you to stream very wide data values over, whereas Vertica doesn't really have very wide data values; you have long VARCHARs, but that's about as wide as you can get. Similarly, the Vertica protocol supports lots of features not present in Postgres. So load balancing, for example, which we just went through an example of: Postgres is a single node system, so it doesn't really make sense for Postgres to have load balancing. But load balancing is really important for Vertica, because it is a distributed system. Vertica-python serves as an open reference implementation of this protocol, with all kinds of new details and extension points that we haven't revealed before. So if you look at these boxes below, all these different things are new protocol features that we've implemented since August 2019, out in the open on our GitHub page for Python. Now, the vertica-sql-go implementation of these things is still in progress, but the core protocols are there for basic query operations. There's more to do there, but we'll get there soon. So this is really cool, because not only do you have a Python client implementation and a Go client implementation of this, but you can use this protocol reference to do lots of other things too. The obvious thing you could do is build more clients for other languages. So if you have a need for a client in some other language that Vertica doesn't support yet, now you have everything available to solve that problem and to go about doing so, if you need to. But beyond clients, it's also useful for other things. So you might use it for mocking and testing: rather than connecting to a real Vertica database, you can simulate some of that. You can also use it to do things like query routing and proxies. So Uber, for example: this blog here in this link tells a great story of how they route different queries to different Vertica clusters by intercepting these protocol messages, parsing the queries in them, and deciding which clusters to send them to. So a lot of these things are just ideas today, but now that you have the source code, there's no limit in sight to what you can do with this thing.
And so we're very interested in hearing your ideas and requests, and we're happy to offer advice and collaborate on building some of these things together. So let's take a look now at some of the things we've already built that do these things. So here's a picture of Vertica's Grafana connector, with some data powered from an example that we have in this blog link here. So this has an internet of things use case to it, where we have lots of different sensors recording flight data, feeding into Kafka, which then gets loaded into Vertica. And then finally, it gets visualized nicely here with Grafana. And Grafana's visualizations make it really easy to analyze the data with your eyes and see when something happens. So in these highlighted sections here, you notice a drop in some of the activity; that's probably a problem worth looking into. It might be a lot harder to see that just by staring at a large table yourself. So how does a picture like that get generated with a tool like Grafana? Well, Grafana specializes in visualizing time series data. And time can be really tricky for computers to handle correctly. You've got time zones, daylight savings, leap seconds, negative infinity timestamps (please don't ever use those). And if it wasn't hard enough just with those problems, what makes it harder is that every system does it slightly differently. So if you're querying some time data, how do we deal with these semantic differences as we cross these domain boundaries, from Vertica to Grafana's back end architecture, which is implemented in Go, and on to its front end, which is implemented with JavaScript? Well, you read this from the bottom up in terms of the processing. First, you select the timestamp, and Vertica's timestamp has to be converted to a Go time object. And we have to reconcile the differences that there might be as we translate it. So Go time has a different time zone specifier format, and it also supports nanosecond precision, while Vertica only supports microsecond precision. So that's not too big of a deal when you're querying data, because you just see some extra zeros in the fractional seconds. But on the way in, if we're loading data, we have to find a way to resolve those things. Once it's in the Go process, it has to be converted further to render in the JavaScript UI. So there, the Go time object has to be converted to a JavaScript AngularJS Date object, and there too we have to reconcile those differences. So a lot of these differences might just be presentation, and not so much the actual data changing, but you might want to choose to render the date in a more human readable format, like we've done in this example here. Here's another picture. This is another picture of some time series data, and this one shows that you can actually write your own queries with Grafana to provide answers. So if you look closely here, you can see there are actually some functions that might not look too familiar to you, if you know Vertica's functions. Vertica doesn't have a dollar-underscore-underscore time function or a time filter function. So what's actually happening there? How does this actually provide an answer if it's not really real Vertica syntax? Well, it's not sufficient to just know how to manipulate data; it's also really important that you know how to operate with metadata, that is, information about how the data works in the data source, Vertica in this case.
So Grafana needs to know how time works in detail for each data source, beyond doing the basic I/O that we just saw in the previous example. It needs to know: how do you connect to the data source to get some time data? How do you know what time data types and functions there are, and how they behave? How do you generate a query that references a time literal? And finally, once you've figured out how to do all that, how do you find the time in the database? How do you know which tables have time columns that might be worth rendering in this kind of UI? So Go's database standard doesn't actually offer many metadata interfaces. Nevertheless, Grafana needs to know those answers. And so it has its own plugin layer that provides a standardizing layer, whereby every data source can implement the hints and metadata customization needed to have an extensible data source back end. So we have another open source project, the Vertica-Grafana data source, which is a plugin that uses Grafana's extension points, with JavaScript for the front end plugin and Go for the back end plugin, to provide Vertica connectivity inside Grafana. So the way this works is that the plugin framework defines those standardizing functions, like time and timeFilter, and it's our plugin that rewrites them in terms of Vertica syntax. So in this example, time gets rewritten to a Vertica cast, and timeFilter becomes a BETWEEN predicate. So that's one example of how you can use Grafana, but also how you might build any arbitrary visualization tool that works with data in Vertica.
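The real plugin does this rewriting in Go, inside Grafana's plugin framework; this little Python sketch only mirrors the described behavior, with invented macro and column names, to show the idea of macro expansion:

```python
import re

def expand_macros(sql: str, start: str, end: str) -> str:
    # $__time(col) becomes a cast aliased as "time"...
    sql = re.sub(r'\$__time\((\w+)\)', r'\1::TIMESTAMP AS time', sql)
    # ...and $__timeFilter(col) becomes a BETWEEN predicate over the
    # dashboard's current time range.
    sql = re.sub(r'\$__timeFilter\((\w+)\)',
                 rf"\1 BETWEEN '{start}' AND '{end}'", sql)
    return sql

q = "SELECT $__time(ts), avg(speed) FROM flights WHERE $__timeFilter(ts) GROUP BY 1"
print(expand_macros(q, '2020-03-01 00:00:00', '2020-03-02 00:00:00'))
```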
So let's now look at some other examples and reference architectures that we have out on our GitHub page. For some advanced integrations, there's clearly a need to go beyond these standards. So SQL and the surrounding standards, like JDBC and ODBC, were really critical in the early days of Vertica, because they really enabled a lot of generic database tools. And those will always continue to play a really important role, but the Big Data technology space moves a lot faster than these old database standards can keep up with. So there are all kinds of new advanced analytics and query pushdown logic that were never possible 10 or 20 years ago, that Vertica can do natively. There are also all kinds of data-oriented application workflows doing things like streaming data, or parallel loading, or Machine Learning. And all of these things we need to build software with, but we don't really have standards to go by. So what do we do there? Well, open source implementations make for easier integrations and applications all over the place. So even if you're not using Grafana, for example, other tools have similar challenges that you need to overcome, and it helps to have an example there to show you how to do it. Take Machine Learning, for example. There have been many excellent Machine Learning tools that have arisen over the years to make data science and the task of Machine Learning a lot easier. And a lot of those have basic database connectivity, but they generally only treat the database as a source of data. So they do lots of data I/O to extract data from a database like Vertica, for processing in some other engine. We all know that's not the most efficient way to do it. It's much better if you can leverage Vertica's scale and bring the processing to the data. So a lot of these tools don't take full advantage of Vertica, because there's not really a uniform way to do so with these standards. So instead, we have a project called vertica-ml-python, and this serves as a reference architecture of how you can do scalable machine learning with Vertica. So this project establishes a familiar machine learning workflow that scales with Vertica. It feels similar to a scikit-learn project, except all the processing and aggregation and heavy lifting happen in Vertica. So this makes for a much more lightweight, scalable approach than you might otherwise be used to. So with vertica-ml-python, you can probably use this yourself. But you can also see how it works, so if it doesn't meet all your needs, you can still see the code and customize it to build your own approach. We've also got lots of examples of our UDX framework. This is an older GitHub project; we've actually had it for a couple of years, but it is really useful and important, so I wanted to plug it here. With our User Defined eXtensions framework, or UDXs, you can extend the operators that Vertica executes when it does a database load or a database query. So with UDXs, you can write your own domain logic in C++, Java, Python, or R, and you can call it within the context of a SQL query. And Vertica brings your logic to the data, and makes it fast and scalable and fault tolerant and correct for you, so you don't have to worry about all those hard problems. So our UDX examples demonstrate how you can use our SDK to solve interesting problems. And some of these examples are complete, usable packages or libraries. So for example, we have a curl source that allows you to extract data from any curlable endpoint and load it into Vertica. We've got things like an ODBC connector that allows you to access data in an external database via an ODBC driver within the context of a Vertica query, and all kinds of parsers and string processors and things like that. We also have more exciting and interesting things, where you might not really think of Vertica being able to do that, like a heat map generator, which takes some XY coordinates and renders them on top of an image to show you the hotspots in it. The image on the right was actually generated from one of our intern gaming sessions a few years back. So all these things are great examples that show you not just how you can solve problems, but also how you can use this SDK to solve neat things that maybe no one else has to solve, or that are unique to your business and your needs.
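For a flavor of the SDK, here's a minimal scalar-function sketch modeled on the documented Python UDx examples; the class and method names follow the vertica_sdk API as documented, but verify them against the UDx docs before relying on this:

```python
import vertica_sdk

class add_one(vertica_sdk.ScalarFunction):
    """Add one to an integer column, running inside Vertica itself."""
    def processBlock(self, server_interface, block_reader, block_writer):
        # Process the whole block of rows Vertica hands us.
        while True:
            block_writer.setInt(block_reader.getInt(0) + 1)
            block_writer.next()
            if not block_reader.next():
                break

class add_one_factory(vertica_sdk.ScalarFunctionFactory):
    def createScalarFunction(self, srv):
        return add_one()
    def getPrototype(self, srv_interface, arg_types, return_type):
        arg_types.addInt()
        return_type.addInt()
    def getReturnType(self, srv_interface, arg_types, return_type):
        return_type.addInt()
```

You would then register it with CREATE LIBRARY and CREATE FUNCTION and call it from SQL like any built-in, which is exactly the bring-the-logic-to-the-data point above.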
Another exciting benefit is with testing. So the test automation strategy that we have in vertica-python and these clients really generalizes well beyond the needs of a database client. Anyone that's ever built a Vertica integration or application probably has a need to write some integration tests. And that can be hard to do with all the moving parts in a big data solution. But with our code being open source, you can see in vertica-python, in particular, how we've structured our tests to facilitate smooth testing that's fast, deterministic, and easy to use. So we've automated the download, installation, and deployment process of a Vertica Community Edition. And with a single click, you can run through the tests locally, and as part of the PR workflow via Travis CI. We also do this for multiple different Python environments. So for all Python versions from 2.7 up to 3.8, for different Python interpreters, and for different Linux distros, we're running through all of them very quickly with ease, thanks to all this automation. So today, you can see how we do it in vertica-python; in the future, we might want to spin that out into its own stand-alone testbed starter project, so that if you're starting any new Vertica integration, this might be a good starting point for you to get going quickly. So that brings us to some of the future work we want to do here in the open source space. Well, there's a lot of it. So in terms of the client stuff, for Python, we are marching towards our 1.0 release, which is when we aim to be protocol complete, to support all of Vertica's unique protocols, including COPY LOCAL and some new protocols invented to support complex types, which is a new feature in Vertica 10. We have some cursor enhancements, to do things like better streaming and improved performance. Beyond that, we want to take it where you want to bring it, so send us your requests. On the Go client front, it's just about a year behind Python in terms of its protocol implementation, but the basic operations are there. We still have more work to do to implement things like load balancing, some of the advanced auth methods, and other things. But there too, we want to work with you, and we want to focus on what's important to you, so that we can continue to grow and be more useful and more powerful over time. Finally, there's the question of, "Well, what about beyond database clients? What else might we want to do with open source?" If you're building a very deep or robust Vertica integration, you probably need to do a lot more exciting things than just run SQL queries and process the answers. Especially if you're an OEM, or you're a vendor that resells Vertica packaged as a black box piece of a larger solution, you might have to manage the whole operational lifecycle of Vertica. There are even fewer standards for doing all these different things, compared to the SQL clients. So we started with the SQL clients, because that's a well established pattern, and there's lots of downstream work that it can enable. But there's also clearly a need for lots of other open source protocols, architectures, and examples to show you how to do these things and to have real standards. So we talked a little bit about how you could do UDXs or testing or Machine Learning, but there are all sorts of other use cases too. That's why we're excited to announce here our Awesome Vertica list, which is a new collection of open source resources available on our GitHub page. So if you haven't heard of this awesome manifesto before, I highly recommend you check out the GitHub page on the right. We're not unique here; there are lots of awesome lists for all kinds of different tools and systems out there. And it's a great way to establish a community and share different resources, whether they're open source projects, blogs, examples, references, community resources, and all that. And this tool is an open source project itself. It's an open source wiki, and you can contribute to it by submitting a PR yourself. So we've seeded it with some of our favorite tools and projects out there, but there's plenty more out there, and we hope to see it grow over time. So definitely check this out and help us make it better. So with that, I'm going to wrap up. I wanted to thank you all. Special thanks to Siting Ren and Roger Huebner, who are the project leads for the Python and Go clients respectively. And also, thanks to all the customers out there who've already been contributing stuff.
This has already been going on for a long time and we hope to keep it going and keep it growing with your help. So if you want to talk to us, you can find us at this email address here. But of course, you can also find us on the Vertica forums, or you could talk to us on GitHub too. And there you can find links to all the different projects I talked about today. And so with that, I think we're going to wrap up and now we're going to hand it off for some Q&A.

Published Date : Mar 30 2020


Keep Data Private


 

>> Paige: Hello everybody and thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Keep Data Private: Prepare and Analyze Without Unencrypting With Voltage SecureData for Vertica. I'm Paige Roberts, Open Source Relations Manager at Vertica, and I'll be your host for this session. Joining me is Rich Gaston, Global Solutions Architect, Security, Risk, and Government at Voltage. And before we begin, I encourage you to submit your questions or comments during the virtual session; you don't have to wait till the end. Just type your question or comment, as it occurs to you, in the question box below the slide, and then click Submit. There'll be a Q&A session at the end of the presentation, where we'll try to answer as many of your questions as we're able to get to during the time. Any questions that we don't address, we'll do our best to answer offline. Now, if you want, you can visit the Vertica Forum to post your questions there after the session. That's going to take the place of the Developer Lounge, and our engineering team is planning to join the Forum to keep the conversation going. So as a reminder, you can also maximize your screen by clicking the double arrow button in the lower-right corner of the slides. That'll allow you to see the slides better. And before you ask, yes, this virtual session is being recorded and it will be available to view on-demand this week. We'll send you a notification as soon as it's ready. All right, let's get started. Over to you, Rich. >> Rich: Hey, thank you very much, Paige, and I appreciate the opportunity to discuss this topic with the audience. My name is Rich Gaston and I'm a Global Solutions Architect within the Micro Focus team, and I work on global data privacy and protection efforts for many different organizations looking to take that journey toward breach defense and regulatory compliance, on platforms ranging from mobile to mainframe, everything in between, cloud, you name it; we're there in terms of our solution sets. Vertica is one of our major partners in this space, and I'm very excited to talk with you today about our solutions on the Vertica platform. First, let's talk a little bit about what you're not going to learn today, and that is, on screen you'll see just part of the mathematics that goes into the format-preserving encryption algorithm. We are the originators and authors and patent holders on that algorithm. It came out of research from Stanford University, back in the '90s, and we are very proud to have taken it out into the market through the NIST standard process, and to license it to others. So we are the originators and maintainers of both standards, and a thought leader in the industry. We try to make this easy, and you don't have to learn any of this tough math. Behind this there are also many other layers of technology that are part of the security of the platform, such as stateless key management. That's a really complex area, and we make it very simple for you. We have very mature and powerful products in that space that really make your job quite easy when you want to implement our technology within Vertica. So today, our goal is to make data protection easy for you, to help you understand the basics of Voltage SecureData; you're going to learn how the Vertica UDx can help you get started quickly, and we're going to see some examples of how Vertica plus Voltage SecureData are working together in our customer cases out in the field.
First, let's take you through a quick introduction to Voltage SecureData: the business drivers, and what this is all about. First of all, we started off with breach defense. We see that despite continued investments in perimeter and platform security, data breaches continue to occur. Voltage SecureData plus Vertica provides defense in depth for sensitive data, and that's a key concept that we're going to keep referring to. In the security field, defense in depth is a standard approach for providing more layers of protection around sensitive assets, such as your data, and that's exactly what SecureData is designed to do. Now that we've come through many of these breach examples, and big ticket items getting in the news around breaches and their impact, the regulators have stepped up, and regulatory compliance is now a hot topic in data privacy. Regulations such as GDPR came online in 2018 for the EU. CCPA came online just this year, a couple months ago, for California, and is the de facto standard for the United States now, as organizations try to look at the best practices for providing regulatory compliance around data privacy and protection. These give massive new rights to consumers, but also obligations to organizations to protect that personal data. SecureData plus Vertica provides fine-grained authorization around sensitive data, and we're going to show you exactly how that works within the Vertica platform. At the bottom, you'll see some snippets of the news articles that just keep racking up, and our goal is to keep you out of the news, to keep your company safe, so that you can have the assurance that even if there is an unintentional or intentional breach of data out of the corporation, if it is protected by Voltage SecureData, it will be of no value to those hackers, and then you have no impact in terms of risk to the organization. What do we mean by defense in depth? Let's take a look first at the encryption types and the benefits that they provide, and we see our customers implementing all kinds of different protection mechanisms within the organization. You could be looking at disk level protection, file system protection, protection on the files themselves. You could protect the entire database, you could protect your transmissions as they go from the client to the server via TLS or other protected tunnels. And then we look at field-level encryption, and that's what we're talking about today. That's all the above protections at the perimeter level and at the platform level, plus we're giving you granular access control to your sensitive data. Our main message is: keep the data protected from the earliest possible point, and only access it when you have a valid business need to do so. That's a really critical aspect, as we see Vertica customers loading terabytes, petabytes of data into Vertica clusters, with the Vertica database able to give access to that data out to a wide variety of end users. We started off with organizations having four people in an office doing data science, or analytics, or data warehousing, or whatever it's called within an organization, and that's now ballooned out to a new customer coming in and telling us, we're going to have 1,000 people accessing it, plus service accounts accessing Vertica. We need to be able to provide fine-grained access control, and to understand what folks are doing with that sensitive data, and how we can secure it with the best practices possible.
In a very simple statement: Voltage protects data at rest and in motion. The encryption of data facilitates compliance, and it reduces your risk of breach. So if you take a look at what we mean by field level, we could take a name, and that name might not just be in US ASCII. Here we have a sort of Latin-1 extended example, Harold Potter, and we can take a look at the example protected data. Notice that we're taking a character set approach to protecting it, meaning I've got an alphanumeric option here for the format that I'm applying to that name. That gives me a mix of alpha and numeric, and plus, I've got some of that Latin-1 extended alphabet in there as well, and that's really controllable by the end customer. They can have this be just US ASCII, they can have it be numbers for numbers, you can have a wide variety of different protection mechanisms, including ignoring some characters in the alphabet in case you want to maintain formatting. We've got all the bells and whistles that you would ever want to put on top of format-preserving encryption, and we continue to add more to the platform as we go forward. Taking a look at tax ID, there's an example of numbers for numbers. Pretty basic, but it gives us the idea that we can very quickly and easily keep the data protected while maintaining the format. No schema changes are going to be required when you want to protect that data. If you look at credit card number, a really popular example, the same concept can be applied to tax ID: often the last four digits will be used in a tax ID to verify someone's identity. That could be on an automated telephone system, it could be a customer service representative just trying to validate the security of the customer, and we can keep that data in the clear for that purpose while protecting the entire string from breach. Dates are another critical area of concern for a lot of medical use cases, but we're also seeing date of birth included in a lot of data privacy conversations, and we can protect dates with dates: they're going to be a valid date, and we have some really nifty tools to maintain offsets between dates. So again, we've got real depth of capability within our encryption. It's not just a one size fits all approach. GPS location, customer ID, IP address, all of those kinds of data strings can be protected by Voltage SecureData within Vertica. Let's take a look at the UDx basics. So what are we doing when we add Voltage to Vertica? Vertica stays as is in the center. In fact, if you get the Vertica distribution, you're getting the SecureData UDx on board; you just need to enable it, and have the SecureData virtual appliance, that's the box there on the middle right. That's what we come in and add to the mix as we start to add those capabilities to Vertica. On the left hand side, you'll see that your users, your service accounts, your analytics are still typically doing SELECT, UPDATE, INSERT, DELETE type functionality within Vertica. And they're going to come into Vertica's access control layer, they're going to access those services via SQL, and we simply extend SQL for Vertica. So when you add the UDx, you get additional syntax that we provide, and we're going to show you examples of that. You can also integrate that with concepts like views within Vertica.
So that we can say: let's give a view of the data that presents the data in the clear, using the UDx to decrypt it, and let's give everybody else access to the raw data, which is protected. Third parties could be brought in, folks like contractors or folks that aren't vetted as closely as a security team would vet internal users for sensitive data access, and they could be given access to the Vertica cluster without risk of them breaching and going into some area they're not supposed to look at. Vertica has excellent access control, down even to the column level, which is phenomenal and really provides you with world class security around the Vertica solution itself. SecureData adds another layer of protection, like we're mentioning, so that we can have data protected in use, data protected at rest, and then we can have the ability to share that protected data throughout the organization. And that's really where SecureData shines: the ability to protect that data on mainframe, on mobile, on open systems, in the cloud, everywhere you want to have that data move to and from Vertica, you can have SecureData integrated with those endpoints as well. That's an additional solution on top of the SecureData plus Vertica solution that is bundled together today for sales purposes. But we can also have a conversation with you about those wider SecureData use cases; we'd be happy to talk to you about that. The SecureData virtual appliance is a lightweight appliance. It sits on something like eight cores, 16 gigs of RAM, and 100 or 200 gigs of disk; really a lightweight appliance, and you can have one or many. Most customers have four in production, just for redundancy; they don't need them for scale. But we have some customers with 16 or more in production, because they're running such high volumes of transaction load. They're running a lot of web service transactions, and they're running Vertica as well. So we're going to have those virtual appliances co-located around the globe, hooked up to all kinds of systems, like syslog, LDAP, load balancers; we've got a lot of capability within the appliance to fit into your enterprise IT landscape. So let me get you directly into the meat of what the UDx does. If you're technical and you know SQL, this is probably going to be pretty straightforward to you. You'll see the COPY command, used widely in Vertica to get data into Vertica. So let's protect that data when we're ingesting it. Let's grab it from maybe a CSV file and put it straight into Vertica, but protected on the way, and that's what the UDx does. We have VoltageSecureProtect, added syntax, like I mentioned, on top of Vertica SQL. And that allows us to say: we're going to protect the customer first name, using the parameters of hyper alphanumeric. That's our internal lingo for a format within SecureData; it's part of our API. The API requires very few inputs. The format is the one that you as a developer will be supplying, and you'll have different ones for maybe SSN, you'll have different formats for street address, but you can reuse a lot of your formats across a lot of your PII and PHI data types.
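A sketch of that ingest-time protection, reconstructed from the example as described; the table, file path, and format name are placeholders for whatever is configured on your SecureData appliance, and the exact UDx signature should be checked against the Voltage-for-Vertica docs:

```python
import vertica_python

conn_info = {'host': 'vertica.example.com', 'port': 5433,
             'user': 'etl_user', 'password': '...', 'database': 'vmart'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Protect first_name on the way in, so the raw value never lands on disk.
    cur.execute("""
        COPY customers (
            first_name_raw FILLER VARCHAR(64),
            first_name AS VoltageSecureProtect(first_name_raw
                              USING PARAMETERS format='AlphaNumeric')
        )
        FROM '/data/customers.csv' DELIMITER ',';
    """)
    conn.commit()
```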
We can do that with the UPDATE command, and simply, again, you'll notice VoltageSecureProtect — nothing too wild there, basically the same syntax. Next, querying protected data: how do we search once I've encrypted all my data? Well, actually, there's a pretty nifty trick to do so. If you want to be able to query protected data, and we have the search string, like a phone number there in this example, simply call VoltageSecureProtect on that; now you'll have the ciphertext, and you'll be able to search the stored ciphertext. Again, we're just format-preserving encrypting the data, and it's just a string, and we can always compare those strings using standard syntax in SQL. Using views to decrypt data is again a powerful concept in terms of how to make this work within the Vertica landscape, when you have a lot of different groups of users. Views are very powerful things to point a BI tool at; business intelligence tools — Cognos, Tableau, etc. — might be accessing data from Vertica with simple queries. Well, let's point them to a view that does the hard work and uses the Vertica nodes, and their horsepower of CPU and RAM, to actually run that UDx and do the decryption of the data in use, temporarily in memory, and then throw that away so that it can't be breached. That's a nice way to keep your users active and working and moving forward with their data access and data analytics, while also keeping the data secure in the process. And then we might want to export some data and push it out to someone in a cleartext manner. We've got a third party that needs to take the tax ID, along with some data, to do some processing. All we need to do is call VoltageSecureAccess — again, very similar to the protect call, passing the format parameter again — and boom, we have decrypted the data, using again the Vertica resources of RAM and CPU to do the work. All we're doing with the Voltage Secure Data appliance is a real simple little key fetch across a protected tunnel; that's a tiny atomic transaction, it gets done very quickly, and you're good to go. This is it in terms of the UDx: you have a couple of calls and one parameter to pass, everything else is config driven, and really, you're up and running very quickly. We can even do demos and samples of this Vertica UDx using hosted appliances that we put up for pre-sales purposes. So if folks want to get a demo going, we can take that UDx, configure it to point to our appliance sitting on the internet, and within a couple of minutes we're up and running with some simple use cases. Of course, for on-prem deployment, or deployment in the cloud, you'll want your own appliance in your own crypto district, with your own security, but it just shows that we can easily connect to any appliance and get this working in a matter of minutes. Let's take a deeper look at the Voltage plus Vertica solution, and we'll describe some of the use cases and the path to success. First of all, your steps to implementing data-centric security in Vertica. I want to note, there on the left-hand side: identify sensitive data. How do we do this? I have one customer where they look at me and say, Rich, we know exactly what our sensitive data is, we developed the schema, it's our own app, we have a customer table, we don't need any help with this.
We've got other customers that say, Rich, we have a very complex database environment, with multiple databases, multiple schemas, thousands of tables, hundreds of thousands of columns; it's really, really complex — help — and we don't know what people have been doing exactly with some of that data; we've got various teams that share this resource. There, we do have additional tools. I want to give a shout-out to another Micro Focus product, which is called Structured Data Manager. It's a great tool that helps you identify sensitive data, with some really amazing technology under the hood that can go into a Vertica repository, scan those tables, take a sample of rows or a full table scan, and give you back some really good reports of: we think this is sensitive, let's go confirm it and move forward with data protection. So if you need help on that, we've got the tools to do it. Once you identify that sensitive data, you're going to want to understand your data flows and your use cases. Take a look at what analytics you're doing today. What analytics do you want to do on sensitive data in the future? Let's start designing your analytics to work with sensitive data, and there are some tips and tricks that we can provide to help you mitigate any kind of concerns around performance, or any kind of concerns around rewriting your SQL. As you've noted, you can just simply insert our SQL additions into your code and you're off and running. Next, you'll want to install and configure the UDx and the Secure Data software appliance. Well, the UDx is pretty darn simple. The documentation on Vertica is publicly available; you can see how that works and what you need to configure it — one file here, and you're ready to go. So that's a pretty straightforward process. Then grant some access to the UDx, and that's really up to the customer, because there are many different ways to handle access control in Vertica; we're going to be flexible to fit within your model of access control when adding the UDx to your mix. Each customer is a little different there, so you might want to talk with us a little bit about the best practices for your use cases. But in general, that's going to be up and running in just a minute. The Secure Data software appliance is a hardened Linux appliance today; it sits on-prem or in the cloud, and you can deploy it. I've seen it done in 15 minutes, but that's with the right technical access already in place — being able to set the firewall and all the DNS entries, the basic blocking and tackling of standing up a software appliance. Corporations can take care of all of that in just a couple of weeks; when it takes longer, it's usually because they're waiting on other teams. But the software appliances are really fast to get stood up, and they're very simple to administer with our web-based GUI. Then finally, you're going to implement your UDx use cases. Once the software appliance is up and running, we can set authentication methods, we can set up the formats that you're going to use in Vertica, and then those two start talking together. And it should be going in dev and test in about half a day, and then you're running toward production in just a matter of days, in most cases. We've got other customers that say, hey, this is going to be a bigger migration project for us; we might want to split this up into chunks.
Let's do the really sensitive and scary data, like tax ID, first, as a sort of toe-in-the-water approach, and then we'll come back and protect other data elements. That's one way to slice and dice and implement your solution in a planned manner. Another way is schema based. Let's take a look at this section of the schema and implement protection on these data elements; now let's take a look at a different schema and repeat the process, so you can iteratively move forward with your deployment. So what's the added value when you add Vertica plus Voltage? I want to highlight this distinction, because Vertica contains world-class security controls around the database. I'm an old-time DBA from a different product, competing against Vertica in the past, and I'm really aware of the granular access controls that are provided within various platforms. Vertica would rank at the very top of the list in terms of being able to give me very tight control, and a lot of different access methods, to protect the data in a lot of different use cases. So Vertica can handle a lot of your data protection needs right out of the box. Voltage Secure Data, as we keep mentioning, adds that defense in depth, and it's going to enable those enterprise-wide use cases as well. So first off, I mentioned this: the standard of FF1, that is, format-preserving encryption. We're the authors of it, we continue to maintain it, and we want to emphasize that customers really ought to be very, very careful to choose a NIST standard when implementing any kind of encryption within the organization. So AES was one of the first hallmark, benchmark encryption algorithms, and in 2016 we were added to that mix as FF1. If you search NIST and Voltage Security, you'll see us right there as the author of the standard, and all the processes that went along with that approval. We have centralized policy for key management, authentication, audit, and compliance. We can see that Vertica selected or fetched the key to protect some data at this date and time; we can track that and give you audit and compliance reporting against that data. You can move protected data into and out of Vertica. So we can ingest via Kafka, via NiFi, or ingest on StreamSets; there are a variety of different ingestion and streaming methods that can get data into Vertica, and we can integrate Secure Data with all of those components. We're very well suited to integrate with any Hadoop technology or any big data technology, as we have APIs in a variety of languages, bitnesses, and platforms. So we've got that all out of the box, ready to go for you, if you need it. When you're moving data out of Vertica, you might move it into an open systems platform, you might move it to the cloud; we can also operate and do the decryption there, and you're going to get the same plaintext back. And if you protect data over in the cloud and move it into Vertica, you're going to be able to decrypt it in Vertica. That's our cross-platform promise. We've been delivering on that for many, many years, and we now have many, many endpoints that do that in production for the world's largest organizations. We're going to preserve your data format and referential integrity.
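It's worth pausing on what format and referential-integrity preservation buys you in practice. Because format-preserving encryption is deterministic for a given format, equal plaintexts protect to equal ciphertexts, so equality predicates and joins work on protected columns without any decryption. A hedged sketch — the table and column names here are hypothetical:

    -- Join two tables on a protected column; both sides hold ciphertext.
    -- To look up a known value, protect the search term and compare strings.
    SELECT o.order_id, o.order_total
    FROM orders o
    JOIN customers c ON o.cust_ssn = c.ssn
    WHERE c.ssn = VoltageSecureProtect('123-45-6789' USING PARAMETERS format='ssn');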
So if I protect my social security number today, I can protect another batch of data tomorrow, and that same ciphertext will be generated when I put that into Vertica. I can have absolute referential integrity on that data, allowing analytics to occur without even decrypting the data in many cases. And we have decrypt access for authorized users only, with the ability to add LDAP authentication and authorization for UDx users. So you can really have a number of different approaches and flavors of how you implement Voltage within Vertica, but what you're getting is the additional confidence that we've got the data protected at rest — even if I have a DBA that's not vetted, or someone new, or a third party I don't know much about, being provided DBA-level privilege. They could select star all day long, and they're going to get ciphertext; they're going to have nothing of any value, and if they want to use the UDx to decrypt it, they're going to be tracked and traced as to their utilization of it. So it allows us to have that control and an additional layer of security on your sensitive data. This may be required by regulatory agencies, and we're seeing compliance audits get more and more strict every year. GDPR was kind of funny, because they said in 2016, hey, this is coming; they said in 2018, it's here; and now they're saying in 2020, hey, we're serious about this, and the fines are mounting. And let's give you some examples to help you understand that these regulations are real, the fines are real, and your reputational damage can be significant if you were to be in breach of a regulatory compliance requirement. We're finding so many different use cases now popping up around regional protection of data: I need to protect this data so that it cannot go offshore; I need to protect this data so that people from another region cannot see it. That's all capability that we have within Secure Data that we can add to Vertica. We have that broad platform support, and I mentioned NiFi and Kafka; those would be on the left-hand side, as we start to ingest data from applications into Vertica. We can have landing zone approaches, where we provide some automated scripting at an OS level to protect ETL batch transactions coming in. We can protect within the Vertica UDx, as I mentioned, with the COPY command, directly using Vertica. Everything inside that dot-dash line is the Vertica plus Voltage Secure Data combo that's sold together as a single package. Additionally, we'd love to talk with you about the stuff that's outside the dashed box, because we have dozens and dozens of endpoints that can protect and access data on many different platforms. And this is where you really start to leverage some of the extensive power of Secure Data: going cross-platform to handle your web-based apps, to handle apps in the cloud, and to handle all of this at scale, with hundreds of thousands of transactions per second of format-preserving encryption. That may not sound like much, but when you take a look at the algorithm, what we're doing on the mathematics side, when you look at everything that goes into that transaction — to me, it's an amazing accomplishment that we're reaching those kinds of levels of scale. And with Vertica, it scales horizontally: the more nodes you add, the more power you get, and the more throughput you're going to get from Voltage Secure Data.
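Pulling the decrypt-side patterns together, here is a hedged sketch of the view approach described earlier: BI tools query the view, decryption runs in memory on the Vertica nodes, and anyone without the grant sees only ciphertext in the base table. Object names and format parameters are hypothetical; check the exact UDx spelling in your installation.

    -- A view that decrypts on the fly for vetted users.
    CREATE VIEW customers_clear AS
    SELECT
        VoltageSecureAccess(first_name USING PARAMETERS format='alphanumeric') AS first_name,
        VoltageSecureAccess(ssn        USING PARAMETERS format='ssn')          AS ssn,
        signup_date
    FROM customers;

    GRANT SELECT ON customers_clear TO analyst_role;  -- authorized users only

    -- A one-off decrypt for an authorized export uses the same call:
    SELECT VoltageSecureAccess(tax_id USING PARAMETERS format='ssn') AS tax_id
    FROM vendors;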
I want to highlight the next steps, on how we can continue to move forward. Our Secure Data team is available to you, to talk about the landscape, your use cases, your data. We really love the fact that we've got so many different organizations out there using Secure Data in so many different and unique ways. We have vehicle manufacturers who are protecting not just the VIN, not just their customer data, but in fact sensor data from the vehicles, which is sent over the network down to the home base every 15 minutes, for every vehicle that's on the road — and every vehicle of this customer of ours, since 2017, has included that capability. So now we're talking about additional millions and millions of units coming online as those cars are sold, distributed, and used by customers. That sensor data is critical to the customer, and they cannot let it be exfiltrated in the clear. So they protect that data with Secure Data, and we have a great track record of being able to meet a variety of unique requirements — whether it's IoT, web-based apps, e-commerce, healthcare, all kinds of different industries. We would love to help move the conversations forward, and we do find that it's really a three-party discussion: the customer, Secure Data experts in some cases, and the Vertica team. We have great enablement within the Vertica team, to explain and present our Secure Data solution to you. But we also have the ability to bring other experts in, to take that conversation into a broader perspective of: how can I protect my data across all my platforms, not just in Vertica? I want to give a shout-out to our friends at Vertica Academy. They're building out great demo and training facilities to help you learn more about these UDx's and how they're implemented. The Academy is a terrific reference and resource for your teams to learn more about the solution in a self-guided way, and then we'd love to have your feedback on that. How can we help you more? What are the topics you'd like to learn more about? How can we look to the future in protecting unstructured data? How can we look to the future of protecting data at scale? What are the requirements that we need to be meeting? Help us through the learning process, and through feedback to the team, get better, and then we'll help you deliver more solutions out to those endpoints and protect that data, so that we're not having data breaches and we're not having regulatory compliance concerns. And then lastly, learn more about the UDx. I mentioned that all of our content there is online and available to the public. So at vertica.com/secureData, you're going to be able to walk through the basics of the UDx. You're going to see how simple it is to set up, what the UDx syntax looks like, how to grant access to it, and then you'll start to be able to figure out: hey, how can I start to put this into a POC in my own environment? Like I mentioned before, we have a publicly available hosted appliance, for demo purposes, that we can make available to you if you want to POC this. Reach out to us. Let's get a conversation going, and we'll get you the address and some instructions; we can have a quick enablement session.
We really want to make this accessible to you, and help demystify the concept of encryption, because when you see it as a developer, and you start to get your hands on it and put it to use, you can very quickly see: huh, I could use this in a variety of different cases, and I could use this to protect my data without impacting my analytics. Those are some of the really big concerns that folks have, and once we get through that learning process, and playing around with it in a POC way, we can start to really put it into practice, into production, to say with confidence: we're going to move forward toward data encryption, and have a very good result at the end of the day. This is one of the things I find with customers that's really interesting. Their biggest stress is not around the timeframe or the resources; it's really around: this is my data, I have been working on collecting this data and making it available in a very high-quality way for many years. This is my job and I'm responsible for this data, and now you're telling me you're going to encrypt that data? It makes me nervous. And that's common — everybody feels that. So we want to have that conversation, and that sort of trial-and-error process, to say: hey, let's get your feet wet with it and see how you like it in a sandbox environment. Let's then take that into analytics, and take a look at how we can make this go for a quick 1.0 release, and let's then take a look at future expansions to that, where we start adding Kafka on the ingest side, where we start sending data off into other machine learning and analytics platforms that we might want to utilize outside of Vertica for certain purposes, in certain industries. Let's take a look at those use cases together, and through that journey we can really chart a path toward the future, where we can really help you protect that data at rest and in use, and keep you safe from both the hackers and the regulators — and that, I think, at the end of the day, is really what it's all about in terms of protecting our data within Vertica. We're going to have a couple of minutes for Q&A, and we would encourage you to ask any questions here; we'd love to follow up with you more about any questions you might have about Vertica plus Voltage Secure Data. Thank you very much for your time today.

Published Date : Mar 30 2020


A Deep Dive into the Vertica Management Console Enhancements and Roadmap


 

>> Jeff: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "A Deep Dive into the Vertica Management Console Enhancements and Roadmap." I'm Jeff Healey of Vertica Marketing. I'll be your host for this breakout session. Joining me are Bhavik Gandhi and Natalia Stavisky from Vertica engineering. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait — just type your question or comment in the question box below the slides and click submit. There will be a Q&A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica Forums at forum.vertica.com and post your question there after the session. Our engineering team is planning to join the forums to keep the conversation going well after the event. Also, a reminder that you can maximize the screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to you on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Bhavik. >> Bhavik: All right. So hello, and welcome, everybody, to this presentation of "A Deep Dive into the Vertica Management Console Enhancements and Roadmap." Myself, Bhavik, and my team member, Natalia Stavisky, will go over a few useful announcements on Vertica Management Console, discussing a few real scenarios. All right. So today we will go forward with a brief introduction to the Management Console, then we will discuss the benefits of using Management Console by going over a couple of user scenarios: a query taking too long to run, and receiving email alerts from Management Console. Then we will go over a few MC features for what we call Eon Mode databases, like provisioning and reviving Eon Mode databases from MC, managing subclusters, and understanding the Depot. Then we will go over some of the future announcements on MC that we are planning. All right, so let's get started. All right. So, do you want to know how to provision a new Vertica cluster from MC? How to analyze and understand a database workload by monitoring the queries on the database? How to balance the resource pools and use alerts and thresholds on MC? The Management Console is basically our answer, and we'll talk about its capabilities and new announcements in this presentation. So just to give a brief overview of the Management Console: who uses Management Console? It's generally used by IT administrators and DB admins. Management Console can be used to monitor both Eon Mode and Enterprise Mode databases. Why use Management Console? You can use Management Console for provisioning Vertica databases and clusters. You can manage the already existing Vertica databases and clusters you have, and you can use various tools on Management Console like query execution, Database Designer, and Workload Analyzer, and set up alerts and thresholds to get notified about activities on the MC. So let's go over a few benefits of using Management Console. Okay. So using Management Console, you can view and optimize resource pool usage. Management Console helps you to identify critical conditions on your Vertica cluster.
Additionally, you can set up various thresholds in MC and get alerted if those thresholds are triggered on the database. So now let's dig into a couple of scenarios. In the first scenario, we will discuss queries taking too long, and using Workload Analyzer to possibly help solve the problem. In the second scenario, we will go over an alert email that you received from your Management Console, analyzing the problem, and taking the required actions to solve it. So let's go over the scenario where queries are taking too long to run. In this example, we have one query that we are running using query execution on MC. And for some reason we notice that it's taking about 14.8 seconds to execute, which is higher than the expected run time of the query. The query that we are running happens to be the query used by MC during extended monitoring. Notice the table name, ds_requests_issued, which is in the schema used for extended monitoring. Now, in 10.0 MC we have redesigned the Workload Analyzer and Recommendations feature to show the recommendations and allow you to execute them. In our example, we have taken the table name and filtered the tuning descriptions to see if there are any tuning recommendations related to this table. As we see over here, there are three tuning recommendations available for that table. So now, in 10.0 MC, you can select those recommendations and then run them. So let's run the recommendations. All right. Once the recommendations have run successfully, you can go and see all the processed recommendations that you have run previously. Over here we see that the three recommendations we had selected earlier were successfully processed. Now we take the same query and run it in query execution on MC and, hey, it's running much faster: we see that it takes only 0.3 seconds to run the query, which is about a 98% decrease from the original runtime. So in this example we saw that using the Workload Analyzer tool on MC, you can possibly triage and solve issues for your queries that are taking too long to execute. All right. So now let's go over another user scenario, where a DB admin received some alert email messages from MC and would like to understand and analyze the problem. To know more about what's going on in the database and proactively react to problems, DB admins using the Management Console can create a set of thresholds, get alerted about conditions on the database if a threshold value is reached, and then respond to the problem thereafter. Now, as a DB admin, I see some email notifications from MC, and upon checking the emails, I see that there are a couple of email alerts received from MC. So one of the messages that I received was for Query Resource Rejections greater than 5 for pool midpool7. And then, around the same time, I received another email from the MC for Failed Queries greater than 5 — and in this case I see there are 80 failed queries. So now let's go to the MC and investigate the problem. Before going into the deep investigation of the failures, let's review the threshold settings on MC. As we see, we have set up a threshold under the database settings page for failed queries in the last 10 minutes greater than 5, and MC should send an email to the individual if the threshold is triggered.
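Stepping back to the Workload Analyzer scenario for a moment: the same tuning recommendations the MC surfaces can also be produced directly in SQL with Vertica's ANALYZE_WORKLOAD function, which returns tuning descriptions and the commands to apply them. A hedged sketch — the scope shown is illustrative:

    -- Hedged sketch: ask Workload Analyzer for recommendations on one table.
    -- Pass '' instead to analyze the whole catalog.
    SELECT ANALYZE_WORKLOAD('public.ds_requests_issued');

Back to the alert thresholds: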
And also we have a threshold set up for query resource rejections in the last five minutes for midpool7, set to greater than 5. There are various other thresholds on this page that you can set if you desire. Now let's go and triage those email alerts about the failed queries and resource rejections that we received. To analyze the failed queries, let's take a look at the query statistics page on the database Overview page on MC. Let's take a look at the Resource Pools graph, and especially at the failed queries for each resource pool. Over to the right, under the failed query section, I see that in the last 24 hours there are about 6,000 failed queries for midpool7. Now I switch the view to see the statistics for each user, and on this page I see that user MaryLee, on the right-hand side, has a high number of failed queries in the last 24 hours. To learn more about the failed queries for this user, I can click on the graph for this user and get the reasons behind them. So let's click on the graph and see what's going on. Clicking on this graph takes me to the failed queries view on the Query Monitoring page for the database, on the Database Activities tab. And over here, I see there is a high number of failed queries for this user, MaryLee, with the reason stated as "exceeding high limit." To drill down and get more of the reasons behind it, I can click on the plus icon on the left-hand side for each failed query to get the failure reason for each node in the database. So let's do that. Clicking the plus icon, I see, for the two nodes that are listed, that there are insufficient resources, like memory and file handles, for midpool7. Now let's go and analyze the midpool7 configuration and the activity on it. To do so, I will go over to the Resource Pool Monitoring view and select midpool7. I see the resource allocations for this resource pool are very low. For example, the max memory is just 1MB and the max concurrency is set to 0. Hmm, that's a very odd configuration for this resource pool. Also, the bottom right graph of resource rejections for midpool7 shows very high values. All right. So since we saw some odd configurations and odd resource allocations for midpool7, I would like to see when the settings were changed on the resource pool. To do this, I can review the audit logs available on the Management Console. I can go into the Vertica Audit Logs and see the logs for the resource pool, filtering the logs for midpool7. I see that on February 17th, the memory and other attributes for midpool7 were modified. So now let's analyze the resource activity for midpool7 around the time when the configurations were changed. In our case we are using extended monitoring on MC for this database, so we can go back in time and see the statistics over a larger time range for midpool7. Viewing the activity for midpool7 around February 17th, around the time when these configurations were changed, we see a decrease in resource pool usage. Also, on the bottom right, we see the resource rejections for midpool7 show a linear increase after the configurations were changed. I can select a point on the graph to get more details about the resource rejections. Now, to analyze the effects of the modifications on midpool7, let's go over to the Query Monitoring page.
All right, I will adjust the time range around the time when the configurations were changed for midpool7 and look at the completed queries for user MaryLee. And I see there are no completed queries for this user. Now I take a look at the Failed Queries tab, adjusting the time range around the time when the configurations were changed — I can do so because we are using extended monitoring. So again, adjusting the time, I can see there is a high number of failed queries for this user: about 10,000 failed queries after the configurations were changed on this resource pool. So now let's go and modify the settings, since we know that after the configurations were changed, this user was not able to run her queries. You can change the resource pool settings using Management Console's database settings page, under the Resource Pools tab. Selecting midpool7, I see the same odd configurations for this resource pool that we saw earlier. So now let's go and modify the settings. I will increase the max memory and modify the settings for midpool7 so that it has adequate resources to run the queries for the user, and hit Apply at the top right to save the settings. Now let's do the validation after we changed the resource pool attributes. Let's go over to the same Query Monitoring page and see if user MaryLee is able to run her queries on midpool7. We see that now, after we changed the configuration for midpool7, the user can run her queries successfully, and the count of Completed Queries has increased. Also, viewing the Resource Pool Monitoring page, we can validate that the new configuration for midpool7 has been applied and that resource pool usage has increased since the change. And on the bottom right graph, we can see that the resource rejections for midpool7 have decreased over time since we modified the settings. And since we are using extended monitoring for this database, I can see the trend in the data for this resource pool — the before and after effects of modifying the settings. So initially, when the settings were first changed, there were high resource rejections, and after we modified the settings again, the resource rejections went down. Right. So now let's go work with provisioning and reviving an Eon Mode Vertica database cluster using the Management Console on different platforms. Management Console supports provisioning and reviving of Eon Mode databases in various cloud environments, like AWS, the Google Cloud Platform, and on Pure Storage. For provisioning the Vertica Management Console on Google Cloud Platform, you can use a launch template; in an AWS environment, you can use the CloudFormation templates available for different OS's. Once you have provisioned Vertica Management Console, you can provision Vertica clusters and databases from MC itself. To provision a Vertica cluster, you select the Create new database button available on the homepage. This will open up the wizard to create a new database and cluster. In this example, we are using the Google Cloud Platform, so the wizard will ask me for various authentication parameters for the Google Cloud Platform; if you're in an AWS environment, it'll ask you for the authentication parameters for the AWS environment. And going forward in the wizard, it'll ask me to select the instance type.
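One aside before the wizard continues: the midpool7 fix applied above through the MC settings page corresponds to a plain ALTER RESOURCE POOL statement that you could equally run yourself. A hedged sketch, with illustrative values only:

    -- Hedged sketch: give midpool7 workable memory and concurrency limits.
    -- The specific sizes here are illustrative, not the values used in the demo.
    ALTER RESOURCE POOL midpool7
        MEMORYSIZE '4G'
        MAXMEMORYSIZE '8G'
        MAXCONCURRENCY 10;

Back to the provisioning wizard: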
I will select the instance type for the new Vertica cluster, and also provide the communal location URL for my Eon Mode database and all the other preferences related to the new cluster. Once I have selected all the preferences for my new cluster, I can preview the settings and hit Create if all looks okay. If I hit Create, MC will create new GCP instances — because we are in the GCP environment in this example — create a cluster on those instances, and create a Vertica Eon Mode database on that cluster. And additionally, you can load test data onto it if you'd like. Now let's go over reviving an existing Eon Mode database from the communal location. You can do the same using the Management Console, by selecting the Revive Eon Mode database button on the homepage. This will again open up a wizard, for reviving the Eon Mode database. Again, in this example, since we are using the GCP Platform, it will ask me for the Google Cloud Storage authentication attributes. And for reviving, it will ask me for the communal location, so I can enter the Google Storage bucket and my folder, and it will discover all the Eon Mode databases located under that folder. I can select one of the databases that I would like to revive, and it will ask me for other Vertica preferences for this database revive. Once I enter all the preferences and review them, I can hit the Revive the database button on the wizard. After I hit Revive database, it will create the GCP instances; the number of GCP instances it creates will be the same as the number of hosts on the original Vertica cluster. It will install the Vertica cluster on these instances, revive the database, and start the database. After starting the database, it will be imported into the MC so you can start monitoring it. So in this example, we saw that you can provision and revive a Vertica database on the GCP Platform; additionally, you can use an AWS environment to provision and revive. So now, since we have the Eon Mode database on MC, Natalia will go over some Eon Mode features on MC, like managing subclusters and Depot activity monitoring. Over to you, Natalia. >> Natalia: Okay, thank you. Hello, my name is Natalia Stavisky. I am also a member of the Vertica Management Console team. And I will talk today about the work I did to allow users to manage subclusters using the Management Console, and also the work I did to help users understand what's going on in their Depot in a Vertica Eon Mode database. So let's look at the picture of the subclusters. On the Manage page of Vertica Management Console, you can see here a page that has blue tabs, and the tab that's active is Subclusters. You can see that there are two subclusters available in this database. And for each of the subclusters, you can see subcluster properties: whether this is the primary subcluster or a secondary one. In this case, primary is the default subcluster; it's indicated by a star. You can see what nodes belong to each subcluster, and you can see the node state and node statistics. You can also easily add a new subcluster, and we're quickly going to do this. Once you click on the button, you'll launch the wizard that'll take you through the steps. You'll enter the name of the subcluster and indicate whether this is a secondary or primary subcluster. I should mention that Vertica recommends having only one primary subcluster, but we have both options here available.
You will enter the number of nodes for your subcluster, and once the subcluster has been created, you can manage it. What other options for managing subclusters do we have here? You can scale up an existing subcluster — that's a similar approach: you launch the wizard and specify the nodes you want to add to your existing subcluster. You can scale down a subcluster, and MC validates the requirements for maintaining a minimal number of nodes, to prevent database shutdown. So if you cannot remove any nodes from a subcluster, this option will not be available. You can stop a subcluster, and depending on whether this is a primary or secondary subcluster, this option may or may not be available. As in this picture, we can see that for the default subcluster this option is not available, and this is because shutting down the default subcluster would cause the database to shut down as well. You can terminate a subcluster, and again, the MC warns you not to terminate the primary subcluster and validates the requirements for maintaining a minimal number of nodes, to prevent database shutdown. So now we are going to talk a little more about how the MC helps you understand what's going on in your Depot. The Depot is one of the core components of an Eon Mode database. What are the frequently asked questions about the Depot? Is the Depot size sufficient? Are a subset of users putting a high load on the database? What tables are fetched and evicted repeatedly — we call it "re-fetched" — in the Depot? Here, on the Depot Activity Monitoring page, we now have four tabs that allow you to answer those questions. We'll go through each of them in more detail, but I'll just mention what they are for now. At a Glance shows you basic Depot configuration and also shows you query execution. Depot Efficiency — we'll talk more about that with the other tabs. Depot Content shows you what tables are currently in your Depot. And Depot Pinning allows you to see what pinning policies have been created, and to create new pinning policies. Now let's go through a scenario: monitoring performance of workloads on one subcluster. As you know, an Eon Mode database allows you to have multiple subclusters, and we'll explore how this feature is useful and how we can use the Management Console to make decisions about whether you would like multiple subclusters. So here we have, in my setup, a single subcluster called default_subcluster. It has two users that are running queries accessing tables, mostly in schema public. The queries started executing, and we can see that after fetching tables from Communal, which is the red line, the rest of the time the queries are executing in the Depot. The green line indicates queries running in the Depot. The Depot on all nodes is about 88% full — a steady flow — and the Depot size seems to be sufficient for query execution from the Depot only. That's the good-case scenario. Now, at around 17:15, user Sherry got an urgent request to generate a report, and she started running her queries. We can see that the picture is quite different now. The tables Sherry is querying are in a different schema and are much larger. Now we can see multiple lines in different colors. We can see a bunch of fetches and evictions, which are indicated by blue and purple bars, and a lot of queries are now spilling into Communal — those are the red and orange lines. The orange line is an indicator of a query running partially in the Depot and partially getting fetched from Communal.
And the red line is data fetched from Communal storage. Let's click on one of the lines. Each data point, each point on a line, will take you to the Query Details page, where you can see more about what's going on. This is the page that shows us what queries have been run in this particular time interval, which is shown at the top of the page in orange. That's about a one-minute time interval, and now we can see user Sherry among the users that are running queries. Sherry's queries involve large tables and are running against a different schema — we can see the clickstream schema in part of the query request. So what is happening? There is not enough Depot space for both the schema that's already in use and the one Sherry needs. As a result, evictions and fetches have started occurring. What other questions can we ask ourselves to help us understand what's going on? How about: what tables are most frequently re-fetched? For that, we will go to the Depot Efficiency page and look at the middle chart; we can see a larger version of this chart if we expand it. Now we have the 10 tables listed that are most frequently being re-fetched. We can see that there is the clickstream schema and there are other schemas, so all of those tables are being used in the queries, fetched, and then — since there is not enough space in the Depot — getting evicted and re-fetched again. So what can be done to enable all queries to run in the Depot? Option one is to increase the Depot size. We can do this by running a query that specifies the nodes, the storage location, and the new Depot size; and I should mention that we can run this query from the Management Console, from the query execution page. That would help us increase the Depot size. What other options do we have when increasing the Depot size is not an option? We can also provision a second subcluster to isolate workloads like Sherry's. So we are going to do this now, and we will provision a second subcluster using the Manage page. Here we're creating a subcluster for Sherry, or for workloads like hers, and we're going to create it. So Sherry's subcluster has been created; we can see it here, added to the list of the subclusters. It's a secondary subcluster. Sherry has been instructed to use the new SherrySubcluster for her work. Now let's see what happened. We'll go again to the Depot Activity page, and we'll look at the At a Glance tab. We can see that around 18:07, Sherry switched to running her queries on SherrySubcluster. At the top of this page, you can see the subcluster selector. So we currently have two subclusters, and I'm looking at what happened to SherrySubcluster once it was provisioned. Sherry started using it, and after the initial fetching into the Depot from Communal, which was the red line, all of Sherry's queries fit in the Depot, which is indicated by the green line. The Depot is also pretty full on those nodes, about 90% full, but the queries are processed efficiently; there is no spilling into Communal. So that's a good-case scenario. Let's now go back and take a look at the original subcluster, the default subcluster. On the left portion of the chart we can see multiple lines; that was the activity before Sherry switched to her own designated subcluster. At around 18:07, after Sherry switched to her designated subcluster, she is no longer using the default subcluster, and she is not putting a load on it.
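For reference, the Depot-resize query mentioned in option one is, presumably, Vertica's ALTER_LOCATION_SIZE meta-function; the sketch below is hedged, and the node name and size are illustrative.

    -- Hedged sketch of "option one": grow the depot cache on one node.
    -- Passing '' as the node name would apply the new size on all nodes.
    SELECT ALTER_LOCATION_SIZE('depot', 'v_vmart_node0001', '150G');

Returning to the default subcluster's activity after the switch: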
So the lines after that turn a green color, which means the queries still running in the default subcluster are all running in the Depot. We can also see that the Depot fetches and evictions bars, those purple and blue bars, are no longer showing significant numbers. We can also check the second chart, which shows Communal Storage Access, and we can see that those bars have dropped too, so there is no significant access to Communal Storage. So this problem has been solved: each of the subclusters is serving queries from the Depot, and that's our most efficient scenario. Let's also look at the other tabs that we have for Depot monitoring. Let's look at the Depot Efficiency tab. It has six charts, and I'll go through each one of them quickly. File Reads by Location gives an indicator of where the majority of query execution took place, in the Depot or in Communal. Top 10 Re-Fetches into Depot — as in the charts earlier in our use case — shows the tables that are most frequently fetched, evicted, and then fetched again; these are good candidates to get pinned if increasing the Depot size is not an option. Note that both of these charts have an option to select a time interval using a calendar widget, so you can get information about the activity that happened during that interval. Depot Pinning shows what portion of your Depot is pinned, both by byte count and by table count. And the three tables at the bottom show Depot structure: how long tables stay in the Depot (we would like tables to be fetched into the Depot and stay there for a long time); how often they are accessed (again, we would like to see the tables in the Depot accessed frequently); and the size range of tables in the Depot. Depot Content: this tab allows us to search for tables that are currently in the Depot, and also to see stats like table size in the Depot, how often tables are accessed, and when they were last accessed. The same information that's available for tables in the Depot is also available at the projection and partition levels for those tables. Depot Pinning: this tab allows users to see what policies currently exist; you can do this by clicking on the first little button and clicking Search, which will show you all existing policies that have already been created. The second option allows you to search for a table and create a policy. You can also use the action column to modify existing policies or delete them. And the third option provides details about the most frequently re-fetched tables, including fetch count, total access count, and number of re-fetched bytes. So all this information can help you make decisions about pinning specific tables. So that's about it on the Depot. And I should mention that the server team also has a very good webinar on Eon Mode database Depot management and subcluster management; I strongly recommend attending it or downloading the slide presentation. Let's talk quickly about the Management Console roadmap, and what we are planning to do in the future. We are going to continue focusing on subcluster management — there are still a lot of things we can do here: promoting and demoting subclusters, load balancing across subclusters, scheduling subcluster actions, and support for large cluster mode. We'll continue working on Workload Analyzer enhancements and recommendations, backup and restore from the MC, building custom thresholds, and Eon on HDFS support. Okay, so we are ready now to take any questions you may have. Thank you.

Published Date : Mar 30 2020


The Next-Generation Data Underlying Architecture


 

>> Paige: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Vertica Next-Generation Architecture." I'm Paige Roberts, Open Source Relations Manager at Vertica, and I'll be your host for this session. Joining me is Vertica Chief Architect, Chuck Bear. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait — just type your question or comment in the question box that's below the slides and click submit. So as you think about it, go ahead and type it in. There'll be a Q&A session at the end of the presentation, where we'll answer as many questions as we're able to during the time. Any questions that we don't get a chance to address, we'll do our best to answer offline. Or alternatively, you can visit the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forum and keep the conversation going, so it's sort of like the developer's lounge would be at a live conference: it gives you a chance to talk to our engineering team. Also, as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And before you ask: yes, this virtual session is being recorded, and it will be available to view on demand this week. We'll send you a notification as soon as it's ready. Okay, now let's get started. Over to you, Chuck. >> Chuck: Thanks for the introduction, Paige. Vertica's vision is to help customers get value from structured data. This vision is simple. It doesn't matter what vertical the customer is in — they're all analytics companies. It doesn't matter what the customer's environment is, as data is generated everywhere. We also can't do this alone; we know that you need other tools and people to build a complete solution. You know our database is key to delivering on the vision, because we need a database that scales. When you start a new database company, you aren't going to win against 30-year-old products on features. But from day one, we had something else: an architecture built for analytics performance. This architecture was inspired by the C-Store project, combining the best design ideas from academics and industry veterans like Dr. Mike Stonebraker. Our storage is optimized for performance, and we use many computers in parallel. After over 10 years of refinement against various customer workloads, much of the design has held up, and serendipitously, the fact that we don't do in-place updates set Vertica up for success in the cloud as well. These days, there are other tools that embody some of these design ideas, but we have other strengths that are more important than the storage format: we're the only good analytics database that runs both on premise and in the cloud, giving customers the option to migrate their workloads to the most convenient and economical environment, and we're a full data management solution, not just a query tool. Unlike some other choices, ours comes with integration with the SQL ecosystem and full professional support. We organize our product roadmap into four key pillars, plus the cross-cutting concerns of open integration and performance and scale. We have big plans to strengthen Vertica while staying true to our core.
This presentation is primarily about the separation pillar, and performance and scale. I'll cover our plans for Eon, our data management architecture, data mart analytic clusters, our fifth-generation query executor, and our data storage layer. Let's start with how Vertica manages data. One of the central design points for Vertica was shared-nothing: a design that didn't utilize dedicated hardware shared-disk technology. The quote here is how Mike put it politely, but around the Vertica office, it was "shared disk over Mike's dead body." And we did get some early field experience with shared disk — customers will, in fact, run on anything if you let them. There were misconfigurations that required certified experts, and obscure bugs. Another thing about the shared-nothing design for commodity hardware, though — and this was in the papers — is that all the data management features, like fault tolerance, backup, and elasticity, have to be done in software. And no matter how much you do, procuring, configuring, and maintaining the machines with disks is harder. The software configuration process to add more servers may be simple, but capacity planning, racking, and stacking is not. The original allure of shared storage returned; this time, though, the complexity and economics are different. It's cheaper: you provision storage with a few clicks and only pay for what you need. It expands, contracts, and moves the maintenance of the storage close to a team that's good at it. But there's a key difference: it's an object store, and object stores don't support the APIs and access patterns used by most database software. So another Vertica visionary, Ben, set out to exploit Vertica's storage organization, which turns out to be a natural fit for modern cloud shared storage. Because Vertica data files are written once and not updated, they match the object storage model perfectly. And so today we have Eon. Eon uses shared storage to hold Vertica data, with local-disk depots that act as caches, ensuring that we can get the performance that our customers have come to expect. Essentially, Eon and Enterprise behave similarly, but we have the benefit of flexible storage. Today, Eon has the features our customers expect; it's been developed and tuned for years. We have successful customers such as Redpharma, and if you'd like to know more about how Eon has helped them succeed in the Amazon cloud, I highly suggest reading their case study, which you can find on vertica.com. Eon provides high availability and flexible scaling; sometimes on-premise customers with local disks get a little jealous of how recovery and sub-clusters work in Eon. Eon can also operate on premise, particularly on Pure Storage. But Enterprise also has strengths, the most obvious being that you don't need any sort of shared storage to run it. So naturally, our vision is to converge the two modes back into a single Vertica: a Vertica that runs any combination of local disks and shared storage, with full flexibility and portability. This is easy to say, but over the next releases, here's what we'll do. First, we realize that the query executor, optimizer, client drivers, and so on are already the same; just the transaction handling and data management are different. But there's already more going on: we have peer-to-peer depot operations and other internode transfers. And Enterprise also has a network — we could just get files from remote nodes over that network, essentially mimicking the behavior and benefits of shared storage with a layer of software.
The only difference at the end of it will be which storage holds the master copy. In Enterprise, the nodes can't drop the files, because they're the master copy. Whereas in Eon, they can be evicted, because the depot is just a cache; the master is in shared storage. And in keeping with Vertica's current support for multiple storage locations, we can intermix these approaches at the table level. Getting there is a journey, and we've already taken the first steps. One of the interesting design ideas of the C-Store paper is the idea that redundant copies don't have to have the same physical organization: different copies can be optimized for different queries, sorted in different ways. Of course, Mike also said to keep the recovery system simple, because it's hard to debug, and whenever the recovery system is being used, it's always in a high-pressure situation. These two ideas turn out to be contradictory, and the latter idea was better. You get poorly performing queries if you don't keep the storage the same, recovery has to reorganize data in the process, and even query optimization is more complicated. So over the past couple of releases, we got rid of non-identical buddies. But the storage files can still diverge at the file level, because tuple mover operations aren't synchronized; the same record can end up in different files on different nodes. The next step in our journey is to make sure both copies are identical. This will help with backup and restore as well, because the second copy doesn't need to be backed up, or if it is backed up, it appears identical to the deduplication logic present in backup systems. Simultaneously, we're improving the Vertica networking service to support this new access pattern. In conjunction with identical storage files, we will converge to a recovery system where recovering nodes can process queries immediately, by retrieving the data they need over the network from the redundant copies, as they do in Eon today, with even higher performance. The final step, then, is to unify the catalog and transaction model. Related concepts such as segment and shard, or local catalog and shard catalog, will be coalesced, as they really represented the same concepts all along, just in different modes. In the catalog, we'll make slight changes to the definition of a projection, which represents the physical storage organization. The new definition simplifies segmentation, introduces valuable granularities of sharding to support evolution over time, and offers a straightforward migration path for both Eon and Enterprise. There's a lot more to our Eon story than just the architectural roadmap. If you missed yesterday's Vertica in Eon Mode presentation about supported cloud and on-premise storage options, replays are available. Be sure to catch the upcoming presentation on sizing and configuring Vertica in Eon Mode. As we've seen with Eon, Vertica can separate data storage from the compute nodes, allowing machines to quickly fill in for each other to rebuild fault tolerance. But separating compute and storage is useful for much, much more. We now offer powerful, flexible ways for Vertica to add servers and increase access to the data. In Vertica 9, this feature is called sub-clusters. It allows computing capacity to be added quickly and incrementally, and isolates workloads from each other.
If your exploratory analytics team needs direct access to the source data, they need a lot of machines, and not the same number all the time, and you don't 100% trust the kinds of queries and user-defined functions they might be using, sub-clusters are the solution. While there's much more extensive information available in our other presentations, I'd like to point out the highlights of our latest sub-cluster best practices. We suggest having a primary sub-cluster; this is the one that runs all the time, if you're loading data around the clock. It should be sized for the ETL workloads, and it also determines the natural shard count. Additional read-oriented secondary sub-clusters can be added for real-time dashboards, reports, and analytics. That way, sub-clusters can be added or de-provisioned without disruption to other users. The sub-cluster features of Vertica 9.3 are working well for customers. Yesterday, the Trade Desk presented their use case for Vertica, over 300,000, with 5 sub-clusters running in the cloud. If you missed the presentation, check out the replay. But we have plans beyond sub-clusters: we're extending sub-clusters to real clusters. For the Vertica savvy, this means the clusters won't share the same spread ring network. This will provide further isolation, allowing clusters to control their own independent data sets, while replicating all or part of the data from other clusters using a publish-subscribe mechanism. Synchronizing data between clusters is a feature customers want so they can realize the business value for themselves. This vision affects our design for ancillary aspects: how we assign resource pools, security policies, and balance client connections. We will be simplifying our data segmentation strategy, so that when data that originated in different clusters meet, they'll still get fully optimized joins, even if those clusters weren't provisioned with the same number of nodes per shard. Having a broad vision for data management is a key component of Vertica's success. But we also take pride in our execution strategy. When you start a new database from scratch, as we did 15 years ago, you won't compete on features. Our key competitive points were speed and scale of analytics; we set a target of 100x better query performance than traditional databases, with fast loads. Our storage architecture provides a solid foundation on which to build toward these goals. Every query starts with data retrieval: keeping data sorted, organized by column, and compressed, using adaptive encodings, keeps the data retrieval time and I/O to the bare minimum theoretically required. We also keep the data close to where it will be processed, and use clusters of machines to increase throughput. We have partition pruning and a robust optimizer, and we actively use segmentation as part of the physical database design to keep records close to the other relevant records. So that's the solid foundation, but we also need optimal execution strategies and tactics. One execution strategy we built a long time ago, but that's still a source of pride, is how we process expressions. Databases and other systems with general-purpose expression evaluators parse a compound expression into a tree. Here I'm using (a + 1) * b as an example. During execution, the CPU traverses the tree and computes sub-parts before the whole. Tree traversal often takes more compute cycles than the actual work to be done. Expression evaluation is a very common operation, so it's something worth optimizing.
One instinct that engineers have is to use what we call just-in-time, or JIT, compilation, which means generating custom code for the CPU for the specific expression at hand. This replaces the tree of generic boxes with a custom-made box for the query. This approach has complexity and bugs, but it can be made to work. It has other drawbacks, though: it adds a lot to query setup time, especially for short queries, and it pretty much eliminates the ability of mere mortals to develop user-defined functions. If you go back to the problem we're trying to solve, the source of the overhead is the tree traversal. If you increase the batch of records processed in each traversal step, this overhead is amortized until it becomes negligible. It's a perfect match for a columnar storage engine. This also sets the CPU up for efficiency: CPUs are particularly good at following the same small sequence of instructions in a tight loop. In some cases, the CPU may even be able to vectorize, and apply the same processing to multiple records with the same instruction. This approach is easy to implement and debug, user-defined functions are possible, and it's generally aligned with the other complexities of implementing and improving a large system. More importantly, the performance, both in terms of query setup and record throughput, is dramatically improved. You'll hear me say that we look to research and industry for inspiration. In this case, our findings are in line with the academic findings. If you'd like to read papers, I recommend "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask," though we did have this idea before we read that paper. However, not every decision we made in the Vertica executor stood the test of time as well as the expression evaluator. For example, sorting and grouping aren't as susceptible to vectorization, because sort decisions interrupt the flow. We have used JIT compiling there for years, since Vertica 4.0.1, and it provides modest speedups, but we know we can do even better. So we've embarked on a new design for the execution engine, which I call EE5, because it's our fifth. It's really designed especially for the cloud. Now, I know what you're thinking: you're thinking I just put up a slide with an old engine, a new engine, and a sleek plane headed up into the clouds. But this isn't just marketing hype. Here's what I mean when I say we've learned lessons over the years and we're redesigning the executor for the cloud. Of course, you'll see that the new design works well on premises as well; these changes are just more important for the cloud. Starting with the network layer: in the cloud, we can't count on all nodes being connected to the same switch. Multicast doesn't work like it does in a custom data center, so as I mentioned earlier, we're redesigning the network transfer layer for the cloud. Storage in the cloud is different, and I'm not referring here to the storage of persistent data, but to the storage of temporary data used only once during the course of query execution. Our new design takes into account the strengths and weaknesses of cloud object storage, where we can't easily do appends. Moving on to memory: many of our access patterns that are reasonably effective on bare-metal machines aren't the best choice on cloud hypervisors that have overheads, like page faults or big gaps. Here again, we found we can improve performance a bit on dedicated hardware, and even more in the cloud.
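To make the contrast concrete, here's a toy sketch of both evaluation strategies for the (a + 1) * b example, using an invented tuple encoding for the expression tree. It mirrors the idea in the talk, not Vertica's actual executor code.

```python
# Toy sketch of both strategies for (a + 1) * b. This mirrors the idea in
# the talk, not Vertica's actual executor code; the tree encoding is invented.

expr = ("*", ("+", ("col", "a"), ("const", 1)), ("col", "b"))

# Row-at-a-time: the interpreter walks the expression tree once per row,
# so the traversal overhead is paid for every single record.
def eval_tree(node, row):
    if node[0] == "col":
        return row[node[1]]
    if node[0] == "const":
        return node[1]
    left, right = eval_tree(node[1], row), eval_tree(node[2], row)
    return left + right if node[0] == "+" else left * right

# Batch-at-a-time (vectorized): each operator processes a whole column in a
# tight loop, so the traversal cost is amortized over the batch and the CPU
# repeats the same small instruction sequence, which it is very good at.
def eval_batch(node, columns):
    if node[0] == "col":
        return columns[node[1]]
    if node[0] == "const":
        n = len(next(iter(columns.values())))
        return [node[1]] * n
    left, right = eval_batch(node[1], columns), eval_batch(node[2], columns)
    op = (lambda x, y: x + y) if node[0] == "+" else (lambda x, y: x * y)
    return [op(x, y) for x, y in zip(left, right)]

print(eval_tree(expr, {"a": 1, "b": 10}))                     # 20
print(eval_batch(expr, {"a": [1, 2, 3], "b": [10, 20, 30]}))  # [20, 60, 120]
```

With a batch of a few thousand rows, the per-row interpretation cost of the tree walk shrinks to noise, which is the amortization argument made above.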
Finally, and this is true in all environments, core counts have gone up, and not all of our algorithms take full advantage. There's a lot of ground to cover here, but I think sorting is the perfect example to illustrate these points. I mentioned that we use JIT in sorting. We're getting rid of JIT in favor of a data format that can be sorted efficiently, independent of what the data types are. We've drawn on the best, most modern technology from academia and industry, and we've done our own analysis and testing. You know what we chose? We chose parallel merge sort. Anyone want to take a guess when merge sort was invented? It was invented in 1948, or at least documented that way, in a computing context. If you've heard me talk before, you know that I'm fascinated by how all the things I work with as an engineer were invented before I was born. At Vertica, we don't use the newest technologies, we use the best ones, and what is novel about Vertica is the way we've combined the best ideas together into a cohesive package. So, all kidding about the 1940s aside, our redesign is actually state of the art. How do we know the sort routine is state of the art? It turns out there's a pretty credible benchmark at the appropriately named sortbenchmark.org. Anyone with resources looking for fame for their product or academic paper can try to set the record. The record was last set in 2016 with Tencent Sort: 100 terabytes in 99 seconds. Setting the record is hard; you have to come up with hundreds of machines on a dedicated high-speed switching fabric. There's a lot to a distributed sort, but they all have core local sorting algorithms. The authors of the paper conveniently broke out the time spent in their sort: 67 out of 99 seconds went to local sorting. If we break this out, divided by the two CPUs in each of the 512 nodes, we find that each CPU sorted almost a gig and a half per second. This is for what's called an indy sort, which, like an Indy race car, isn't general purpose: it only handles fixed hundred-byte records with a 10-byte key. If the record length can vary, then it's called a daytona sort, and Tencent's daytona sort is a little slower, at just over one gigabyte per second per CPU. Now, for Vertica, we have wide variability in record sizes and more interesting data types, but still, there's no harm in sorting something like phone numbers to compare to the world record. On my 2017-era AMD desktop CPU, the Vertica EE5 sort sorts about two and a half gigabytes per second. Obviously, this isn't an apples-to-apples test, because they used their own OpenPOWER chips. But the number of DRAM channels is the same, so it's pretty close, and the number says we've hit on the right approach. And it performs this way on premise and in the cloud, and we can adapt it to cloud temp space. So what's our roadmap for integrating EE5 into the product? I compare replacing the query executor of a database to replacing the crankshaft and other parts of the engine of a car while it's being driven. We've actually done it before, between Vertica 3.5 and 5, and then we never really stopped changing it; now we'll do it again. The first part we're replacing is an algorithm called storage merge, which combines sorted data from disk. The first pieces of this are in Vertica in the upcoming 10.0 patch, which will have the EE5 resegmented storage merge, and then we'll convert sorting and grouping over as well.
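The per-CPU figure above is easy to reproduce as back-of-the-envelope arithmetic, using the benchmark numbers cited in the talk:

```python
# Back-of-the-envelope check of the per-CPU sort rate cited above,
# using the benchmark figures from the talk.
total_bytes = 100e12     # 100 terabytes sorted by Tencent Sort
local_sort_seconds = 67  # the portion of the 99 seconds spent in local sorting
nodes = 512
cpus_per_node = 2

per_cpu_rate = total_bytes / (local_sort_seconds * nodes * cpus_per_node)
print(f"{per_cpu_rate / 1e9:.2f} GB/s per CPU")  # ~1.46, "almost a gig and a half"
```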
Here are the performance results so far. In cases where the Vertica executor is doing well today, simple environments with simple data patterns, such as this simple analytic query, there's a lot of speedup, and when we ship the resegmentation code, which didn't quite make the freeze for 10.0, we expect a further bump. Longer term, as we move grouping into the storage merge operations, we'll get to where we think we ought to be, given the theoretical minimum work the CPUs need to do. Now, if we look at a case where the current executor isn't doing as well, we see there's a much stronger benefit from the code shipping in Vertica 10. In fact, I turned the bar chart sideways to try to help you see the difference better. This case will also benefit from the improvements coming in 10-point releases and beyond. There's a lot happening in the Vertica query executor, and that was just a taste. But now I'd like to switch to the roadmap for our storage access layer. I'll start with a story about how our storage access layer evolved. If you go back to the academic ideas of the C-Store paper that persuaded investors to fund Vertica, the read-optimized store was the part that had substantiation in the form of performance data. Much of the paper was speculative, but we tried to follow it anyway. That paper talked about the WS and the RS, the write store and the read store, and how they work together for transaction processing. In all honesty, Vertica engineers couldn't figure out from the paper what to do next, in case you want to try, and when we asked, we never got enough clarification to build it that way. But here's what we built instead. We built the ROS, the read-optimized store. It's sorted, columnar, and compressed, and it follows the table partitioning; it worked even better than the RS as described in the paper. We also built the WOS, the write-optimized store; we built four versions of this over the years, actually, but this was the best one. It's not a set of interrelated B-trees; it's just an append-only, insertion-order, in-memory row store: no compression, no sorting, no partitioning. There is, however, a tuple mover, which does what we call moveout: moving the data from WOS to ROS, sorting and compressing it. Let's take a moment to compare how they behave. When you load data directly to the ROS, there's a data parsing operation, then we finish the sorting, and then compress and write out the columnar data files to stable storage. The next query that executes against the ROS runs as it should, because the ROS is read optimized. Let's repeat the exercise for the WOS: the load operation responds before the sorting and compressing, and before the data is written to persistent storage. Now it's possible for a query to come along, and the query could be responsible for sorting the WOS data in addition to its other processing. The effect on queries isn't predictable until the tuple mover comes along and writes the data to the ROS. Over the years, we've done a lot of comparisons between ROS and WOS. ROS has always been better for sustained load throughput; it achieves much higher records per second without pushing back against the client, and it has ever since we developed the first usable mergeout algorithm. ROS has always been better for predictable query performance, and the ROS has never had the same management complexity and limitations as the WOS. You don't have to pick a memory size and figure out which transactions get to use the pool.
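Here's a small conceptual sketch of the two load paths and the tuple mover's moveout, just to pin down the trade-off being described. All names are invented for illustration; this is not Vertica's actual tuple mover.

```python
# Conceptual sketch of the ROS vs. WOS load paths and the tuple mover's
# moveout, as described above. All names here are invented for illustration;
# this is not Vertica's actual tuple mover.

def compress(batch):
    # Stand-in for real columnar encoding and compression.
    return tuple(batch)

class Ros:
    def __init__(self):
        self.files = []   # sorted, compressed, columnar files on stable storage

    def load(self, rows):
        # Direct ROS load: sort and compress up front, before the load returns.
        self.files.append(compress(sorted(rows)))

class Wos:
    def __init__(self):
        self.buffer = []  # unsorted, uncompressed rows held only in memory

    def load(self, rows):
        # WOS load returns quickly: no sorting, no compression, no persistence.
        self.buffer.extend(rows)

def moveout(wos, ros):
    # The tuple mover sorts and compresses the WOS contents into a new ROS
    # file. Until this runs, a query may have to sort the WOS data itself,
    # which is what makes query cost unpredictable.
    if wos.buffer:
        ros.files.append(compress(sorted(wos.buffer)))
        wos.buffer.clear()

wos, ros = Wos(), Ros()
wos.load([("MSFT", 402.1), ("AAPL", 191.2)])  # returns fast, data is unsorted
moveout(wos, ros)                             # now it is sorted, durable ROS data
```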
The non-persistent nature of the WOS always caused headaches when there were unexpected cluster shutdowns. We also looked at field usage data, and we found that few customers were using the WOS a lot, especially among those that studied the issue carefully. So we set out on a mission to improve the ROS to the point where it was always better than both the WOS and the ROS of the past. And now it's true: the ROS is better than the WOS and the ROS of a couple of years ago. We implemented storage bundling, better catalog object storage, and better tuple mover mergeouts. And now, after extensive QA and customer testing, we've succeeded, and in Vertica 10, we've removed the WOS. Let's talk for a moment about simplicity. One of the best things Mike Stonebraker said is "no knobs." Anyone want to guess how many knobs we got rid of when we took the WOS out of the product? 22. There were five knobs to control whether data went to the WOS or the ROS, six controlling the WOS itself, six more to set policies for the tuple mover moveout, and so on. In my honest opinion, it still wasn't enough control to achieve success in a multi-tenant environment, but the big reason to get rid of the WOS was simplicity: make the lives of DBAs and users better. We have a long way to go, but we're doing it. On my desk, I keep a jar with a knob in it for each knob in Vertica. When developers add a knob to the product, they have to add a knob to the jar. When they remove a knob, they get to choose one to take out. We have a lot of work to do, but I'm thrilled to report that, in 15 years, 10 is the first release where the number of knobs ticked downward. Getting back to the WOS, I've saved the most important reason to get rid of it for last: we're getting rid of it so we can deliver our vision of the future to our customers. Remember how I said that with Eon and sub-clusters, we got all these benefits from shared storage? Guess what can't live in shared storage: the WOS. Remember how a big part of the future was keeping the buddy copies identical to the primary copy? Independent actions of the WOS were at the root of the divergence between copies of the data. You have to admit it when you're wrong. The separate ROS and WOS were in the original design, and the idea held up for a while, but we held onto it for too long. In Vertica 10, we can finally bid it good riddance. I've covered a lot of ground, so let's put all the pieces together. I've talked a lot about our vision and how we're achieving it, but we also still pay attention to tactical details. We've been fine-tuning our memory management model to enhance performance. That involves revisiting tens of thousands of lines of code, much like painting the inside of a large building with small paintbrushes. We're getting results, as shown in the chart: in Vertica 9, concurrent monitoring queries used memory from the global catalog pool; in Vertica 10, they don't. This is only one example of an important detail we're improving. We've also reworked the monitoring tables and the network messages behind them, and we've increased the data we're collecting and analyzing in our quality assurance processes. We're improving on everything. As the story goes, I still have my grandfather's axe; of course, my father had to replace the handle, and I had to replace the head. Along the same lines, we still have Mike Stonebraker's Vertica. We've just replaced the query optimizer twice, and the database designer and storage layer four times each.
The query executor is on its fifth design. I charted out how our code has changed over the years, and I found that we don't have much from a long time ago. I did some digging, and you know what we have left from 2007? We have the original curly braces, and a little bit of percent code for handling dates and times. To deliver on our mission, to help customers get value from their structured data, with high performance, at scale, and in diverse deployment environments, we have a sound architectural roadmap, the best execution strategy, and solid tactics. On the architectural front, we're converging Eon and Enterprise, and we're extending smart analytic clusters. In query processing, we're redesigning the execution engine for the cloud, as I've told you. But there's a lot more than just the fast engine: if you want to learn about our new support for complex data types, improvements to the query optimizer statistics, or extensions to live aggregate projections and flattened tables, you should check out some of the other engineering talks at the Big Data Conference. We continue to stay on top of the details, from low-level CPU and memory, to monitoring and management, developing tighter feedback cycles between development, QA, and customers. And don't forget to check out the rest of the pillars of our roadmap. We have new, easier ways to get started with Vertica in the cloud. Engineers have been hard at work on machine learning and security. It's easier than ever to use Vertica with third-party products, as the variety of tool integrations continues to increase. Finally, the most important thing we can do is to help people get value from structured data, and to help people learn more about Vertica. So hopefully I've left plenty of time for Q&A at the end of this presentation. I hope to hear your questions soon.

Published Date : Mar 30 2020

Vertica @ Uber Scale


 

>> Sue: Hi, everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. This breakout session is entitled "Vertica @ Uber Scale." My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Girish Baliga, Engineering Manager of Big Data at Uber. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment in the question box below the slides and click Submit. There will be a Q and A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can also use the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. And as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded, and you'll be able to view it on demand this week. We'll send you a notification as soon as it's ready. So let's get started. Girish, over to you. >> Girish: Thanks a lot, Sue. Good afternoon, everyone. Thanks a lot for joining this session. My name is Girish Baliga, and as Sue mentioned, I manage the interactive and real-time analytics teams at Uber. Vertica is one of the main platforms that we support, and Vertica powers a lot of core business use cases. In today's talk, I wanted to cover two main things. First, how Vertica is powering critical business use cases across a variety of orgs in the company. And second, how we are able to do this at scale and with reliability, using some of the additional functionality and systems that we have built into the Vertica ecosystem at Uber. And towards the end, I also have a little extra bonus for all of you: I will be sharing an easy way for you to take advantage of many of the ideas and solutions that I'm going to present today, that you can apply to your own Vertica deployments in your companies. So stick around, put on your seat belts, and let's start the ride. At Uber, our mission is to ignite opportunity by setting the world in motion. So we are focused on solving mobility problems, and enabling people all over the world to solve their local problems, their local needs, their local issues, in a manner that's efficient, fast, and reliable. As our CEO Dara has said, we want to become the mobile operating system of local cities and communities throughout the world. As of today, Uber is operational in over 10,000 cities around the world. Across our various business lines, we have over 110 million monthly users who use our Rides services, our Eats services, and a whole bunch of other services that we provide. And just to give you a sense of the scale of our daily operations: in the Rides business, we have over 20 million trips per day, and the Eats business is also catching up, particularly during the recent times that we've been having. So I hope these numbers give you a sense of the scale of the amount of data that we process each and every day, to support our users in their analytical and business reporting needs. So who are these users at Uber? Let's take a quick look. Uber, to describe it very briefly, is a lot like Amazon: we are largely an operations and logistics company, and our employee work base reflects that.
So over 70% of our employees work in teams which come under the umbrella of Community Operations and Centers of Excellence. These are all folks working in the various cities and towns that we operate in around the world, and they run the Uber businesses as somewhat local businesses responding to local needs, local market conditions, local regulation, and so forth. And Vertica is one of the most important tools that these folks use in their day-to-day business activities. They use Vertica to get insights into how their businesses are going, to dig deeply into any issues that they want to triage, to generate reports, to plan for the future, a whole lot of use cases. The second big class of users are in our marketplace team. Marketplace is the engineering team that backs our ride-sharing business. And as part of running this business, a key problem that they have to solve is how to determine what prices to set for particular rides, so that we have a good match between supply and demand. Now obviously, the real-time pricing decisions are made by serving systems, with very detailed and well-crafted machine learning models. However, the training data that goes into these models, the historical trends, the insights that go into building these models, a lot of these things are powered by the data that we store and serve out of Vertica. Similarly, in the Eats business, we have use cases spanning all the way from engineering and back-end systems, to support, operations, incentives, growth, and a whole bunch of other domains. A big class of applications that we support across a lot of these business lines is dashboards and reporting. We have a lot of dashboards which are built by core data analyst teams and shared with a whole bunch of our operations and other teams. These are dashboards and reports that run periodically, say once a week or even once a day, depending on the frequency of data that they need. And many of these are powered by the data and the analytics support that we provide on our Vertica platform. Another big category of use cases is growth marketing. This is to understand historical trends, and to figure out how various business lines, various customer segments, various geographical areas are doing in terms of growth, and where it is necessary for us to reinvest or provide some additional incentives or marketing support, and so forth. The analysis that backs a lot of these decisions is powered by queries running on Vertica. And finally, the heart and soul of Uber is data science. Data science is how we provide best-in-class algorithms for pricing and matching. And a lot of the analysis that goes into figuring out how to build these systems, how to build the models, how to build the various coefficients and parameters that go into making real-time decisions, is based on analysis that data scientists run on Vertica systems. So as you can see, Vertica usage spans a whole bunch of organizations and users, all across the different Uber teams and ecosystems. Just to give you some quick numbers: we have over 5,000 weekly active users, people who run queries at least once a week to solve some critical business problem that they have in their day-to-day operations. So next, let's see how Vertica fits into the Uber data ecosystem. When users open up their apps and request a ride, or order food delivery on the Eats platform, the apps are talking to our serving systems.
And the serving systems use online storage systems to store the data as the trips and Eats orders are getting processed in real time. For this, we primarily use an in-house-built key-value storage system called Schemaless, and an open source system called Cassandra. We also have other systems like MySQL and Redis, which we use for storing various bits of data to support the serving systems. So all of these operations generate a lot of data that we then want to process and analyze, and use for our operational improvements. We have ingestion systems that periodically pull in data from our serving systems and land them in our data lake. At Uber, the data lake is powered by Hadoop, with files stored on HDFS clusters. Once the raw data lands in the data lake, we have ETL jobs that process these raw datasets and generate modeled and customized datasets, which we then use for further analysis. Once these modeled datasets are available, we load them into our data warehouse, which is entirely powered by Vertica. Then we have a business intelligence layer, with internal tools like QueryBuilder, which is a UI interface to write queries and look at results, and Dashbuilder, which is a dashboard-building and report management tool. These are all tools that we have built within Uber, and they can talk to Vertica and run SQL queries to power whatever dashboards and reports they are supporting. So this is what the data ecosystem looks like at Uber. So why Vertica, and what does it really do for us? It powers the insights that we show on dashboards, and it also powers the reports that we run periodically. But more importantly, we have some core properties and core feature sets that Vertica provides, which allow us to support many of these use cases very well and at scale. So let me take a brief tour of what these are. As I mentioned, Vertica powers Uber's data warehouse. What this means is that we load our core fact and dimension tables onto Vertica. The core fact tables are all the trips, all the Eats orders, and all the other line items for the various businesses from Uber, stored as partitioned tables. So think of having one partition per day, as well as dimension tables like cities, users, riders, driver partners, and so forth. We have both these kinds of datasets, which we load into Vertica, and we have full historical data, all the way from when we launched these businesses to today, so that folks can do deeper longitudinal analysis. They can look at patterns like how the business has grown from month to month, year to year, the same month over a year, over multiple years, and so forth. And the really powerful thing about Vertica is that most of these queries, even the deep longitudinal queries, run very, very fast. And that's really why we love Vertica: we see query latency P90s, that is, the 90th percentile of all queries that we run on our platform, typically finish in under a minute. That's very important for us, because Vertica is used primarily for interactive analytics use cases, and providing SQL query execution times under a minute is critical for our users and business owners to get the most out of analytics and Big Data platforms. Vertica also provides a few advanced features that we use very heavily. As you might imagine, at Uber, one of the most important sets of use cases we have is around geospatial analytics.
In particular, we have some critical internal dashboards that rely very heavily on being able to restrict datasets by geographic areas, cities, source-destination pairs, heat maps, and so forth. And Vertica has a rich array of geospatial functions that we use very heavily. We also have support for custom projections in Vertica, and this really helps us get very good performance for critical datasets. For instance, on some of our core fact tables, we have done a lot of query analysis to figure out how users run their queries, what kinds of columns they use, what combinations of columns they use, and what joins they do for typical queries. And then we have laid out our custom projections to maximize performance on these particular dimensions. The ability to do that through Vertica is very valuable for us. We've also had some very successful collaborations with the Vertica engineering team. About a year and a half back, we open-sourced a Python client that we had built in house to talk to Vertica. We were using this Python client in the business intelligence layer that I showed on the previous slide, and we open-sourced it after working closely with the engineering team. Now Vertica formally supports the Python client as an open-source project, which you can download and integrate into your systems. Another more recent example of collaboration is Vertica Eon mode on GCP. As most of you, or at least some of you, know, Vertica Eon mode is formally supported on AWS. And at Uber, we were also looking to see if we could run our data infrastructure on GCP. So the Vertica team hustled on this and provided us an early preview version, which we've been testing out to see how performance is impacted by running on the cloud, and on GCP. And so far, I think things are going pretty well, but we should have some numbers about this very soon. So here I have a visualization of an internal dashboard that is powered solely by data and queries running on Vertica. This GIF cycles through the different visualizations supported by this tool. For instance, here you see a heat map of the sources of traffic demand for ride shares, and then you'll see a bunch of arrows about source-destination pairs and the trip lines, and then you can see how demand moves around. As it cycles through the various animations, you can basically see all the different kinds of insights and query shapes that we send to Vertica, which powers this critical business dashboard for our operations teams. All right, so now, how do we do all of this at scale? We started off with a single Vertica cluster a few years back. We had our data lake, and the data would land into Vertica; these are the core fact and dimension tables that I just spoke about. And then Vertica powers queries at our business intelligence layer, right? So this is a very simple and effective architecture for most use cases. But at Uber scale, we ran into a few problems. The first issue that we have is that Uber is a pretty big company at this point, with a lot of users sending almost millions of queries every week. And at that scale, what we began to see was that a single cluster was not able to handle all the query traffic. For those of you who have taken an introductory course on queueing theory, you will realize that, basically, even though you could have all the queries processed through a single serving system, you will tend to see larger and larger queue wait times as the number of queries piles up.
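To put a rough number on that queueing effect, here's a toy M/M/1 calculation. This is a deliberate simplification (a real Vertica cluster is not an M/M/1 queue), and the service rate below is an assumed figure, not an Uber statistic.

```python
# Toy M/M/1 illustration of the queueing effect just described.
# Mean time in system: W = 1 / (mu - lambda), which blows up as the
# arrival rate approaches the service rate.
service_rate = 60.0  # queries the cluster can finish per hour (assumed)

for arrival_rate in [30, 45, 54, 58, 59.4]:
    utilization = arrival_rate / service_rate
    wait_minutes = 60.0 / (service_rate - arrival_rate)
    print(f"utilization {utilization:.0%}: mean time in system {wait_minutes:.0f} min")
# At 50% utilization a query spends ~2 minutes in the system; at 99%,
# ~100 minutes, even though the execution time itself never changed.
```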
And what this means in practice for end users is that they are basically just seeing longer and longer query latencies. Even though the actual query execution time on Vertica itself is probably less than a minute, their query sits in the queue for a bunch of minutes, and that's the latency the end user perceives. So this was a huge problem for us. The second problem we had was that the cluster becomes a single point of failure. Now, Vertica can handle single node failures very gracefully, and it can probably also handle two or three node failures, depending on your cluster size and your application. But very soon you will see that, when you get beyond a certain number of failures or nodes in maintenance, your cluster will probably need to be restarted, or you will start seeing some downtime due to other issues. Another example of why you would have to have downtime is when you're upgrading software in your clusters. Essentially, we're a global company, and we have users all around the world; we really cannot afford to have downtime, even for a one-hour slot. So that turned out to be a big problem for us. And as I mentioned, we could have hardware issues. We might need to upgrade our machines, or we might need to replace storage or memory due to issues with the hardware in there, due to normal wear and tear, or due to abnormal issues. And so, because of all of these things, having a single point of failure, having a single cluster, was not really practical for us. So the next thing we did was set up multiple clusters, right? We had a bunch of identical clusters, all of which have the same datasets. We would basically load data using ingestion pipelines from our data lake onto each of these clusters, and then the business intelligence layer would be able to query any of these clusters. This actually solved most of the issues that I pointed out in the previous slide. We no longer had a single point of failure. Anytime we had to do version upgrades, we would just take one cluster offline and upgrade the software on it. If we had node failures, we would probably just take out one cluster, if we had to, or we would just have some spare nodes which would rotate into our production clusters, and so forth. However, having multiple clusters led to a new set of issues. The first problem was that, since we have multiple clusters, you would end up with inconsistent schemas. One of the things to understand about our platform is that we are an infrastructure team, so we don't actually own or manage any of the data that is served on the Vertica clusters. We have dataset owners and publishers who manage their own datasets. Now, exposing multiple clusters to these dataset owners turns out to be not such a great idea, right? Because they are not really aware of the importance of having consistency of schemas and datasets across different clusters. So over time, what we saw was that the schema for the same tables would basically get out of sync, because the updates were not consistently applied on all clusters. Or maybe they were just experimenting with some new columns or some new tables in one cluster, but they forgot to delete them, whatever the case might be. We basically ended up in a situation where we saw a lot of inconsistent schemas, even across some of our core tables, in our different clusters.
A second issue was, since we had ingestion pipelines that were ingesting data independently into all these clusters, these pipelines could fail independently as well. What this meant is that if, for instance, the ingestion pipeline into cluster B failed, then the data there would be older than clusters A and C. So when a query comes in from the BI layer, and if it happens to hit B, you would probably see different results than you would if it went to A or C. And this was obviously not an ideal situation for our end users, because they would end up seeing slightly inconsistent, slightly different counts. That would lead to a bad situation for them, where they would not be able to fully trust the data, and the results and insights that were being returned by the SQL queries and Vertica systems. And then the third problem was that we had a lot of extra replication. The 20/80 rule, or maybe even the 90/10 rule, applies to datasets on our clusters as well: less than 10% of our datasets figure in 90% of the queries, right? So it doesn't really make sense for us to replicate all of our data on all the clusters, and having a setup where we had to do that was obviously very suboptimal for us. So then what we did was build some additional systems to solve these problems. This brings us to the Vertica ecosystem that we have in production today. On the ingestion side, we built a system called Vertica Data Manager, which basically manages all the ingestion into the various clusters. At this point, people who are managing datasets, the dataset owners and publishers, no longer have to be aware of individual clusters. They just set up their ingestion pipelines with an endpoint in Vertica Data Manager, and Vertica Data Manager ensures that all the schemas and data are consistent across all our clusters. And on the query side, we built a proxy layer. What this ensures is that, when queries come in from the BI layer, the query is forwarded smartly, with knowledge and data about which clusters are up, which clusters are down, which clusters are available, which clusters are loaded, and so forth. So with these two layers of abstraction between our ingestion and our query, we were able to have a very consistent, almost single-system view of our entire Vertica deployment. And the third bit we had put in place was the data manifest, which is the communication mechanism between ingestion and proxy. The data manifest basically is a listing of which tables are available on which clusters, which clusters are up to date, and so forth. With this ecosystem in place, we were also able to solve the extra replication problem. So now we basically have some big clusters where all the core tables, and all the tables, in fact, are served. Any query that hits the less heavily queried 90% of tables goes to the big clusters. And most of the queries, which hit the 10% of heavily queried, important tables, can also be served by a number of other small clusters, so it's a much more efficient use of resources. So this basically is the view that we have today of Vertica within Uber. External to our team, folks just have an endpoint where they set up their ingestion jobs, and another endpoint where they can forward their Vertica SQL queries, and those are sent to a proxy layer. So let's get into a little more detail about each of these layers. On the data management side, as I mentioned, we have two kinds of tables. First, we have dimension tables.
These dimension tables are updated every cycle: the list of cities, the list of drivers, the list of users, and so forth. These change not so frequently, maybe once a day or so, and since these datasets are not very big, we basically swap them out on every single cycle. Whereas the fact tables, the tables which have information about our trips or Eats orders and so forth, are partitioned. We have roughly one partition per day for the last couple of years, and then we have more of a hierarchical partition setup for older data. What we do is load the partitions for the last three days on every cycle. The reason we do that is because not all our data comes in at the same time. We have updates for trips going back over the past two or three days, for instance, where people add ratings to their trips, or provide feedback for drivers, and so forth, and we want to capture them all in the row corresponding to that particular trip. So we reload partitions for the last few days to make sure we capture all those updates. And we also update older partitions, if, for instance, records were deleted for retention purposes, or for GDPR purposes, or for other regulatory reasons. We do this less frequently, but these partitions are also updated if necessary; there are endpoints which allow dataset owners to specify what partitions they want to update. And as I mentioned, data is typically managed using a hierarchical partitioning scheme. In this way, we are able to make sure that we take advantage of the data being clustered by day, so that we don't have to update all the data at once. When we are recovering from a cluster event, like a version or software upgrade, a hardware fix, failure handling, or even when we are adding a new cluster to the system, the data manager takes care of updating the tables and copying all the new partitions, making sure the schemas are all right. Then we check the data and schema consistency and make sure everything is up to date before we add this cluster to our serving pool and the proxy starts sending traffic to it. The second thing that the data manager provides is consistency. The main thing we do here is atomic updates of our tables and partitions for fact tables, using a two-phase commit scheme. What we do is load all the new data into temp tables, on all the clusters, in phase one. And then, when all the clusters give us success signals, we promote them to primary and set them as the main serving tables for incoming queries. We also optimize the load using Vertica Data Copy. What this means is that earlier, in a parallel pipelines scheme, we had to ingest data individually from HDFS clusters into each of the Vertica clusters, and that took a lot of HDFS bandwidth. But using this nice feature that Vertica provides called Vertica Data Copy, we just load the data into one cluster and then much more efficiently copy it to the other clusters. This has significantly reduced our ingestion overheads and sped up our load process. And as I mentioned, as the second phase of the commit, all data is promoted at the same time. Finally, we make sure that all the data is up to date by doing some checks around the number of rows and various other key signals for freshness and correctness, which we compare with the data in the data lake. In terms of schema changes, VDM automatically applies these consistently across all the clusters.
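Here's a simplified sketch of that two-phase "load to temp, then promote" scheme. The cluster interface (load_temp_table, swap_partition, drop_table) is an assumed API invented for the example, not Uber's actual Vertica Data Manager code.

```python
# Simplified sketch of the two-phase "load to temp, then promote" scheme.
# The cluster interface (load_temp_table, swap_partition, drop_table) is an
# assumed API invented for the example, not Uber's actual VDM code.
def atomic_partition_update(clusters, table, partition, new_data):
    staged = []
    # Phase one: load the new partition into a temp table on every cluster.
    for cluster in clusters:
        tmp = f"{table}__staging_{partition}"
        try:
            cluster.load_temp_table(tmp, new_data)   # assumed API
            staged.append((cluster, tmp))
        except Exception:
            # Any failure aborts the whole update before users see anything.
            for c, t in staged:
                c.drop_table(t)                      # assumed API
            raise

    # Phase two: every cluster succeeded, so promote the temp tables to
    # primary on all clusters together. Queries flip to the new data at the
    # same time, so results stay consistent whichever cluster the proxy picks.
    for cluster, tmp in staged:
        cluster.swap_partition(table, partition, tmp)  # assumed API
```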
So first, we stage these changes to make sure that they are correct. This catches errors like trying to do an incompatible update, such as changing a column type or something like that. So we make sure that schema changes are validated, and then we apply them to all clusters atomically, again for consistency, to provide an overall consistent view of our data to all our users. On the proxy side, we have transparent support for replicated clusters for all our users. The way we handle that is, as I mentioned, the cluster-to-table mapping is maintained in the manifest database. When we have an incoming query, the proxy is able to see which clusters have all the tables in that query, and route the query to the appropriate cluster based on the manifest information. The proxy is also aware of the health of individual clusters. So if, for some reason, a cluster is down for maintenance or upgrades, the proxy is aware of this information, and it does monitoring based on query response and execution times as well. It uses this information to route queries to healthy clusters and do some load balancing, to ensure that we avoid hotspots on various clusters. So the key takeaways that I have from this talk are primarily these: we started off with a single-cluster mode on Vertica, and we ran into a bunch of issues around scaling and availability due to cluster downtime. We then set up a bunch of replicated clusters to handle the scaling and availability issues, and we ran into issues around schema consistency, data staleness, and data replication. So we built an entire ecosystem around Vertica, with abstraction layers around data management and ingestion, and a proxy. And with this setup, we were able to enforce consistency and improve storage utilization. So hopefully this gives you all a brief idea of how we have been able to scale Vertica usage at Uber and power some of our most business-critical and important use cases. As I mentioned at the beginning, I have an interesting and simple extra update for you: an easy way in which you all can take advantage of many of the features that we have built into our ecosystem is to use the Vertica Eon mode. The Vertica Eon mode allows you to set up multiple clusters with consistent data updates, and set them up at various different sizes to handle different query loads, and it automatically handles many of these issues that I mentioned in our ecosystem. So do check it out. We've also been trying it out on GCP, and initial results look very, very promising. So thank you all for joining me in this talk today. I hope you learned something new, and hopefully you took away something that you can also apply to your systems. We have some time left for questions, so I'll pause for now and take any questions.
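To illustrate the manifest-driven routing just described, here's a minimal sketch. The manifest contents, table names, and health set are invented for the example, and a real proxy would also weigh cluster load and recent response times.

```python
# Minimal sketch of manifest-driven routing. The manifest contents, table
# names, and health set are invented for the example; a real proxy would
# also weigh cluster load and recent response times.
import random

manifest = {
    # table -> clusters holding an up-to-date copy of that table
    "trips":  {"big-1", "big-2", "small-1"},
    "cities": {"big-1", "big-2", "small-1", "small-2"},
}
healthy = {"big-1", "small-1", "small-2"}  # say big-2 is down for an upgrade

def route(query_tables):
    # A cluster is eligible only if it is healthy and serves fresh copies
    # of every table the query references.
    eligible = set(healthy)
    for table in query_tables:
        eligible &= manifest[table]
    if not eligible:
        raise RuntimeError("no healthy, up-to-date cluster for this query")
    return random.choice(sorted(eligible))

print(route({"trips", "cities"}))  # "big-1" or "small-1"
```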

Published Date : Mar 30 2020


A Technical Overview of Vertica Architecture


 

>> Paige: Hello, everybody and thank you for joining us today on the Virtual Vertica BDC 2020. Today's breakout session is entitled A Technical Overview of the Vertica Architecture. I'm Paige Roberts, Open Source Relations Manager at Vertica and I'll be your host for this webinar. Now joining me is Ryan Role-kuh? Did I say that right? (laughs) He's a Vertica Senior Software Engineer. >> Ryan: So it's Roelke. (laughs) >> Paige: Roelke, okay, I got it, all right. Ryan Roelke. And before we begin, I want to be sure and encourage you guys to submit your questions or your comments during the virtual session while Ryan is talking as you think of them as you go along. You don't have to wait to the end, just type in your question or your comment in the question box below the slides and click submit. There'll be a Q and A at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to get back to you offline. Now, alternatively, you can visit the Vertica forums to post your question there after the session as well. Our engineering team is planning to join the forums to keep the conversation going, so you can have a chat afterwards with the engineer, just like any other conference. Now also, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides and before you ask, yes, this virtual session is being recorded and it will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now, let's get started. Over to you, Ryan. >> Ryan: Thanks, Paige. Good afternoon, everybody. My name is Ryan and I'm a Senior Software Engineer on Vertica's Development Team. I primarily work on improving Vertica's query execution engine, so usually in the space of making things faster. Today, I'm here to talk about something that's more general than that, so we're going to go through a technical overview of the Vertica architecture. So the intent of this talk, essentially, is to just explain some of the basic aspects of how Vertica works and what makes it such a great database software and to explain what makes a query execute so fast in Vertica, we'll provide some background to explain why other databases don't keep up. And we'll use that as a starting point to discuss an academic database that paved the way for Vertica. And then we'll explain how Vertica design builds upon that academic database to be the great software that it is today. I want to start by sharing somebody's approximation of an internet minute at some point in 2019. All of the data on this slide is generated by thousands or even millions of users and that's a huge amount of activity. Most of the applications depicted here are backed by one or more databases. Most of this activity will eventually result in changes to those databases. For the most part, we can categorize the way these databases are used into one of two paradigms. First up, we have online transaction processing or OLTP. OLTP workloads usually operate on single entries in a database, so an update to a retail inventory or a change in a bank account balance are both great examples of OLTP operations. Updates to these data sets must be visible immediately and there could be many transactions occurring concurrently from many different users. OLTP queries are usually key value queries. The key uniquely identifies the single entry in a database for reading or writing. 
Early databases and applications were probably designed for OLTP workloads. This example on the slide is typical of an OLTP workload. We have a table, accounts, such as for a bank, which tracks information for each of the bank's clients. An update query, like the one depicted here, might be run whenever a user deposits $10 into their bank account. Our second category is online analytical processing or OLAP which is more about using your data for decision making. If you have a hardware device which periodically records how it's doing, you could analyze trends of all your devices over time to observe what data patterns are likely to lead to failure or if you're Google, you might log user search activity to identify which links helped your users find the answer. Analytical processing has always been around but with the advent of the internet, it happened at scales that were unimaginable, even just 20 years ago. This SQL example is something you might see in an OLAP workload. We have a table, searches, logging user activity. We will eventually see one row in this table for each query submitted by users. If we want to find out what time of day our users are most active, then we could write a query like this one on the slide which counts the number of unique users running searches for each hour of the day. So now let's rewind to 2005. We don't have a picture of an internet minute in 2005, we don't have the data for that. We also don't have the data for a lot of other things. The term Big Data is not quite yet on anyone's radar and The Cloud is also not quite there or it's just starting to be. So if you have a database serving your application, it's probably optimized for OLTP workloads. OLAP workloads just aren't mainstream yet and database engineers probably don't have them in mind. So let's innovate. It's still 2005 and we want to try something new with our database. Let's take a look at what happens when we do run an analytic workload in 2005. Let's use as a motivating example a table of stock prices over time. In our table, the symbol column identifies the stock that was traded, the price column identifies the new price and the timestamp column indicates when the price changed. We have several other columns which, we should know that they're there, but we're not going to use them in any example queries. This table is designed for analytic queries. We're probably not going to make any updates or look at individual rows since we're logging historical data and want to analyze changes in stock price over time. Our database system is built to serve OLTP use cases, so it's probably going to store the table on disk in a single file like this one. Notice that each row contains all of the columns of our data in row major order. There's probably an index somewhere in the memory of the system which will help us to point lookups. Maybe our system expects that we will use the stock symbol and the trade time as lookup keys. So an index will provide quick lookups for those columns to the position of the whole row in the file. If we did have an update to a single row, then this representation would work great. We would seek to the row that we're interested in, finding it would probably be very fast using the in-memory index. And then we would update the file in place with our new value. On the other hand, if we ran an analytic query like we want to, the data access pattern is very different. The index is not helpful because we're looking up a whole range of rows, not just a single row. 
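The slide queries themselves aren't captured in the transcript, but representative statements are easy to guess at. The following are plausible reconstructions with hypothetical table and column names, not the actual slide text.

```python
# Plausible reconstructions of the two slide queries described above. The
# actual slide text is not in the transcript, so the table and column names
# here are guesses at representative OLTP and OLAP statements.

# OLTP: touch exactly one row, located via a key.
oltp_update = """
    UPDATE accounts
       SET balance = balance + 10
     WHERE account_id = 42;
"""

# OLAP: scan a whole range of rows and aggregate them.
olap_report = """
    SELECT EXTRACT(HOUR FROM search_time) AS hour_of_day,
           COUNT(DISTINCT user_id)        AS active_users
      FROM searches
     GROUP BY 1
     ORDER BY 1;
"""
```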
So the only way to find the rows that we actually need for this query is to scan the entire file. We're going to end up scanning a lot of data that we don't need, and it won't just be the rows we don't need: there are many other columns in this table, like information about who made the transaction, and we'll be scanning through those columns for every single row as well. That could be a very serious problem once we consider the scale of this file. Stocks change a lot; we probably have thousands or millions or maybe even billions of rows stored in this file, and we're going to scan all of these extra columns for every single row. If we tried out our stocks use case behind the desk of a Fortune 500 company, we'd probably be pretty disappointed. Our queries would eventually finish, but they might take so long that we don't even care about the answer anymore by the time they do. Our database is not built for the task we want to use it for.

Around the same time, a team of researchers in the Northeast had become aware of this problem and decided to dedicate their time and research to it. These researchers weren't just anybody. The fruit of their labor, which we now like to call the C-Store paper, was published by eventual Turing Award winner Mike Stonebraker, along with several other researchers from elite universities. This paper presents the design of a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized. That sounds exactly like what we want for our stocks use case.

Reasoning about what makes our query executions so slow brought the researchers to the memory hierarchy, which is essentially a visualization of the relative speeds of different parts of a computer. At the top of the hierarchy, we have the fastest units of storage, which are, of course, also the most expensive to produce. As we move down the hierarchy, components get slower but also much cheaper, and thus you can have more of them. Our OLTP database's data is stored in a file on the hard disk. We scanned the entirety of this file even though we didn't need most of the data, and it turns out that is just about the slowest thing our query could possibly be doing, by over two orders of magnitude. It should be clear, based on that, that the best thing we can do to optimize our query's execution is to avoid reading unnecessary data from the disk, and that's what the C-Store researchers decided to look at.

The key innovation of the C-Store paper does exactly that. Instead of storing data in row-major order in a large file on disk, they transposed the data and stored each column in its own file. Now, if we run the same select query, we read only the relevant columns. The unnamed columns don't factor into the table scan at all, since we don't even open those files. Zooming out to an internet-scale data set, we can appreciate the savings here a lot more. But we still have to read a lot of data that we don't need to answer this particular query. Remember, we had two predicates: one on the symbol column and one on the timestamp column. Our query is only interested in AAPL stock, but we're still reading rows for all of the other stocks. So what can we do to optimize our disk reads even more? Let's first partition our data set into different files based on the timestamp date. This means that we will keep separate files for each date.
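As a rough illustration of that storage layout, with hypothetical file names, the transposed, date-partitioned table might live on disk like this:

    stocks/2005-03-01/symbol.dat      -- AAPL, AAPL, MSFT, ...
    stocks/2005-03-01/price.dat       -- 119.98, 120.01, 25.10, ...
    stocks/2005-03-01/trade_ts.dat    -- 09:30:01, 09:30:04, ...
    stocks/2005-03-02/symbol.dat      -- the next day's partition
    ...

A query that touches only symbol and price for March 1st opens just two of these files and can ignore every other partition entirely.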
When we query the stocks table, the database knows all of the files it would have to open. If we have a simple predicate on the timestamp column, as our sample query does, the database can use it to figure out which files we don't have to look at at all. So now all of the disk reads we do to answer our query will produce rows that pass the timestamp predicate. This eliminates a lot of wasteful disk reads, but not all of them. We have another predicate on the symbol column, where symbol equals AAPL, and we'd like to avoid disk reads of rows that don't satisfy that predicate either.

We can avoid those disk reads by clustering all of the rows that match the symbol predicate together. If all of the AAPL rows are adjacent, then as soon as we see something different, we can stop reading the file; we won't see any more rows that can pass the predicate. Then we can use the positions of the rows we did find to identify which pieces of the other columns we need to read. One technique we can use to cluster the rows is sorting, so we'll use the symbol column as a sort key for all of the columns. That way we can reconstruct a whole row by seeking to the same row position in each file.

It turns out that, having sorted all of the rows, we can do a bit more. We don't have any more wasted disk reads, but we can still be more efficient with how we're using the disk. We've clustered all of the rows with the same symbol together, so we don't really need to bother repeating the symbol so many times in the same file. Let's just write the value once and say how many rows we have. This run-length encoding technique can compress large numbers of rows into a small amount of space. In this example, we de-duplicate just a few rows, but you can imagine de-duplicating many thousands of rows instead. This encoding is great for reducing the amount of data we need to read at query time, but it also has the additional benefit of reducing the total size of our stored data.

Now our query requires substantially fewer disk reads than it did when we started. Let's recap what the C-Store paper did to achieve that. First, we transposed our data to store each column in its own file; now queries only have to read the columns used in the query. Second, we partitioned the data into multiple file sets so that all rows in a file have the same value for the partition column; now a predicate on the partition column can skip non-matching file sets entirely. Third, we selected a column of our data to use as a sort key; now rows with the same value for that column are clustered together, which allows our query to stop reading data once it finds non-matching rows. Finally, sorting the data this way enables high compression ratios, using run-length encoding, which minimizes the size of the data stored on disk.

The C-Store system combined each of these innovative ideas to produce an academically significant result, and if you had used it behind the desk of a Fortune 500 company in 2005, you probably would have been pretty pleased. But it's not 2005 anymore, and the requirements of a modern database system are much stricter. So let's take a look at how C-Store fares in 2020. First of all, we have designed the storage layer of our database to optimize a single query in a single application. Our design optimizes the heck out of that query, and probably some similar ones, but if we want to do anything else with our data, we might be in a bit of trouble. What if we just decide we want to ask a different question?
For example, in our stock example, what if we want to plot all the trades made by a single trader over a large window of time? How do our optimizations for the previous query measure up here? Well, our data is partitioned on the trade date; that could still be useful, depending on our new query. If we want to look at a trader's activity over a long period of time, we would have to open a lot of files, but if we're still interested in just a day's worth of data, then this optimization still helps. Within each file, our data is ordered on the stock symbol. That's probably not too useful anymore; the rows for a single trader aren't going to be clustered together, so we will have to scan all of the rows to figure out which ones match. You could imagine a worse design, but if it becomes crucial to optimize this new type of query, we might have to go as far as reconfiguring the whole database.

The next problem is one of scale. One server is probably not good enough to serve a database in 2020. C-Store, as described, runs on a single server and stores lots of files. What if the data overwhelms this small system? We could imagine exhausting the file system's inode limit with lots of small files due to our partitioning scheme, or we could imagine something simpler: just filling up the disk with huge volumes of data. But there's an even simpler problem than that. What if something goes wrong and C-Store crashes? Then our data is no longer available to us until the single server is brought back up.

A third concern, this one of flexibility, is that one deployment does not suit all possible use cases we could imagine. A contemporary database system has to integrate with many other applications, which might themselves have pretty restricted deployment options. Or the demands imposed by our workloads change, and the setup we had before doesn't suit what we need now. C-Store doesn't do anything to address these concerns.

What the C-Store paper did do was lead very quickly to the founding of Vertica. Vertica's architecture and design are essentially all about bringing the C-Store designs into an enterprise software system. The C-Store paper was just an academic exercise, so it didn't really need to address any of the hard problems that we just talked about. But Vertica, the first commercial database built upon the ideas of the C-Store paper, would definitely have to.

This brings us back to the present, to look at how an analytic query runs in 2020 on the Vertica Analytic Database. Vertica takes the key idea from the paper, that we can significantly improve query performance by changing the way our data is stored, and gives its users the tools to customize their storage layer in order to heavily optimize really important or commonly run queries. On top of that, Vertica is a distributed system, which allows it to scale up to internet-sized data sets as well as have better reliability and uptime. We'll now take a brief look at what Vertica does to address the three inadequacies of the C-Store system that we mentioned. To avoid locking into a single database design, Vertica provides tools for the database user to customize the way their data is stored. To address the shortcomings of a single-node system, Vertica coordinates processing among multiple nodes.
To acknowledge the large variety of desirable deployments, Vertica does not require any specialized hardware and has many features which smoothly integrate it with a Cloud computing environment.

First, we'll look at the database design problem. We're a SQL database, so our users are writing SQL and describing their data the SQL way, with the Create Table statement. Create Table is a logical description of what your data looks like, but it doesn't specify the way that it has to be stored. For a single Create Table, we could imagine a lot of different storage layouts. Vertica adds some extensions to SQL so that users can go even further than Create Table and describe the way that they want the data to be stored. Using terminology from the C-Store paper, we provide the Create Projection statement. Create Projection specifies how table data should be laid out, including column encoding and sort order. A table can have multiple projections, each of which could be ordered on different columns. When you query a table, Vertica will answer the query using the projection which it determines to be the best match.

Referring back to our stock example, here's a sample Create Table and Create Projection statement. Let's focus on our heavily optimized example query, which had predicates on the stock symbol and date. We specify that the table data is to be partitioned by date. The Create Projection statement here is excellent for this query: we specify, using the order by clause, that the data should be ordered according to our predicates, with the timestamp as a secondary sort key. Each projection stores a copy of the table data. If you don't expect to need a particular column in a projection, you can leave it out. Our average price query didn't care about who did the trading, so maybe our projection design for this query can leave the trader column out entirely. If the question we want to ask ever does change, maybe we already have a suitable projection; if we don't, then we can create another one. This example shows another projection which would be much better at identifying trends of traders rather than trends for a particular stock (see the sketch of these statements below).

So how should you decide what design is best for your queries? Well, you could spend a lot of time figuring it out on your own, or you could use Vertica's Database Designer tool, which will help you by automatically analyzing your queries and spitting out a design which it thinks is going to work really well. If you want to learn more about the Database Designer tool, you should attend the session Vertica Database Designer: Today and Tomorrow, which will tell you a lot about what the Database Designer does and some recent improvements we have made.

Okay, now we'll move to our next problem (laughs): the challenge that one server does not fit all. In 2020, we have several orders of magnitude more data than we had in 2005, and you need a lot more hardware to crunch it. It's not tractable to keep multiple petabytes of data in a system with a single server, so Vertica doesn't try. Vertica is a distributed system, so we'll deploy multiple servers which work together to maintain such a high data volume. In a traditional Vertica deployment, each node keeps some of the data in its own locally-attached storage. Data is replicated so that there is a redundant copy somewhere else in the system. If any one node goes down, the data that it served is still available on a different node.
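Here is that sketch: a hedged reconstruction of the kind of Create Table and Create Projection statements described above, not the actual slides. Column names and types are assumptions, and the segmentation and KSAFE clauses preview ideas discussed next:

    CREATE TABLE stocks (
        symbol    VARCHAR(16),
        price     NUMERIC(12,4),
        trade_ts  TIMESTAMP,
        trader    VARCHAR(64)
        -- ...other columns we won't query
    )
    PARTITION BY trade_ts::DATE;  -- one file set per trade date

    -- Optimized for: AVG(price) WHERE symbol = 'AAPL' AND trade_ts ...
    -- The trader column is left out entirely.
    CREATE PROJECTION stocks_by_symbol AS
        SELECT symbol, price, trade_ts
        FROM stocks
        ORDER BY symbol, trade_ts          -- primary and secondary sort keys
        SEGMENTED BY HASH(symbol) ALL NODES
        KSAFE 1;                           -- keep one redundant copy

    -- A second projection, better suited to per-trader queries.
    CREATE PROJECTION stocks_by_trader AS
        SELECT trader, symbol, price, trade_ts
        FROM stocks
        ORDER BY trader, trade_ts
        SEGMENTED BY HASH(trader) ALL NODES
        KSAFE 1;

Each projection is a physical copy of the table's data with its own layout, and Vertica picks whichever one best matches an incoming query.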
We'll also have it so that there's no special node in the system with extra duties; all nodes are created equal. This ensures that there is no single point of failure. Rather than replicate all of your data on every node, Vertica divvies it up amongst all of the nodes in your system. We call this segmentation. The way data is segmented is another parameter of storage customization, and it can definitely have an impact on query performance. A common way to segment data is by using a hash expression, which essentially randomizes the node that a row of data belongs to, but with a guarantee that the same data will always end up in the same place. Describing the way data is segmented is another part of the Create Projection statement, as seen in this example. Here we segment on the hash of the symbol column, so all rows with the same symbol will end up on the same node. For each row that we load into the system, we'll apply our segmentation expression. The result determines which segment the row belongs to, and then we'll send the row to each node which holds a copy of that segment. In this example, our projection is marked KSAFE 1, so we will keep one redundant copy of each segment. When we load a row, we might find that its segment has copies on Node One and Node Three, so we'll send a copy of the row to each of those nodes. If Node One is temporarily disconnected from the network, Node Three can serve the other copy of the segment so that the whole system remains available.

The last challenge we brought up from the C-Store design was that one deployment does not fit all. Vertica's cluster design neatly addresses many of our concerns here. Our use of segmentation to distribute data means that a Vertica system can scale to any size of deployment, and since we don't require any special hardware or nodes with special purposes, Vertica servers can run anywhere, on premise or in the Cloud. But let's suppose you need to scale out your cluster to rise to the demands of a higher workload. Suppose you want to add another node. This changes the division of the segmentation space: we'll have to re-segment every row in the database to find its new home, and then we'll have to move around any data that belongs to a different segment. This is a very expensive operation, not something you want to be doing all that often. Traditional Vertica doesn't solve that problem especially well, but Vertica Eon Mode definitely does. Vertica's Eon Mode is a large set of features designed with a Cloud computing environment in mind. One feature of this design is elastic throughput scaling, which is the idea that you can smoothly change your cluster size without having to pay the expense of shuffling your entire database. Vertica Eon Mode had an entire session dedicated to it this morning. I won't say any more about it here, but maybe you already attended that session, and if you haven't, I definitely encourage you to listen to the recording.

If you'd like to learn more about the Vertica architecture, you'll find on this slide links to several academic conference publications: these four papers here, as well as the Vertica, Seven Years Later paper, which describes Vertica's design seven years after the founding, and also a paper about the innovations of Eon Mode. And of course, the Vertica documentation is an excellent resource for learning more about what's going on in a Vertica system. I hope you enjoyed learning about the Vertica architecture. I would be very happy to take all of your questions now.
Thank you for attending this session.

Published Date : Mar 30 2020
