Lie 1, The Most Effective Data Architecture Is Centralized | Starburst

(bright upbeat music) >> In 2011, early Facebook employee and Cloudera co-founder Jeff Hammerbacher famously said, "The best minds of my generation are thinking about how to get people to click on ads, and that sucks!" Let's face it. More than a decade later, organizations continue to be frustrated with how difficult it is to get value from data and build a truly agile and data-driven enterprise. What does that even mean, you ask? Well, it means that everyone in the organization has the data they need when they need it in a context that's relevant to advance the mission of an organization. Now, that could mean cutting costs, could mean increasing profits, driving productivity, saving lives, accelerating drug discovery, making better diagnoses, solving supply chain problems, predicting weather disasters, simplifying processes, and thousands of other examples where data can completely transform people's lives beyond manipulating internet users to behave a certain way. We've heard the prognostications about the possibilities of data before and in fairness we've made progress, but the hard truth is the original promises of master data management, enterprise data warehouses, data marts, data hubs, and yes even data lakes were broken and left us wanting for more. Welcome to The Data Doesn't Lie... Or Does It? A series of conversations produced by theCUBE and made possible by Starburst Data. I'm your host, Dave Vellante, and joining me today are three industry experts. Justin Borgman is the co-founder and CEO of Starburst, Richard Jarvis is the CTO at EMIS Health, and Teresa Tung is cloud first technologist at Accenture. Today, we're going to have a candid discussion that will expose the unfulfilled, and yes, broken promises of a data past. We'll expose data lies: big lies, little lies, white lies, and hidden truths. And we'll challenge, age old data conventions and bust some data myths. We're debating questions like is the demise of a single source of truth inevitable? Will the data warehouse ever have feature parity with the data lake or vice versa? Is the so-called modern data stack simply centralization in the cloud, AKA the old guards model in new cloud close? How can organizations rethink their data architectures and regimes to realize the true promises of data? Can and will an open ecosystem deliver on these promises in our lifetimes? We're spanning much of the Western world today. Richard is in the UK, Teresa is on the West Coast, and Justin is in Massachusetts with me. I'm in theCUBE studios, about 30 miles outside of Boston. Folks, welcome to the program. Thanks for coming on. >> Thanks for having us. >> Okay, let's get right into it. You're very welcome. Now, here's the first lie. The most effective data architecture is one that is centralized with a team of data specialists serving various lines of business. What do you think Justin? >> Yeah, definitely a lie. My first startup was a company called Hadapt, which was an early SQL engine for IDU that was acquired by Teradata. And when I got to Teradata, of course, Teradata is the pioneer of that central enterprise data warehouse model. One of the things that I found fascinating was that not one of their customers had actually lived up to that vision of centralizing all of their data into one place. They all had data silos. They all had data in different systems. They had data on prem, data in the cloud. Those companies were acquiring other companies and inheriting their data architecture. So despite being the industry leader for 40 years, not one of their customers truly had everything in one place. So I think definitely history has proven that to be a lie. >> So Richard, from a practitioner's point of view, what are your thoughts? I mean, there's a lot of pressure to cut cost, keep things centralized, serve the business as best as possible from that standpoint. What does your experience show? >> Yeah, I mean, I think I would echo Justin's experience really that we as a business have grown up through acquisition, through storing data in different places sometimes to do information governance in different ways to store data in a platform that's close to data experts people who really understand healthcare data from pharmacies or from doctors. And so, although if you were starting from a greenfield site and you were building something brand new, you might be able to centralize all the data and all of the tooling and teams in one place. The reality is that businesses just don't grow up like that. And it's just really impossible to get that academic perfection of storing everything in one place. >> Teresa, I feel like Sarbanes-Oxley have kind of saved the data warehouse, right? (laughs) You actually did have to have a single version of the truth for certain financial data, but really for some of those other use cases I mentioned, I do feel like the industry has kind of let us down. What's your take on this? Where does it make sense to have that sort of centralized approach versus where does it make sense to maybe decentralize? >> I think you got to have centralized governance, right? So from the central team, for things like Sarbanes-Oxley, for things like security, for certain very core data sets having a centralized set of roles, responsibilities to really QA, right? To serve as a design authority for your entire data estate, just like you might with security, but how it's implemented has to be distributed. Otherwise, you're not going to be able to scale, right? So being able to have different parts of the business really make the right data investments for their needs. And then ultimately, you're going to collaborate with your partners. So partners that are not within the company, right? External partners. We're going to see a lot more data sharing and model creation. And so you're definitely going to be decentralized. >> So Justin, you guys last, jeez, I think it was about a year ago, had a session on data mesh. It was a great program. You invited Zhamak Dehghani. Of course, she's the creator of the data mesh. One of our fundamental premises is that you've got this hyper specialized team that you've got to go through if you want anything. But at the same time, these individuals actually become a bottleneck, even though they're some of the most talented people in the organization. So I guess, a question for you Richard. How do you deal with that? Do you organize so that there are a few sort of rock stars that build cubes and the like or have you had any success in sort of decentralizing with your constituencies that data model? >> Yeah. So we absolutely have got rockstar data scientists and data guardians, if you like. People who understand what it means to use this data, particularly the data that we use at EMIS is very private, it's healthcare information. And some of the rules and regulations around using the data are very complex and strict. So we have to have people who understand the usage of the data, then people who understand how to build models, how to process the data effectively. And you can think of them like consultants to the wider business because a pharmacist might not understand how to structure a SQL query, but they do understand how they want to process medication information to improve patient lives. And so that becomes a consulting type experience from a set of rock stars to help a more decentralized business who needs to understand the data and to generate some valuable output. >> Justin, what do you say to a customer or prospect that says, "Look, Justin. I got a centralized team and that's the most cost effective way to serve the business. Otherwise, I got duplication." What do you say to that? >> Well, I would argue it's probably not the most cost effective, and the reason being really twofold. I think, first of all, when you are deploying a enterprise data warehouse model, the data warehouse itself is very expensive, generally speaking. And so you're putting all of your most valuable data in the hands of one vendor who now has tremendous leverage over you for many, many years to come. I think that's the story at Oracle or Teradata or other proprietary database systems. But the other aspect I think is that the reality is those central data warehouse teams, as much as they are experts in the technology, they don't necessarily understand the data itself. And this is one of the core tenets of data mesh that Zhamak writes about is this idea of the domain owners actually know the data the best. And so by not only acknowledging that data is generally decentralized, and to your earlier point about Sarbanes-Oxley, maybe saving the data warehouse, I would argue maybe GDPR and data sovereignty will destroy it because data has to be decentralized for those laws to be compliant. But I think the reality is the data mesh model basically says data's decentralized and we're going to turn that into an asset rather than a liability. And we're going to turn that into an asset by empowering the people that know the data the best to participate in the process of curating and creating data products for consumption. So I think when you think about it that way, you're going to get higher quality data and faster time to insight, which is ultimately going to drive more revenue for your business and reduce costs. So I think that that's the way I see the two models comparing and contrasting. >> So do you think the demise of the data warehouse is inevitable? Teresa, you work with a lot of clients. They're not just going to rip and replace their existing infrastructure. Maybe they're going to build on top of it, but what does that mean? Does that mean the EDW just becomes less and less valuable over time or it's maybe just isolated to specific use cases? What's your take on that? >> Listen, I still would love all my data within a data warehouse. I would love it mastered, would love it owned by a central team, right? I think that's still what I would love to have. That's just not the reality, right? The investment to actually migrate and keep that up to date, I would say it's a losing battle. Like we've been trying to do it for a long time. Nobody has the budgets and then data changes, right? There's going to be a new technology that's going to emerge that we're going to want to tap into. There's going to be not enough investment to bring all the legacy, but still very useful systems into that centralized view. So you keep the data warehouse. I think it's a very, very valuable, very high performance tool for what it's there for, but you could have this new mesh layer that still takes advantage of the things I mentioned: the data products in the systems that are meaningful today, and the data products that actually might span a number of systems. Maybe either those that either source systems with the domains that know it best, or the consumer-based systems or products that need to be packaged in a way that'd be really meaningful for that end user, right? Each of those are useful for a different part of the business and making sure that the mesh actually allows you to use all of them. >> So, Richard, let me ask you. Take Zhamak's principles back to those. You got the domain ownership and data as product. Okay, great. Sounds good. But it creates what I would argue are two challenges: self-serve infrastructure, let's park that for a second, and then in your industry, one of the most regulated, most sensitive, computational governance. How do you automate and ensure federated governance in that mesh model that Teresa was just talking about? >> Well, it absolutely depends on some of the tooling and processes that you put in place around those tools to centralize the security and the governance of the data. And I think although a data warehouse makes that very simple 'cause it's a single tool, it's not impossible with some of the data mesh technologies that are available. And so what we've done at EMIS is we have a single security layer that sits on top of our data mesh, which means that no matter which user is accessing which data source, we go through a well audited, well understood security layer. That means that we know exactly who's got access to which data field, which data tables. And then everything that they do is audited in a very kind of standard way regardless of the underlying data storage technology. So for me, although storing the data in one place might not be possible, understanding where your source of truth is and securing that in a common way is still a valuable approach, and you can do it without having to bring all that data into a single bucket so that it's all in one place. And so having done that and investing quite heavily in making that possible has paid dividends in terms of giving wider access to the platform, and ensuring that only data that's available under GDPR and other regulations is being used by the data users. >> Yeah. So Justin, we always talk about data democratization, and up until recently, they really haven't been line of sight as to how to get there, but do you have anything to add to this because you're essentially doing analytic queries with data that's all dispersed all over. How are you seeing your customers handle this challenge? >> Yeah, I mean, I think data products is a really interesting aspect of the answer to that. It allows you to, again, leverage the data domain owners, the people who know the data the best, to create data as a product ultimately to be consumed. And we try to represent that in our product as effectively, almost eCommerce like experience where you go and discover and look for the data products that have been created in your organization, and then you can start to consume them as you'd like. And so really trying to build on that notion of data democratization and self-service, and making it very easy to discover and start to use with whatever BI tool you may like or even just running SQL queries yourself. >> Okay guys, grab a sip of water. After the short break, we'll be back to debate whether proprietary or open platforms are the best path to the future of data excellence. Keep it right there. (bright upbeat music)

Published Date : Aug 22 2022

SUMMARY :

has the data they need when they need it Now, here's the first lie. has proven that to be a lie. of pressure to cut cost, and all of the tooling have kind of saved the data So from the central team, for that build cubes and the like and to generate some valuable output. and that's the most cost effective way is that the reality is those of the data warehouse is inevitable? and making sure that the mesh one of the most regulated, most sensitive, and processes that you put as to how to get there, aspect of the answer to that. or open platforms are the best path

ENTITIES

Entity	Category	Confidence
Dave Vellante	PERSON	0.99+
Richard	PERSON	0.99+
Justin Borgman	PERSON	0.99+
Justin	PERSON	0.99+
Richard Jarvis	PERSON	0.99+
Teresa Tung	PERSON	0.99+
Jeff Hammerbacher	PERSON	0.99+
Teresa	PERSON	0.99+
Teradata	ORGANIZATION	0.99+
Oracle	ORGANIZATION	0.99+
Massachusetts	LOCATION	0.99+
Zhamak Dehghani	PERSON	0.99+
UK	LOCATION	0.99+
2011	DATE	0.99+
two challenges	QUANTITY	0.99+
Hadapt	ORGANIZATION	0.99+
40 years	QUANTITY	0.99+
Starburst	ORGANIZATION	0.99+
two models	QUANTITY	0.99+
thousands	QUANTITY	0.99+
Boston	LOCATION	0.99+
Facebook	ORGANIZATION	0.99+
Sarbanes-Oxley	ORGANIZATION	0.99+
Each	QUANTITY	0.99+
first lie	QUANTITY	0.99+
Accenture	ORGANIZATION	0.99+
GDPR	TITLE	0.99+
Today	DATE	0.98+
today	DATE	0.98+
SQL	TITLE	0.98+
Starburst Data	ORGANIZATION	0.98+
EMIS Health	ORGANIZATION	0.98+
Cloudera	ORGANIZATION	0.98+
one	QUANTITY	0.98+
first startup	QUANTITY	0.98+
one place	QUANTITY	0.98+
about 30 miles	QUANTITY	0.98+
One	QUANTITY	0.97+
More than a decade later	DATE	0.97+
EMIS	ORGANIZATION	0.97+
single bucket	QUANTITY	0.97+
first technologist	QUANTITY	0.96+
three industry experts	QUANTITY	0.96+
single tool	QUANTITY	0.96+
single version	QUANTITY	0.94+
Zhamak	PERSON	0.92+
theCUBE	ORGANIZATION	0.91+
single source	QUANTITY	0.9+
West Coast	LOCATION	0.87+
one vendor	QUANTITY	0.84+
single security layer	QUANTITY	0.81+
about a year ago	DATE	0.75+
IDU	ORGANIZATION	0.68+
Is	TITLE	0.65+
a second	QUANTITY	0.64+
EDW	ORGANIZATION	0.57+
examples	QUANTITY	0.55+
echo	COMMERCIAL_ITEM	0.54+
twofold	QUANTITY	0.5+
Lie	TITLE	0.35+

Breaking Analysis: How JPMC is Implementing a Data Mesh Architecture on the AWS Cloud

>> From theCUBE studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is braking analysis with Dave Vellante. >> A new era of data is upon us, and we're in a state of transition. You know, even our language reflects that. We rarely use the phrase big data anymore, rather we talk about digital transformation or digital business, or data-driven companies. Many have come to the realization that data is a not the new oil, because unlike oil, the same data can be used over and over for different purposes. We still use terms like data as an asset. However, that same narrative, when it's put forth by the vendor and practitioner communities, includes further discussions about democratizing and sharing data. Let me ask you this, when was the last time you wanted to share your financial assets with your coworkers or your partners or your customers? Hello everyone, and welcome to this week's Wikibon Cube Insights powered by ETR. In this breaking analysis, we want to share our assessment of the state of the data business. We'll do so by looking at the data mesh concept and how a leading financial institution, JP Morgan Chase is practically applying these relatively new ideas to transform its data architecture. Let's start by looking at what is the data mesh. As we've previously reported many times, data mesh is a concept and set of principles that was introduced in 2018 by Zhamak Deghani who's director of technology at ThoughtWorks, it's a global consultancy and software development company. And she created this movement because her clients, who were some of the leading firms in the world had invested heavily in predominantly monolithic data architectures that had failed to deliver desired outcomes in ROI. So her work went deep into trying to understand that problem. And her main conclusion that came out of this effort was the world of data is distributed and shoving all the data into a single monolithic architecture is an approach that fundamentally limits agility and scale. Now a profound concept of data mesh is the idea that data architectures should be organized around business lines with domain context. That the highly technical and hyper specialized roles of a centralized cross functional team are a key blocker to achieving our data aspirations. This is the first of four high level principles of data mesh. So first again, that the business domain should own the data end-to-end, rather than have it go through a centralized big data technical team. Second, a self-service platform is fundamental to a successful architectural approach where data is discoverable and shareable across an organization and an ecosystem. Third, product thinking is central to the idea of data mesh. In other words, data products will power the next era of data success. And fourth data products must be built with governance and compliance that is automated and federated. Now there's lot more to this concept and there are tons of resources on the web to learn more, including an entire community that is formed around data mesh. But this should give you a basic idea. Now, the other point is that, in observing Zhamak Deghani's work, she is deliberately avoided discussions around specific tooling, which I think has frustrated some folks because we all like to have references that tie to products and tools and companies. So this has been a two-edged sword in that, on the one hand it's good, because data mesh is designed to be tool agnostic and technology agnostic. On the other hand, it's led some folks to take liberties with the term data mesh and claim mission accomplished when their solution, you know, maybe more marketing than reality. So let's look at JP Morgan Chase in their data mesh journey. Is why I got really excited when I saw this past week, a team from JPMC held a meet up to discuss what they called, data lake strategy via data mesh architecture. I saw that title, I thought, well, that's a weird title. And I wondered, are they just taking their legacy data lakes and claiming they're now transformed into a data mesh? But in listening to the presentation, which was over an hour long, the answer is a definitive no, not at all in my opinion. A gentleman named Scott Hollerman organized the session that comprised these three speakers here, James Reid, who's a divisional CIO at JPMC, Arup Nanda who is a technologist and architect and Serita Bakst who is an information architect, again, all from JPMC. This was the most detailed and practical discussion that I've seen to date about implementing a data mesh. And this is JP Morgan's their approach, and we know they're extremely savvy and technically sound. And they've invested, it has to be billions in the past decade on data architecture across their massive company. And rather than dwell on the downsides of their big data past, I was really pleased to see how they're evolving their approach and embracing new thinking around data mesh. So today, we're going to share some of the slides that they use and comment on how it dovetails into the concept of data mesh that Zhamak Deghani has been promoting, and at least as we understand it. And dig a bit into some of the tooling that is being used by JP Morgan, particularly around it's AWS cloud. So the first point is it's all about business value, JPMC, they're in the money business, and in that world, business value is everything. So Jr Reid, the CIO showed this slide and talked about their overall goals, which centered on a cloud first strategy to modernize the JPMC platform. I think it's simple and sensible, but there's three factors on which he focused, cut costs always short, you got to do that. Number two was about unlocking new opportunities, or accelerating time to value. But I was really happy to see number three, data reuse. That's a fundamental value ingredient in the slide that he's presenting here. And his commentary was all about aligning with the domains and maximizing data reuse, i.e. data is not like oil and making sure there's appropriate governance around that. Now don't get caught up in the term data lake, I think it's just how JP Morgan communicates internally. It's invested in the data lake concept, so they use water analogies. They use things like data puddles, for example, which are single project data marts or data ponds, which comprise multiple data puddles. And these can feed in to data lakes. And as we'll see, JPMC doesn't strive to have a single version of the truth from a data standpoint that resides in a monolithic data lake, rather it enables the business lines to create and own their own data lakes that comprise fit for purpose data products. And they do have a single truth of metadata. Okay, we'll get to that. But generally speaking, each of the domains will own end-to-end their own data and be responsible for those data products, we'll talk about that more. Now the genesis of this was sort of a cloud first platform, JPMC is leaning into public cloud, which is ironic since the early days, in the early days of cloud, all the financial institutions were like never. Anyway, JPMC is going hard after it, they're adopting agile methods and microservices architectures, and it sees cloud as a fundamental enabler, but it recognizes that on-prem data must be part of the data mesh equation. Here's a slide that starts to get into some of that generic tooling, and then we'll go deeper. And I want to make a couple of points here that tie back to Zhamak Deghani's original concept. The first is that unlike many data architectures, this puts data as products right in the fat middle of the chart. The data products live in the business domains and are at the heart of the architecture. The databases, the Hadoop clusters, the files and APIs on the left-hand side, they serve the data product builders. The specialized roles on the right hand side, the DBA's, the data engineers, the data scientists, the data analysts, we could have put in quality engineers, et cetera, they serve the data products. Because the data products are owned by the business, they inherently have the context that is the middle of this diagram. And you can see at the bottom of the slide, the key principles include domain thinking, an end-to-end ownership of the data products. They build it, they own it, they run it, they manage it. At the same time, the goal is to democratize data with a self-service as a platform. One of the biggest points of contention of data mesh is governance. And as Serita Bakst said on the Meetup, metadata is your friend, and she kind of made a joke, she said, "This sounds kind of geeky, but it's important to have a metadata catalog to understand where data resides and the data lineage in overall change management. So to me, this really past the data mesh stink test pretty well. Let's look at data as products. CIO Reid said the most difficult thing for JPMC was getting their heads around data product, and they spent a lot of time getting this concept to work. Here's the slide they use to describe their data products as it related to their specific industry. They set a common language and taxonomy is very important, and you can imagine how difficult that was. He said, for example, it took a lot of discussion and debate to define what a transaction was. But you can see at a high level, these three product groups around wholesale, credit risk, party, and trade and position data as products, and each of these can have sub products, like, party, we'll have to know your customer, KYC for example. So a key for JPMC was to start at a high level and iterate to get more granular over time. So lots of decisions had to be made around who owns the products and the sub-products. The product owners interestingly had to defend why that product should even exist, what boundaries should be in place and what data sets do and don't belong in the various products. And this was a collaborative discussion, I'm sure there was contention around that between the lines of business. And which sub products should be part of these circles? They didn't say this, but tying it back to data mesh, each of these products, whether in a data lake or a data hub or a data pond or data warehouse, data puddle, each of these is a node in the global data mesh that is discoverable and governed. And supporting this notion, Serita said that, "This should not be infrastructure-bound, logically, any of these data products, whether on-prem or in the cloud can connect via the data mesh." So again, I felt like this really stayed true to the data mesh concept. Well, let's look at some of the key technical considerations that JPM discussed in quite some detail. This chart here shows a diagram of how JP Morgan thinks about the problem, and some of the challenges they had to consider were how to write to various data stores, can you and how can you move data from one data store to another? How can data be transformed? Where's the data located? Can the data be trusted? How can it be easily accessed? Who has the right to access that data? These are all problems that technology can help solve. And to address these issues, Arup Nanda explained that the heart of this slide is the data in ingestor instead of ETL. All data producers and contributors, they send their data to the ingestor and the ingestor then registers the data so it's in the data catalog. It does a data quality check and it tracks the lineage. Then, data is sent to the router, which persists the data in the data store based on the best destination as informed by the registration. This is designed to be a flexible system. In other words, the data store for a data product is not fixed, it's determined at the point of inventory, and that allows changes to be easily made in one place. The router simply reads that optimal location and sends it to the appropriate data store. Nowadays you see the schema infer there is used when there is no clear schema on right. In this case, the data product is not allowed to be consumed until the schema is inferred, and then the data goes into a raw area, and the inferer determines the schema and then updates the inventory system so that the data can be routed to the proper location and properly tracked. So that's some of the detail of how the sausage factory works in this particular use case, it was very interesting and informative. Now let's take a look at the specific implementation on AWS and dig into some of the tooling. As described in some detail by Arup Nanda, this diagram shows the reference architecture used by this group within JP Morgan, and it shows all the various AWS services and components that support their data mesh approach. So start with the authorization block right there underneath Kinesis. The lake formation is the single point of entitlement and has a number of buckets including, you can see there the raw area that we just talked about, a trusted bucket, a refined bucket, et cetera. Depending on the data characteristics at the data catalog registration block where you see the glue catalog, that determines in which bucket the router puts the data. And you can see the many AWS services in use here, identity, the EMR, the elastic MapReduce cluster from the legacy Hadoop work done over the years, the Redshift Spectrum and Athena, JPMC uses Athena for single threaded workloads and Redshift Spectrum for nested types so they can be queried independent of each other. Now remember very importantly, in this use case, there is not a single lake formation, rather than multiple lines of business will be authorized to create their own lakes, and that creates a challenge. So how can that be done in a flexible and automated manner? And that's where the data mesh comes into play. So JPMC came up with this federated lake formation accounts idea, and each line of business can create as many data producer or consumer accounts as they desire and roll them up into their master line of business lake formation account. And they cross-connect these data products in a federated model. And these all roll up into a master glue catalog so that any authorized user can find out where a specific data element is located. So this is like a super set catalog that comprises multiple sources and syncs up across the data mesh. So again to me, this was a very well thought out and practical application of database. Yes, it includes some notion of centralized management, but much of that responsibility has been passed down to the lines of business. It does roll up to a master catalog, but that's a metadata management effort that seems compulsory to ensure federated and automated governance. As well at JPMC, the office of the chief data officer is responsible for ensuring governance and compliance throughout the federation. All right, so let's take a look at some of the suspects in this world of data mesh and bring in the ETR data. Now, of course, ETR doesn't have a data mesh category, there's no such thing as that data mesh vendor, you build a data mesh, you don't buy it. So, what we did is we use the ETR dataset to select and filter on some of the culprits that we thought might contribute to the data mesh to see how they're performing. This chart depicts a popular view that we often like to share. It's a two dimensional graphic with net score or spending momentum on the vertical axis and market share or pervasiveness in the data set on the horizontal axis. And we filtered the data on sectors such as analytics, data warehouse, and the adjacencies to things that might fit into data mesh. And we think that these pretty well reflect participation that data mesh is certainly not all compassing. And it's a subset obviously, of all the vendors who could play in the space. Let's make a few observations. Now as is often the case, Azure and AWS, they're almost literally off the charts with very high spending velocity and large presence in the market. Oracle you can see also stands out because much of the world's data lives inside of Oracle databases. It doesn't have the spending momentum or growth, but the company remains prominent. And you can see Google Cloud doesn't have nearly the presence in the dataset, but it's momentum is highly elevated. Remember that red dotted line there, that 40% line, anything over that indicates elevated spending momentum. Let's go to Snowflake. Snowflake is consistently shown to be the gold standard in net score in the ETR dataset. It continues to maintain highly elevated spending velocity in the data. And in many ways, Snowflake with its data marketplace and its data cloud vision and data sharing approach, fit nicely into the data mesh concept. Now, a caution, Snowflake has used the term data mesh in it's marketing, but in our view, it lacks clarity, and we feel like they're still trying to figure out how to communicate what that really is. But is really, we think a lot of potential there to that vision. Databricks is also interesting because the firm has momentum and we expect further elevated levels in the vertical axis in upcoming surveys, especially as it readies for its IPO. The firm has a strong product and managed service, and is really one to watch. Now we included a number of other database companies for obvious reasons like Redis and Mongo, MariaDB, Couchbase and Terradata. SAP as well is in there, but that's not all database, but SAP is prominent so we included them. As is IBM more of a database, traditional database player also with the big presence. Cloudera includes Hortonworks and HPE Ezmeral comprises the MapR business that HPE acquired. So these guys got the big data movement started, between Cloudera, Hortonworks which is born out of Yahoo, which was the early big data, sorry early Hadoop innovator, kind of MapR when it's kind of owned course, and now that's all kind of come together in various forms. And of course, we've got Talend and Informatica are there, they are two data integration companies that are worth noting. We also included some of the AI and ML specialists and data science players in the mix like DataRobot who just did a monster $250 million round. Dataiku, H2O.ai and ThoughtSpot, which is all about democratizing data and injecting AI, and I think fits well into the data mesh concept. And you know we put VMware Cloud in there for reference because it really is the predominant on-prem infrastructure platform. All right, let's wrap with some final thoughts here, first, thanks a lot to the JP Morgan team for sharing this data. I really want to encourage practitioners and technologists, go to watch the YouTube of that meetup, we'll include it in the link of this session. And thank you to Zhamak Deghani and the entire data mesh community for the outstanding work that you're doing, challenging the established conventions of monolithic data architectures. The JPM presentation, it gives you real credibility, it takes Data Mesh well beyond concept, it demonstrates how it can be and is being done. And you know, this is not a perfect world, you're going to start somewhere and there's going to be some failures, the key is to recognize that shoving everything into a monolithic data architecture won't support massive scale and agility that you're after. It's maybe fine for smaller use cases in smaller firms, but if you're building a global platform in a data business, it's time to rethink data architecture. Now much of this is enabled by the cloud, but cloud first doesn't mean cloud only, doesn't mean you'll leave your on-prem data behind, on the contrary, you have to include non-public cloud data in your Data Mesh vision just as JPMC has done. You've got to get some quick wins, that's crucial so you can gain credibility within the organization and grow. And one of the key takeaways from the JP Morgan team is, there is a place for dogma, like organizing around data products and domains and getting that right. On the other hand, you have to remain flexible because technologies is going to come, technology is going to go, so you got to be flexible in that regard. And look, if you're going to embrace the metaphor of water like puddles and ponds and lakes, we suggest maybe a little tongue in cheek, but still we believe in this, that you expand your scope to include data ocean, something John Furry and I have talked about and laughed about extensively in theCUBE. Data oceans, it's huge. It's the new data lake, go transcend data lake, think oceans. And think about this, just as we're evolving our language, we should be evolving our metrics. Much the last the decade of big data was around just getting the stuff to work, getting it up and running, standing up infrastructure and managing massive, how much data you got? Massive amounts of data. And there were many KPIs built around, again, standing up that infrastructure, ingesting data, a lot of technical KPIs. This decade is not just about enabling better insights, it's a more than that. Data mesh points us to a new era of data value, and that requires the new metrics around monetizing data products, like how long does it take to go from data product conception to monetization? And how does that compare to what it is today? And what is the time to quality if the business owns the data, and the business has the context? the quality that comes out of them, out of the shoot should be at a basic level, pretty good, and at a higher mark than out of a big data team with no business context. Automation, AI, and very importantly, organizational restructuring of our data teams will heavily contribute to success in the coming years. So we encourage you, learn, lean in and create your data future. Okay, that's it for now, remember these episodes, they're all available as podcasts wherever you listen, all you got to do is search, breaking analysis podcast, and please subscribe. Check out ETR's website at etr.plus for all the data and all the survey information. We publish a full report every week on wikibon.com and siliconangle.com. And you can get in touch with us, email me david.vellante@siliconangle.com, you can DM me @dvellante, or you can comment on my LinkedIn posts. This is Dave Vellante for theCUBE insights powered by ETR. Have a great week everybody, stay safe, be well, and we'll see you next time. (upbeat music)

Published Date : Jul 12 2021

SUMMARY :

This is braking analysis and the adjacencies to things

ENTITIES

Entity	Category	Confidence
JPMC	ORGANIZATION	0.99+
Dave Vellante	PERSON	0.99+
2018	DATE	0.99+
Zhamak Deghani	PERSON	0.99+
James Reid	PERSON	0.99+
JP Morgan	ORGANIZATION	0.99+
JP Morgan	ORGANIZATION	0.99+
Cloudera	ORGANIZATION	0.99+
Serita Bakst	PERSON	0.99+
IBM	ORGANIZATION	0.99+
HPE	ORGANIZATION	0.99+
AWS	ORGANIZATION	0.99+
Scott Hollerman	PERSON	0.99+
Hortonworks	ORGANIZATION	0.99+
Boston	LOCATION	0.99+
40%	QUANTITY	0.99+
JP Morgan Chase	ORGANIZATION	0.99+
Serita	PERSON	0.99+
Yahoo	ORGANIZATION	0.99+
Arup Nanda	PERSON	0.99+
each	QUANTITY	0.99+
ThoughtWorks	ORGANIZATION	0.99+
first	QUANTITY	0.99+
Oracle	ORGANIZATION	0.99+
Palo Alto	LOCATION	0.99+
david.vellante@siliconangle.com	OTHER	0.99+
each line	QUANTITY	0.99+
Terradata	ORGANIZATION	0.99+
Redis	ORGANIZATION	0.99+
$250 million	QUANTITY	0.99+
first point	QUANTITY	0.99+
three factors	QUANTITY	0.99+
Second	QUANTITY	0.99+
MapR	ORGANIZATION	0.99+
today	DATE	0.99+
Informatica	ORGANIZATION	0.99+
Talend	ORGANIZATION	0.99+
John Furry	PERSON	0.99+
Zhamak Deghani	PERSON	0.99+
first platform	QUANTITY	0.98+
YouTube	ORGANIZATION	0.98+
fourth	QUANTITY	0.98+
single	QUANTITY	0.98+
One	QUANTITY	0.98+
Third	QUANTITY	0.97+
Couchbase	ORGANIZATION	0.97+
three speakers	QUANTITY	0.97+
two data	QUANTITY	0.97+
first strategy	QUANTITY	0.96+
one	QUANTITY	0.96+
one place	QUANTITY	0.96+
Jr Reid	PERSON	0.96+
single lake	QUANTITY	0.95+
SAP	ORGANIZATION	0.95+
wikibon.com	OTHER	0.95+
siliconangle.com	OTHER	0.94+
Azure	ORGANIZATION	0.93+

UNLIST TILL 4/2 - The Next-Generation Data Underlying Architecture

>> Paige: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled, Vertica next generation architecture. I'm Paige Roberts, open social relationship Manager at Vertica, I'll be your host for this session. And joining me is Vertica Chief Architect, Chuck Bear, before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait, just type your question or comment, in the question box that's below the slides and click submit. So as you think about it, go ahead and type it in, there'll be a Q&A session at the end of the presentation, where we'll answer as many questions, as we're able to during the time. Any questions that we don't get a chance to address, we'll do our best to answer offline. Or alternatively, you can visit the Vertica forums to post your questions there, after the session. Our engineering team is planning to join the forum and keep the conversation going, so you can, it's just sort of like the developers lounge would be in delight conference. It gives you a chance to talk to our engineering team. Also, as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And before you ask, yes, this virtual session is being recorded, and it will be available to view on demand this week, we'll send you a notification, as soon as it's ready. Okay, now, let's get started, over to you, Chuck. >> Chuck: Thanks for the introduction, Paige, Vertica vision is to help customers, get value from structured data. This vision is simple, it doesn't matter what vertical the customer is in. They're all analytics companies, it doesn't matter what the customers environment is, as data is generated everywhere. We also can't do this alone, we know that you need other tools and people to build a complete solution. You know our database is key to delivering on the vision because we need a database that scales. When you start a new database company, you aren't going to win against 30 year old products on features. But from day one, we had something else, an architecture built for analytics performance. This architecture was inspired by the C-store project, combining the best design ideas from academics and industry veterans like Dr. Mike Stonebreaker. Our storage is optimized for performance, we use many computers in parallel. After over 10 years of refinements against various customer workloads, much of the design held up and serendipitously, the fact that we don't store in place updates set Vertica up for success in the cloud as well. These days, there are other tools that embody some of these design ideas. But we have other strengths that are more important than the storage format, where the only good analytics database that runs both on premise and in the cloud, giving customers the option to migrate their workloads, in most convenient and economical environment, or a full data management solution, not just the query tool. Unlike some other choices, ours comes with integration with a sequel ecosystem and full professional support. We organize our product roadmap into four key pillars, plus the cross cutting concerns of open integration and performance and scale. We have big plans to strengthen Vertica, while staying true to our core. This presentation is primarily about the separation pillar, and performance and scale, I'll cover our plans for Eon, our data management architecture, Mart analytic clusters, or fifth generation query executer, and our data storage layer. Let's start with how Vertica manages data, one of the central design points for Vertica was shared nothing, a design that didn't utilize a dedicated hardware shared disk technology. This quote here is how Mike put it politely, but around the Vertica office, shared disk with an LMTB over Mike's dead body. And we did get some early field experience with shared disk, customers, well, in fact will learn on anything if you let them. There were misconfigurations that required certified experts, obscure bugs extent. Another thing about the shared nothing designed for commodity hardware though, and this was in the papers, is that all the data management features like fault tolerance, backup and elasticity have to be done in software. And no matter how much you do, procuring, configuring and maintaining the machines with disks is harder. The software configuration process to add more service may be simple, but capacity planning, racking and stacking is not. The original allure of shared storage returned, this time though, the complexity and economics are different. It's cheaper, even provision storage with a few clicks and only pay for what you need. It expands, contracts and brings the maintenance of the storage close to a team is good at it. But there's a key difference, it's an object store, an object stores don't support the API's and access patterns used by most database software. So another Vertica visionary Ben, set out to exploit Vertica storage organization, which turns out to be a natural fit for modern cloud shared storage. Because Vertica data files are written once and not updated, they match the object storage model perfectly. And so today we have Eon, Eon uses shared storage to hold Vertica data with local disk depot's that act as caches, ensuring that we can get the performance that our customers have come to expect. Essentially Eon in enterprise behave similarly, but we have the benefit of flexible storage. Today Eon has the features our customers expect, it's been developed in tune for years, we have successful customers such as Redpharma, and if you'd like to know more about Eon has helped them succeed in Amazon cloud, I highly suggest reading their case study, which you can find on vertica.com. Eon provides high availability and flexible scaling, sometimes on premise customers with local disks get a little jealous of how recovery and sub-clusters work in Eon. Though we operate on premise, particularly on pure storage, but enterprise also had strengths, the most obvious being that you don't need and short shared storage to run it. So naturally, our vision is to converge the two modes, back into a single Vertica. A Vertica that runs any combination of local disks and shared storage, with full flexibility and portability. This is easy to say, but over the next releases, here's what we'll do. First, we realize that the query executer, optimizer and client drivers and so on, are already the same. Just the transaction handling and data management is different. But there's already more going on, we have peer-to-peer depot operations and other internode transfers. And enterprise also has a network, we could just get files from remote nodes over that network, essentially mimicking the behavior and benefits of shared storage with the layer of software. The only difference at the end of it, will be which storage hold the master copy. In enterprise, the nodes can't drop the files because they're the master copy. Whereas in Eon they can be evicted because it's just the cache, the masters, then shared storage. And in keeping with versus current support for multiple storage locations, we can intermix these approaches at the table level. Getting there as a journey, and we've already taken the first steps. One of the interesting design ideas of the C-store paper is the idea that redundant copies, don't have to have the same physical organization. Different copies can be optimized for different queries, sorted in different ways. Of course, Mike also said to keep the recovery system simple, because it's hard to debug, whenever the recovery system is being used, it's always in a high pressure situation. This turns out to be a contradiction, and the latter idea was better. No down performing stuff, if you don't keep the storage the same. Recovery hardware if you have, to reorganize data in the process. Even query optimization is more complicated. So over the past couple releases, we got rid of non identical buddies. But the storage files can still diverge at the fifth level, because tuple mover operations are synchronized. The same record can end up in different files than different nodes. The next step in our journey, is to make sure both copies are identical. This will help with backup and restore as well, because the second copy doesn't need backed up, or if it is backed up, it appears identical to the deduplication that is going to look present in both backup systems. Simultaneously, we're improving the Vertica networking service to support this new access pattern. In conjunction with identical storage files, we will converge to a recovery system that instantaneous nodes can process queries immediately, by retrieving data they need over the network from the redundant copies as they do in Eon day with even higher performance. The final step then is to unify the catalog and transaction model. Related concepts such as segment and shard, local catalog and shard catalog will be coalesced, as they're really represented the same concepts all along, just in different modes. In the catalog, we'll make slight changes to the definition of a projection, which represents the physical storage organization. The new definition simplifies segmentation and introduces valuable granularities of sharding to support evolution over time, and offers a straightforward migration path for both Eon and enterprise. There's a lot more to our Eon story than just the architectural roadmap. If you missed yesterday's Vertica, in Eon mode presentation about supported cloud, on premise storage option, replays are available. Be sure to catch the upcoming presentation on sizing and configuring vertica and in beyond doors. As we've seen with Eon, Vertica can separate data storage from the compute nodes, allowing machines to quickly fill in for each other, to rebuild fault tolerance. But separating compute and storage is used for much, much more. We now offer powerful, flexible ways for Vertica to add servers and increase access to the data. Vertica nine, this feature is called sub-clusters. It allows computing capacity to be added quickly and incrementally, and isolates workloads from each other. If your exploratory analytics team needs direct access to the source data, they need a lot of machines and not the same number all the time, and you don't 100% trust the kind of queries and user defined functions, they might be using sub-clusters as the solution. While there's much more expensive information available in our other presentation. I'd like to point out the highlights of our latest sub-cluster best practices. We suggest having a primary sub-cluster, this is the one that runs all the time, if you're loading data around the clock. It should be sized for the ETL workloads and also determines the natural shard count. Additional read oriented secondary sub-clusters can be added for real time dashboards, reports and analytics. That way, subclusters can be added or deep provisioned, without disruption to other users. The sub-cluster features of Vertica 9.3 are working well for customers. Yesterday, the Trade Desk presented their use case for Vertica over 300,000 in 5 sub clusters running in the cloud. If you missed a presentation, check out the replay. But we have plans beyond sub-clusters, we're extending sub-clusters to real clusters. For the Vertica savvy, this means the clusters bump, share the same spread ring network. This will provide further isolation, allowing clusters to control their own independent data sets. While replicating all are part of the data from other clusters using a publish subscribe mechanism. Synchronizing data between clusters is a feature customers want to understand the real business for themselves. This vision effects are designed for ancillary aspects, how we should assign resource pools, security policies and balance client connection. We will be simplifying our data segmentation strategy, so that when data that originate in the different clusters meet, they'll still get fully optimized joins, even if those clusters weren't positioned with the same number of nodes per shard. Having a broad vision for data management is a key component to political success. But we also take pride in our execution strategy, when you start a new database from scratch as we did 15 years ago, you won't compete on features. Our key competitive points where speed and scale of analytics, we set a target of 100 x better query performance in traditional databases with path loads. Our storage architecture provides a solid foundation on which to build toward these goals. Every query starts with data retrieval, keeping data sorted, organized by column and compressed by using adaptive caching, to keep the data retrieval time in IO to the bare minimum theoretically required. We also keep the data close to where it will be processed, and you clusters the machines to increase throughput. We have partition pruning a robust optimizer evaluate active use segmentation as part of the physical database designed to keep records close to the other relevant records. So the solid foundation, but we also need optimal execution strategies and tactics. One execution strategy which we built for a long time, but it's still a source of pride, it's how we process expressions. Databases and other systems with general purpose expression evaluators, write a compound expression into a tree. Here I'm using A plus one times B as an example, during execution, if your CPU traverses the tree and compute sub-parts from the whole. Tree traversal often takes more compute cycles than the actual work to be done. Especially in evaluation is a very common operation, so something worth optimizing. One instinct that engineers have is to use what we call, just-in-time or JIT compilation, which means generating code form the CPU into the specific activity expression, and add them. This replaces the tree of boxes that are custom made box for the query. This approach has complexity bugs, but it can be made to work. It has other drawbacks though, it adds a lot to query setup time, especially for short queries. And it pretty much eliminate the ability of mere models, mere mortals to develop user defined functions. If you go back to the problem we're trying to solve, the source of the overhead is the tree traversal. If you increase the batch of records processed in each traversal step, this overhead is amortized until it becomes negligible. It's a perfect match for a columnar storage engine. This also sets the CPU up for efficiency. The CPUs look particularly good, at following the same small sequence of instructions in a tight loop. In some cases, the CPU may even be able to vectorize, and apply the same processing to multiple records to the same instruction. This approach is easy to implement and debug, user defined functions are possible, then generally aligned with the other complexities of implementing and improving a large system. More importantly, the performance, both in terms of query setup and record throughput is dramatically improved. You'll hear me say that we look at research and industry for inspiration. In this case, our findings in line with academic binding. If you'd like to read papers, I recommend everything you always wanted to know about compiled and vectorized queries, don't afraid to ask, so we did have this idea before we read that paper. However, not every decision we made in the Vertica executer that the test of time as well as the expression evaluator. For example, sorting and grouping aren't susceptible to vectorization because sort decisions interrupt the flow. We have used JIT compiling on that for years, and Vertica 401, and it provides modest setups, but we know we can do even better. But who we've embarked on a new design for execution engine, which I call EE five, because it's our best. It's really designed especially for the cloud, now I know what you're thinking, you're thinking, I just put up a slide with an old engine, a new engine, and a sleek play headed up into the clouds. But this isn't just marketing hype, here's what I mean, when I say we've learned lessons over the years, and then we're redesigning the executer for the cloud. And of course, you'll see that the new design works well on premises as well. These changes are just more important for the cloud. Starting with the network layer in the cloud, we can't count on all nodes being connected to the same switch. Multicast doesn't work like it does in a custom data center, so as I mentioned earlier, we're redesigning the network transfer layer for the cloud. Storage in the cloud is different, and I'm not referring here to the storage of persistent data, but to the storage of temporary data used only once during the course of query execution. Our new pattern is designed to take into account the strengths and weaknesses of cloud object storage, where we can't easily do a path. Moving on to memory, many of our access patterns are reasonably effective on bare metal machines, that aren't the best choice on cloud hyperbug that have overheads, page faults or big gap. Here again, we found we can improve performance, a bit on dedicated hardware, and even more in the cloud. Finally, and this is true in all environments, core counts have gone up. And not all of our algorithms take full advantage, there's a lot of ground to cover here. But I think sorting in the perfect example to illustrate these points, I mentioned that we use JIT in sorting. We're getting rid of JIT in favor of a data format that can be treated efficiently, independent of what the data types are. We've drawn on the best, most modern technology from academia and industry. We've got our own analysis and testing, you know what we chose, we chose parallel merge sort, anyone wants to take a guess when merge sort was invented. It was invented in 1948, or at least documented that way, like computing context. If you've heard me talk before, you know that I'm fascinated by how all the things I worked with as an engineer, were invented before I was born. And in Vertica , we don't use the newest technologies, we use the best ones. And what is noble about Vertica is the way we've combined the best ideas together into a cohesive package. So all kidding about the 1940s aside, or he redesigned is actually state of the art. How do we know the sort routine is state of the art? It turns out, there's a pretty credible benchmark or at the appropriately named historic sortbenchmark.org. Anyone with resources looking for fame for their product or academic paper can try to set the record. Record is last set in 2016 with Tencent Sort, 100 terabytes in 99 seconds. Setting the records it's hard, you have to come up with hundreds of machines on a dedicated high speed switching fabric. There's a lot to a distributed sort, there all have core sorting algorithms. The authors of the paper conveniently broke out of the time spent in their sort, 67 out of 99 seconds want to know local sorting. If we break this out, divided by two CPUs and each of 512 nodes, we find that each CPU so there's almost a gig and a half per second. This is for what's called an indy sort, like an Indy race car, is in general purpose. It only handles fixed hundred five records with 10 byte key. There is a record length can vary, then it's called daytona sort, a 10 set daytona sort, is a little slower. One point is 10 gigabytes per second per CPU, now for Verrtica, We have a wide variety ability in record sizes, and more interesting data types, but still no harm in setting us like phone numbers, comfortable to the world record. On my 2017 era AMD desktop CPU, the Vertica EE5 sort to store about two and a half gigabytes per second. Obviously, this test isn't apply to apples because they use their own open power chip. But the number of DRM channels is the same, so it's pretty close the number that says we've hit on the right approach. And it performs this way on premise, in the cloud, and we can adapt it to cloud temp space. So what's our roadmap for integrating EE5 into the product and compare replacing the query executed the database to replacing the crankshaft and other parts of the engine of a car while it's been driven. We've actually done it before, between Vertica three and a half and five, and then we never really stopped changing it, now we'll do it again. The first part in replacing with algorithm called storage merge, which combines sorted data from disk. The first time has was two that are in vertical in incoming 10.0 patch that will be EE5 or resegmented storage merge, and then convert sorting and grouping into do out. There the performance results so far, in cases where the Vertica execute is doing well today, simple environments with simple data patterns, such as this simple capitalistic query, there's a lot of speed up, when we ship the segmentation code, which didn't quite make the freeze as much like to bump longer term, what we do is grouping into the storage of large operations, we'll get to where we think we ought to be, given a theoretical minimum work the CPUs need to do. Now if we look at a case where the current execution isn't doing as well, we see there's a much stronger benefit to the code shipping in Vertica 10. In fact, it turns a chart bar sideways to try to help you see the difference better. This case also benefit from the improvements in 10 product point releases and beyond. They will not happening to the vertical query executer, That was just the taste. But now I'd like to switch to the roadmap first for our adapters layer. I'll start with a story about, how our storage access layer evolved. If you go back to the academic ideas, if you start paper that persuaded investors to fund Vertica, read optimized store was the part that had substantiation in the form of performance data. Much of the paper was speculative, but we tried to follow it anyway. That paper talked about the WS with RS, The rights are in the read store, and how they work together for transaction processing and how there was a supernova. In all honesty, Vertica engineers couldn't figure out from the paper what to do next, incase you want to try, and we asked them they would like, We never got enough clarification to build it that way. But here's what we built, instead. We built the ROS, read optimized store, introduction on steep major revision. It's sorted, ordered columnar and compressed that follows a table partitioning that worked even better than the we are as described in the paper. We also built the last byte optimized store, we built four versions of this over the years actually. But this was the best one, it's not a set of interrelated V tree. It's just an append only, insertion order remember your way here, am sorry, no compression, no base, no partitioning. There is, however, a tuple over which does what we call move out. Move the data from WOS to ROS, sorting and compressing. Let's take a moment to compare how they behave, when you load data directly to the ROS, there's a data parsing operation. Then we finished the sorting, and then compressing right out the columnar data files to stay storage. The next query through executes against the ROS and it runs as it should because the ROS is read optimized. Let's repeat the exercise for WOS, the load operation response before the sorting and compressing, and before the data is written to persistent storage. Now it's possible for a query to come along, and the query could be responsible for sorting the lost data in addition to its other processes. Effect on query isn't predictable until the TM comes along and writes the data to the ROS. Over the years, we've done a lot of comparisons between ROS and WOS. ROS has always been better for sustained load throughput, it achieves much higher records per second without pushing back against the client and hasn't Vertica for when we developed the first usable merge out algorithm. ROS has always been better for predictable query performance, the ROS has never had the same management complexity and limitations as WOS. You don't have to pick a memory size and figure out which transactions get to use the pool. A non persistent nature of ROS always cause headaches when there are unexpected cluster shutdowns. We also looked at field usage data, we found that few customers were using a lot, especially among those that studied the issue carefully. So how we set out on a mission to improve the ROS to the point where it was always better than both the WOS and the profit of the past. And now it's true, ROS is better than the WOS and the loss of a couple of years ago. We implemented storage bundling, better catalog object storage and better tuple mover merge outs. And now, after extensive Q&A and customer testing, we've now succeeded, and in Vertica 10, we've removed the whys. Let's talk for a moment about simplicity, one of the best things Mike Stonebreaker said is no knobs. Anyone want to guess how many knobs we got rid of, and we took the WOS out of the product. 22 were five knobs to control whether it didn't went to ROS as well. Six controlling the ROS itself, Six more to set policies for the typical remove out and so on. In my honest opinion is still wasn't enough control over to achieve excess in a multi tenant environment, the big reason to get rid of the WOS for simplicity. Make the lives of DBAs and users better, we have a long way to go, but we're doing it. On my desk, I keep a jar with the knob in it for each knob in Vertica. When developers add a knob to the product, they have to add a knob to the jar. When they remove a knob, they get to choose one to take out, We have a lot of work to do, but I'm thrilled to report that in 15 years 10 is the first release with a number of knobs ticked downward. Get back to the WOS, I've said the most important thing get rid of it for last. We're getting rid of it so we can deliver our vision of the future to our customer. Remember how he said an Eon and sub-clusters we got all these benefits from shared storage? Guess what can't live in shared storage, the WOS. Remember how it's been a big part of the future was keeping the copies that identical to the primary copy? Independent actions of the WOS took a little at the root of the divergence between copies of the data. You have to admit it when you're wrong. That was in the original design and held up to the a selling point of time, without onto the idea of a separate ROS and WOS for too long. In Vertica, 10, we can finally bid, good reagents. I've covered a lot of ground, so let's put all the pieces together. I've talked a lot about our vision and how we're achieving it. But we also still pay attention to tactical detail. We've been fine tuning our memory management model to enhance performance. That involves revisiting tens of thousands of satellite of code, much like painting the inside of a large building with small paintbrushes. We're getting results as shown in the chart in Vertica nine, concurrent monitoring queries use memory from the global catalog tool, and Vertica 10, they don't. This is only one example of an important detail we're improving. We've also reworked the monitoring tables without network messages into two parts. The increased data we're collecting and analyzing and our quality assurance processes, we're improving on everything. As the story goes, I still have my grandfather's axe, of course, my father had to replace the handle, and I had to replace the head. Along the same lines, we still have Mike Stonebreaker Vertica. We didn't replace the query optimizer twice the debate database designer and storage layer four times each. The query executed is and it's a free design, like charted out how our code has changed over the years. I found that we don't have much from a long time ago, I did some digging, and you know what we have left in 2007. We have the original curly braces, and a little bit of percent code for handling dates and times. To deliver on our mission to help customers get value from their structured data, with high performance of scale, and in diverse deployment environments. We have the sound architecture roadmap, reviews the best execution strategy and solid tactics. On the architectural front, we're converging in an enterprise, we're extending smart analytic clusters. In query processing, we're redesigning the execution engine for the cloud, as I've told you. There's a lot more than just the fast engine. that you want to learn about our new data support for complex data types, improvements to the query optimizer statistics, or extension to live aggregate projections and flatten tables. You should check out some of the other engineering talk that the big data conference. We continue to stay on top of the details from low level CPU and memory too, to the monitoring management, developing tighter feedback cycles between development, Q&A and customers. And don't forget to check out the rest of the pillars of our roadmap. We have new easier ways to get started with Vertica in the cloud. Engineers have been hard at work on machine learning and security. It's easier than ever to use Vertica with third Party product, as a variety of tools integrations continues to increase. Finally, the most important thing we can do, is to help people get value from structured data to help people learn more about Vertica. So hopefully I left plenty of time for Q&A at the end of this presentation. I hope to hear your questions soon.

Published Date : Mar 30 2020

SUMMARY :

and keep the conversation going, and apply the same processing to multiple records

ENTITIES

Entity	Category	Confidence
Mike	PERSON	0.99+
Mike Stonebreaker	PERSON	0.99+
2007	DATE	0.99+
Chuck Bear	PERSON	0.99+
Vertica	ORGANIZATION	0.99+
2016	DATE	0.99+
Paige Roberts	PERSON	0.99+
Chuck	PERSON	0.99+
second copy	QUANTITY	0.99+
99 seconds	QUANTITY	0.99+
67	QUANTITY	0.99+
100%	QUANTITY	0.99+
1948	DATE	0.99+
Ben	PERSON	0.99+
two modes	QUANTITY	0.99+
Redpharma	ORGANIZATION	0.99+
first time	QUANTITY	0.99+
first steps	QUANTITY	0.99+
Paige	PERSON	0.99+
two parts	QUANTITY	0.99+
First	QUANTITY	0.99+
five knobs	QUANTITY	0.99+
100 terabytes	QUANTITY	0.99+
both copies	QUANTITY	0.99+
Today	DATE	0.99+
each knob	QUANTITY	0.99+
WS	ORGANIZATION	0.99+
AMD	ORGANIZATION	0.99+
Eon	ORGANIZATION	0.99+
1940s	DATE	0.99+
today	DATE	0.99+
One point	QUANTITY	0.99+
first part	QUANTITY	0.99+
fifth level	QUANTITY	0.99+
each	QUANTITY	0.99+
yesterday	DATE	0.98+
both	QUANTITY	0.98+
Six	QUANTITY	0.98+
first	QUANTITY	0.98+
512 nodes	QUANTITY	0.98+
ROS	TITLE	0.98+
over 10 years	QUANTITY	0.98+
Yesterday	DATE	0.98+
15 years ago	DATE	0.98+
twice	QUANTITY	0.98+
sortbenchmark.org	OTHER	0.98+
first release	QUANTITY	0.98+
two CPUs	QUANTITY	0.97+
Vertica 10	TITLE	0.97+
100 x	QUANTITY	0.97+
WOS	TITLE	0.97+
vertica.com	OTHER	0.97+
10 byte	QUANTITY	0.97+
this week	DATE	0.97+
one	QUANTITY	0.97+
5 sub clusters	QUANTITY	0.97+
two	QUANTITY	0.97+
one example	QUANTITY	0.97+
over 300,000	QUANTITY	0.96+
Dr.	PERSON	0.96+
One	QUANTITY	0.96+
tens of thousands of satellite	QUANTITY	0.96+
EE5	COMMERCIAL_ITEM	0.96+
fifth generation	QUANTITY	0.96+

Paula D'Amico, Webster Bank | Io Tahoe | Enterprise Data Automation

>> Narrator: From around the Globe, it's theCube with digital coverage of Enterprise Data Automation, and event series brought to you by Io-Tahoe. >> Everybody, we're back. And this is Dave Vellante, and we're covering the whole notion of Automated Data in the Enterprise. And I'm really excited to have Paula D'Amico here. Senior Vice President of Enterprise Data Architecture at Webster Bank. Paula, good to see you. Thanks for coming on. >> Hi, nice to see you, too. >> Let's start with Webster bank. You guys are kind of a regional I think New York, New England, believe it's headquartered out of Connecticut. But tell us a little bit about the bank. >> Webster bank is regional Boston, Connecticut, and New York. Very focused on in Westchester and Fairfield County. They are a really highly rated regional bank for this area. They hold quite a few awards for the area for being supportive for the community, and are really moving forward technology wise, they really want to be a data driven bank, and they want to move into a more robust group. >> We got a lot to talk about. So data driven is an interesting topic and your role as Data Architecture, is really Senior Vice President Data Architecture. So you got a big responsibility as it relates to kind of transitioning to this digital data driven bank but tell us a little bit about your role in your Organization. >> Currently, today, we have a small group that is just working toward moving into a more futuristic, more data driven data warehousing. That's our first item. And then the other item is to drive new revenue by anticipating what customers do, when they go to the bank or when they log in to their account, to be able to give them the best offer. And the only way to do that is you have timely, accurate, complete data on the customer and what's really a great value on offer something to offer that, or a new product, or to help them continue to grow their savings, or do and grow their investments. >> Okay, and I really want to get into that. But before we do, and I know you're, sort of partway through your journey, you got a lot to do. But I want to ask you about Covid, how you guys handling that? You had the government coming down and small business loans and PPP, and huge volume of business and sort of data was at the heart of that. How did you manage through that? >> We were extremely successful, because we have a big, dedicated team that understands where their data is and was able to switch much faster than a larger bank, to be able to offer the PPP Long's out to our customers within lightning speed. And part of that was is we adapted to Salesforce very for we've had Salesforce in house for over 15 years. Pretty much that was the driving vehicle to get our PPP loans in, and then developing logic quickly, but it was a 24 seven development role and get the data moving on helping our customers fill out the forms. And a lot of that was manual, but it was a large community effort. >> Think about that too. The volume was probably much higher than the volume of loans to small businesses that you're used to granting and then also the initial guidelines were very opaque. You really didn't know what the rules were, but you were expected to enforce them. And then finally, you got more clarity. So you had to essentially code that logic into the system in real time. >> I wasn't directly involved, but part of my data movement team was, and we had to change the logic overnight. So it was on a Friday night it was released, we pushed our first set of loans through, and then the logic changed from coming from the government, it changed and we had to redevelop our data movement pieces again, and we design them and send them back through. So it was definitely kind of scary, but we were completely successful. We hit a very high peak. Again, I don't know the exact number but it was in the thousands of loans, from little loans to very large loans and not one customer who applied did not get what they needed for, that was the right process and filled out the right amount. >> Well, that is an amazing story and really great support for the region, your Connecticut, the Boston area. So that's fantastic. I want to get into the rest of your story now. Let's start with some of the business drivers in banking. I mean, obviously online. A lot of people have sort of joked that many of the older people, who kind of shunned online banking would love to go into the branch and see their friendly teller had no choice, during this pandemic, to go to online. So that's obviously a big trend you mentioned, the data driven data warehouse, I want to understand that, but what at the top level, what are some of the key business drivers that are catalyzing your desire for change? >> The ability to give a customer, what they need at the time when they need it. And what I mean by that is that we have customer interactions in multiple ways. And I want to be able for the customer to walk into a bank or online and see the same format, and being able to have the same feel the same love, and also to be able to offer them the next best offer for them. But they're if they want looking for a new mortgage or looking to refinance, or whatever it is that they have that data, we have the data and that they feel comfortable using it. And that's an untethered banker. Attitude is, whatever my banker is holding and whatever the person is holding in their phone, that is the same and it's comfortable. So they don't feel that they've walked into the bank and they have to do fill out different paperwork compared to filling out paperwork on just doing it on their phone. >> You actually do want the experience to be better. And it is in many cases. Now you weren't able to do this with your existing I guess mainframe based Enterprise Data Warehouses. Is that right? Maybe talk about that a little bit? >> Yeah, we were definitely able to do it with what we have today the technology we're using. But one of the issues is that it's not timely. And you need a timely process to be able to get the customers to understand what's happening. You need a timely process so we can enhance our risk management. We can apply for fraud issues and things like that. >> Yeah, so you're trying to get more real time. The traditional EDW. It's sort of a science project. There's a few experts that know how to get it. You can so line up, the demand is tremendous. And then oftentimes by the time you get the answer, it's outdated. So you're trying to address that problem. So part of it is really the cycle time the end to end cycle time that you're progressing. And then there's, if I understand it residual benefits that are pretty substantial from a revenue opportunity, other offers that you can make to the right customer, that you maybe know, through your data, is that right? >> Exactly. It's drive new customers to new opportunities. It's enhanced the risk, and it's to optimize the banking process, and then obviously, to create new business. And the only way we're going to be able to do that is if we have the ability to look at the data right when the customer walks in the door or right when they open up their app. And by doing creating more to New York times near real time data, or the data warehouse team that's giving the lines of business the ability to work on the next best offer for that customer as well. >> But Paula, we're inundated with data sources these days. Are there other data sources that maybe had access to before, but perhaps the backlog of ingesting and cleaning in cataloging and analyzing maybe the backlog was so great that you couldn't perhaps tap some of those data sources. Do you see the potential to increase the data sources and hence the quality of the data or is that sort of premature? >> Oh, no. Exactly. Right. So right now, we ingest a lot of flat files and from our mainframe type of front end system, that we've had for quite a few years. But now that we're moving to the cloud and off-prem and on-prem, moving off-prem, into like an S3 Bucket, where that data we can process that data and get that data faster by using real time tools to move that data into a place where, like snowflake could utilize that data, or we can give it out to our market. Right now we're about we do work in batch mode still. So we're doing 24 hours. >> Okay. So when I think about the data pipeline, and the people involved, maybe you could talk a little bit about the organization. You've got, I don't know, if you have data scientists or statisticians, I'm sure you do. You got data architects, data engineers, quality engineers, developers, etc. And oftentimes, practitioners like yourself, will stress about, hey, the data is in silos. The data quality is not where we want it to be. We have to manually categorize the data. These are all sort of common data pipeline problems, if you will. Sometimes we use the term data Ops, which is sort of a play on DevOps applied to the data pipeline. Can you just sort of describe your situation in that context? >> Yeah, so we have a very large data ops team. And everyone that who is working on the data part of Webster's Bank, has been there 13 to 14 years. So they get the data, they understand it, they understand the lines of business. So it's right now. We could the we have data quality issues, just like everybody else does. But we have places in them where that gets cleansed. And we're moving toward and there was very much siloed data. The data scientists are out in the lines of business right now, which is great, because I think that's where data science belongs, we should give them and that's what we're working towards now is giving them more self service, giving them the ability to access the data in a more robust way. And it's a single source of truth. So they're not pulling the data down into their own, like Tableau dashboards, and then pushing the data back out. So they're going to more not, I don't want to say, a central repository, but a more of a robust repository, that's controlled across multiple avenues, where multiple lines of business can access that data. Is that help? >> Got it, Yes. And I think that one of the key things that I'm taking away from your last comment, is the cultural aspects of this by having the data scientists in the line of business, the lines of business will feel ownership of that data as opposed to pointing fingers criticizing the data quality. They really own that that problem, as opposed to saying, well, it's Paula's problem. >> Well, I have my problem is I have data engineers, data architects, database administrators, traditional data reporting people. And because some customers that I have that are business customers lines of business, they want to just subscribe to a report, they don't want to go out and do any data science work. And we still have to provide that. So we still want to provide them some kind of regiment that they wake up in the morning, and they open up their email, and there's the report that they subscribe to, which is great, and it works out really well. And one of the things is why we purchased Io-Tahoe was, I would have the ability to give the lines of business, the ability to do search within the data. And we'll read the data flows and data redundancy and things like that, and help me clean up the data. And also, to give it to the data analysts who say, all right, they just asked me they want this certain report. And it used to take okay, four weeks we're going to go and we're going to look at the data and then we'll come back and tell you what we can do. But now with Io-Tahoe, they're able to look at the data, and then in one or two days, they'll be able to go back and say, Yes, we have the data, this is where it is. This is where we found it. This is the data flows that we found also, which is what I call it, is the break of a column. It's where the column was created, and where it went to live as a teenager. (laughs) And then it went to die, where we archive it. And, yeah, it's this cycle of life for a column. And Io-Tahoe helps us do that. And we do data lineage is done all the time. And it's just takes a very long time and that's why we're using something that has AI in it and machine running. It's accurate, it does it the same way over and over again. If an analyst leaves, you're able to utilize something like Io-Tahoe to be able to do that work for you. Is that help? >> Yeah, so got it. So a couple things there, in researching Io-Tahoe, it seems like one of the strengths of their platform is the ability to visualize data, the data structure and actually dig into it, but also see it. And that speeds things up and gives everybody additional confidence. And then the other piece is essentially infusing AI or machine intelligence into the data pipeline, is really how you're attacking automation. And you're saying it repeatable, and then that helps the data quality and you have this virtual cycle. Maybe you could sort of affirm that and add some color, perhaps. >> Exactly. So you're able to let's say that I have seven cars, lines of business that are asking me questions, and one of the questions they'll ask me is, we want to know, if this customer is okay to contact, and there's different avenues so you can go online, do not contact me, you can go to the bank and you can say, I don't want email, but I'll take texts. And I want no phone calls. All that information. So, seven different lines of business asked me that question in different ways. One said, "No okay to contact" the other one says, "Customer 123." All these. In each project before I got there used to be siloed. So one customer would be 100 hours for them to do that analytical work, and then another analyst would do another 100 hours on the other project. Well, now I can do that all at once. And I can do those types of searches and say, Yes, we already have that documentation. Here it is, and this is where you can find where the customer has said, "No, I don't want to get access from you by email or I've subscribed to get emails from you." >> Got it. Okay. Yeah Okay. And then I want to go back to the cloud a little bit. So you mentioned S3 Buckets. So you're moving to the Amazon cloud, at least, I'm sure you're going to get a hybrid situation there. You mentioned snowflake. What was sort of the decision to move to the cloud? Obviously, snowflake is cloud only. There's not an on-prem, version there. So what precipitated that? >> Alright, so from I've been in the data IT information field for the last 35 years. I started in the US Air Force, and have moved on from since then. And my experience with Bob Graham, was with snowflake with working with GE Capital. And that's where I met up with the team from Io-Tahoe as well. And so it's a proven so there's a couple of things one is Informatica, is worldwide known to move data. They have two products, they have the on-prem and the off-prem. I've used the on-prem and off-prem, they're both great. And it's very stable, and I'm comfortable with it. Other people are very comfortable with it. So we picked that as our batch data movement. We're moving toward probably HVR. It's not a total decision yet. But we're moving to HVR for real time data, which is changed capture data, moves it into the cloud. And then, so you're envisioning this right now. In which is you're in the S3, and you have all the data that you could possibly want. And that's JSON, all that everything is sitting in the S3 to be able to move it through into snowflake. And snowflake has proven to have a stability. You only need to learn and train your team with one thing. AWS as is completely stable at this point too. So all these avenues if you think about it, is going through from, this is your data lake, which is I would consider your S3. And even though it's not a traditional data lake like, you can touch it like a Progressive or Hadoop. And then into snowflake and then from snowflake into sandbox and so your lines of business and your data scientists just dive right in. That makes a big win. And then using Io-Tahoe with the data automation, and also their search engine. I have the ability to give the data scientists and data analysts the way of they don't need to talk to IT to get accurate information or completely accurate information from the structure. And we'll be right back. >> Yeah, so talking about snowflake and getting up to speed quickly. I know from talking to customers you can get from zero to snowflake very fast and then it sounds like the Io-Tahoe is sort of the automation cloud for your data pipeline within the cloud. Is that the right way to think about it? >> I think so. Right now I have Io-Tahoe attached to my on-prem. And I want to attach it to my off-prem eventually. So I'm using Io-Tahoe data automation right now, to bring in the data, and to start analyzing the data flows to make sure that I'm not missing anything, and that I'm not bringing over redundant data. The data warehouse that I'm working of, it's an on-prem. It's an Oracle Database, and it's 15 years old. So it has extra data in it. It has things that we don't need anymore, and Io-Tahoe's helping me shake out that extra data that does not need to be moved into my S3. So it's saving me money, when I'm moving from off-prem to on-prem. >> And so that was a challenge prior, because you couldn't get the lines of business to agree what to delete, or what was the issue there? >> Oh, it was more than that. Each line of business had their own structure within the warehouse. And then they were copying data between each other, and duplicating the data and using that. So there could be possibly three tables that have the same data in it, but it's used for different lines of business. We have identified using Io-Tahoe identified over seven terabytes in the last two months on data that has just been repetitive. It's the same exact data just sitting in a different schema. And that's not easy to find, if you only understand one schema, that's reporting for that line of business. >> More bad news for the storage companies out there. (both laughs) So far. >> It's cheap. That's what we were telling people. >> And it's true, but you still would rather not waste it, you'd like to apply it to drive more revenue. And so, I guess, let's close on where you see this thing going. Again, I know you're sort of partway through the journey, maybe you could sort of describe, where you see the phase is going and really what you want to get out of this thing, down the road, mid-term, longer term, what's your vision or your data driven organization. >> I want for the bankers to be able to walk around with an iPad in their hand, and be able to access data for that customer, really fast and be able to give them the best deal that they can get. I want Webster to be right there on top with being able to add new customers, and to be able to serve our existing customers who had bank accounts since they were 12 years old there and now our multi whatever. I want them to be able to have the best experience with our bankers. >> That's awesome. That's really what I want as a banking customer. I want my bank to know who I am, anticipate my needs, and create a great experience for me. And then let me go on with my life. And so that follow. Great story. Love your experience, your background and your knowledge. I can't thank you enough for coming on theCube. >> Now, thank you very much. And you guys have a great day. >> All right, take care. And thank you for watching everybody. Keep right there. We'll take a short break and be right back. (gentle music)

Published Date : Jun 23 2020

SUMMARY :

to you by Io-Tahoe. And I'm really excited to of a regional I think and they want to move it relates to kind of transitioning And the only way to do But I want to ask you about Covid, and get the data moving And then finally, you got more clarity. and filled out the right amount. and really great support for the region, and being able to have the experience to be better. to be able to get the customers that know how to get it. and it's to optimize the banking process, and analyzing maybe the backlog was and get that data faster and the people involved, And everyone that who is working is the cultural aspects of this the ability to do search within the data. and you have this virtual cycle. and one of the questions And then I want to go back in the S3 to be able to move it Is that the right way to think about it? and to start analyzing the data flows and duplicating the data and using that. More bad news for the That's what we were telling people. and really what you want and to be able to serve And so that follow. And you guys have a great day. And thank you for watching everybody.

ENTITIES

Entity	Category	Confidence
Dave Vellante	PERSON	0.99+
Paula D'Amico	PERSON	0.99+
Paula	PERSON	0.99+
Connecticut	LOCATION	0.99+
Westchester	LOCATION	0.99+
Informatica	ORGANIZATION	0.99+
24 hours	QUANTITY	0.99+
one	QUANTITY	0.99+
13	QUANTITY	0.99+
thousands	QUANTITY	0.99+
100 hours	QUANTITY	0.99+
Bob Graham	PERSON	0.99+
iPad	COMMERCIAL_ITEM	0.99+
Webster Bank	ORGANIZATION	0.99+
GE Capital	ORGANIZATION	0.99+
first item	QUANTITY	0.99+
AWS	ORGANIZATION	0.99+
two products	QUANTITY	0.99+
seven	QUANTITY	0.99+
New York	LOCATION	0.99+
Boston	LOCATION	0.99+
three tables	QUANTITY	0.99+
Each line	QUANTITY	0.99+
first set	QUANTITY	0.99+
two days	QUANTITY	0.99+
DevOps	TITLE	0.99+
Webster bank	ORGANIZATION	0.99+
14 years	QUANTITY	0.99+
over 15 years	QUANTITY	0.99+
seven cars	QUANTITY	0.98+
each project	QUANTITY	0.98+
Friday night	DATE	0.98+
Enterprise Data Automation	ORGANIZATION	0.98+
New England	LOCATION	0.98+
Io-Tahoe	ORGANIZATION	0.98+
today	DATE	0.98+
Webster's Bank	ORGANIZATION	0.98+
one schema	QUANTITY	0.97+
Fairfield County	LOCATION	0.97+
One	QUANTITY	0.97+
one customer	QUANTITY	0.97+
over seven terabytes	QUANTITY	0.97+
Salesforce	ORGANIZATION	0.96+
both	QUANTITY	0.95+
single source	QUANTITY	0.93+
one thing	QUANTITY	0.93+
US Air Force	ORGANIZATION	0.93+
Webster	ORGANIZATION	0.92+
S3	COMMERCIAL_ITEM	0.92+
Enterprise Data Architecture	ORGANIZATION	0.91+
Io Tahoe	PERSON	0.91+
Oracle	ORGANIZATION	0.9+
15 years old	QUANTITY	0.9+
Io-Tahoe	PERSON	0.89+
12 years old	QUANTITY	0.88+
Tableau	TITLE	0.87+
four weeks	QUANTITY	0.86+
S3 Buckets	COMMERCIAL_ITEM	0.84+
Covid	PERSON	0.81+
Data Architecture	ORGANIZATION	0.79+
JSON	TITLE	0.79+
Senior Vice President	PERSON	0.78+
24 seven development role	QUANTITY	0.77+
last 35 years	DATE	0.77+
both laughs	QUANTITY	0.75+
Io-Tahoe	TITLE	0.73+
each	QUANTITY	0.72+
loans	QUANTITY	0.71+
zero	QUANTITY	0.71+

Paula D'Amico, Webster Bank

>> Narrator: From around the Globe, it's theCube with digital coverage of Enterprise Data Automation, and event series brought to you by Io-Tahoe. >> Everybody, we're back. And this is Dave Vellante, and we're covering the whole notion of Automated Data in the Enterprise. And I'm really excited to have Paula D'Amico here. Senior Vice President of Enterprise Data Architecture at Webster Bank. Paula, good to see you. Thanks for coming on. >> Hi, nice to see you, too. >> Let's start with Webster bank. You guys are kind of a regional I think New York, New England, believe it's headquartered out of Connecticut. But tell us a little bit about the bank. >> Webster bank is regional Boston, Connecticut, and New York. Very focused on in Westchester and Fairfield County. They are a really highly rated regional bank for this area. They hold quite a few awards for the area for being supportive for the community, and are really moving forward technology wise, they really want to be a data driven bank, and they want to move into a more robust group. >> We got a lot to talk about. So data driven is an interesting topic and your role as Data Architecture, is really Senior Vice President Data Architecture. So you got a big responsibility as it relates to kind of transitioning to this digital data driven bank but tell us a little bit about your role in your Organization. >> Currently, today, we have a small group that is just working toward moving into a more futuristic, more data driven data warehousing. That's our first item. And then the other item is to drive new revenue by anticipating what customers do, when they go to the bank or when they log in to their account, to be able to give them the best offer. And the only way to do that is you have timely, accurate, complete data on the customer and what's really a great value on offer something to offer that, or a new product, or to help them continue to grow their savings, or do and grow their investments. >> Okay, and I really want to get into that. But before we do, and I know you're, sort of partway through your journey, you got a lot to do. But I want to ask you about Covid, how you guys handling that? You had the government coming down and small business loans and PPP, and huge volume of business and sort of data was at the heart of that. How did you manage through that? >> We were extremely successful, because we have a big, dedicated team that understands where their data is and was able to switch much faster than a larger bank, to be able to offer the PPP Long's out to our customers within lightning speed. And part of that was is we adapted to Salesforce very for we've had Salesforce in house for over 15 years. Pretty much that was the driving vehicle to get our PPP loans in, and then developing logic quickly, but it was a 24 seven development role and get the data moving on helping our customers fill out the forms. And a lot of that was manual, but it was a large community effort. >> Think about that too. The volume was probably much higher than the volume of loans to small businesses that you're used to granting and then also the initial guidelines were very opaque. You really didn't know what the rules were, but you were expected to enforce them. And then finally, you got more clarity. So you had to essentially code that logic into the system in real time. >> I wasn't directly involved, but part of my data movement team was, and we had to change the logic overnight. So it was on a Friday night it was released, we pushed our first set of loans through, and then the logic changed from coming from the government, it changed and we had to redevelop our data movement pieces again, and we design them and send them back through. So it was definitely kind of scary, but we were completely successful. We hit a very high peak. Again, I don't know the exact number but it was in the thousands of loans, from little loans to very large loans and not one customer who applied did not get what they needed for, that was the right process and filled out the right amount. >> Well, that is an amazing story and really great support for the region, your Connecticut, the Boston area. So that's fantastic. I want to get into the rest of your story now. Let's start with some of the business drivers in banking. I mean, obviously online. A lot of people have sort of joked that many of the older people, who kind of shunned online banking would love to go into the branch and see their friendly teller had no choice, during this pandemic, to go to online. So that's obviously a big trend you mentioned, the data driven data warehouse, I want to understand that, but what at the top level, what are some of the key business drivers that are catalyzing your desire for change? >> The ability to give a customer, what they need at the time when they need it. And what I mean by that is that we have customer interactions in multiple ways. And I want to be able for the customer to walk into a bank or online and see the same format, and being able to have the same feel the same love, and also to be able to offer them the next best offer for them. But they're if they want looking for a new mortgage or looking to refinance, or whatever it is that they have that data, we have the data and that they feel comfortable using it. And that's an untethered banker. Attitude is, whatever my banker is holding and whatever the person is holding in their phone, that is the same and it's comfortable. So they don't feel that they've walked into the bank and they have to do fill out different paperwork compared to filling out paperwork on just doing it on their phone. >> You actually do want the experience to be better. And it is in many cases. Now you weren't able to do this with your existing I guess mainframe based Enterprise Data Warehouses. Is that right? Maybe talk about that a little bit? >> Yeah, we were definitely able to do it with what we have today the technology we're using. But one of the issues is that it's not timely. And you need a timely process to be able to get the customers to understand what's happening. You need a timely process so we can enhance our risk management. We can apply for fraud issues and things like that. >> Yeah, so you're trying to get more real time. The traditional EDW. It's sort of a science project. There's a few experts that know how to get it. You can so line up, the demand is tremendous. And then oftentimes by the time you get the answer, it's outdated. So you're trying to address that problem. So part of it is really the cycle time the end to end cycle time that you're progressing. And then there's, if I understand it residual benefits that are pretty substantial from a revenue opportunity, other offers that you can make to the right customer, that you maybe know, through your data, is that right? >> Exactly. It's drive new customers to new opportunities. It's enhanced the risk, and it's to optimize the banking process, and then obviously, to create new business. And the only way we're going to be able to do that is if we have the ability to look at the data right when the customer walks in the door or right when they open up their app. And by doing creating more to New York times near real time data, or the data warehouse team that's giving the lines of business the ability to work on the next best offer for that customer as well. >> But Paula, we're inundated with data sources these days. Are there other data sources that maybe had access to before, but perhaps the backlog of ingesting and cleaning in cataloging and analyzing maybe the backlog was so great that you couldn't perhaps tap some of those data sources. Do you see the potential to increase the data sources and hence the quality of the data or is that sort of premature? >> Oh, no. Exactly. Right. So right now, we ingest a lot of flat files and from our mainframe type of front end system, that we've had for quite a few years. But now that we're moving to the cloud and off-prem and on-prem, moving off-prem, into like an S3 Bucket, where that data we can process that data and get that data faster by using real time tools to move that data into a place where, like snowflake could utilize that data, or we can give it out to our market. Right now we're about we do work in batch mode still. So we're doing 24 hours. >> Okay. So when I think about the data pipeline, and the people involved, maybe you could talk a little bit about the organization. You've got, I don't know, if you have data scientists or statisticians, I'm sure you do. You got data architects, data engineers, quality engineers, developers, etc. And oftentimes, practitioners like yourself, will stress about, hey, the data is in silos. The data quality is not where we want it to be. We have to manually categorize the data. These are all sort of common data pipeline problems, if you will. Sometimes we use the term data Ops, which is sort of a play on DevOps applied to the data pipeline. Can you just sort of describe your situation in that context? >> Yeah, so we have a very large data ops team. And everyone that who is working on the data part of Webster's Bank, has been there 13 to 14 years. So they get the data, they understand it, they understand the lines of business. So it's right now. We could the we have data quality issues, just like everybody else does. But we have places in them where that gets cleansed. And we're moving toward and there was very much siloed data. The data scientists are out in the lines of business right now, which is great, because I think that's where data science belongs, we should give them and that's what we're working towards now is giving them more self service, giving them the ability to access the data in a more robust way. And it's a single source of truth. So they're not pulling the data down into their own, like Tableau dashboards, and then pushing the data back out. So they're going to more not, I don't want to say, a central repository, but a more of a robust repository, that's controlled across multiple avenues, where multiple lines of business can access that data. Is that help? >> Got it, Yes. And I think that one of the key things that I'm taking away from your last comment, is the cultural aspects of this by having the data scientists in the line of business, the lines of business will feel ownership of that data as opposed to pointing fingers criticizing the data quality. They really own that that problem, as opposed to saying, well, it's Paula's problem. >> Well, I have my problem is I have data engineers, data architects, database administrators, traditional data reporting people. And because some customers that I have that are business customers lines of business, they want to just subscribe to a report, they don't want to go out and do any data science work. And we still have to provide that. So we still want to provide them some kind of regiment that they wake up in the morning, and they open up their email, and there's the report that they subscribe to, which is great, and it works out really well. And one of the things is why we purchased Io-Tahoe was, I would have the ability to give the lines of business, the ability to do search within the data. And we'll read the data flows and data redundancy and things like that, and help me clean up the data. And also, to give it to the data analysts who say, all right, they just asked me they want this certain report. And it used to take okay, four weeks we're going to go and we're going to look at the data and then we'll come back and tell you what we can do. But now with Io-Tahoe, they're able to look at the data, and then in one or two days, they'll be able to go back and say, Yes, we have the data, this is where it is. This is where we found it. This is the data flows that we found also, which is what I call it, is the break of a column. It's where the column was created, and where it went to live as a teenager. (laughs) And then it went to die, where we archive it. And, yeah, it's this cycle of life for a column. And Io-Tahoe helps us do that. And we do data lineage is done all the time. And it's just takes a very long time and that's why we're using something that has AI in it and machine running. It's accurate, it does it the same way over and over again. If an analyst leaves, you're able to utilize something like Io-Tahoe to be able to do that work for you. Is that help? >> Yeah, so got it. So a couple things there, in researching Io-Tahoe, it seems like one of the strengths of their platform is the ability to visualize data, the data structure and actually dig into it, but also see it. And that speeds things up and gives everybody additional confidence. And then the other piece is essentially infusing AI or machine intelligence into the data pipeline, is really how you're attacking automation. And you're saying it repeatable, and then that helps the data quality and you have this virtual cycle. Maybe you could sort of affirm that and add some color, perhaps. >> Exactly. So you're able to let's say that I have seven cars, lines of business that are asking me questions, and one of the questions they'll ask me is, we want to know, if this customer is okay to contact, and there's different avenues so you can go online, do not contact me, you can go to the bank and you can say, I don't want email, but I'll take texts. And I want no phone calls. All that information. So, seven different lines of business asked me that question in different ways. One said, "No okay to contact" the other one says, "Customer 123." All these. In each project before I got there used to be siloed. So one customer would be 100 hours for them to do that analytical work, and then another analyst would do another 100 hours on the other project. Well, now I can do that all at once. And I can do those types of searches and say, Yes, we already have that documentation. Here it is, and this is where you can find where the customer has said, "No, I don't want to get access from you by email or I've subscribed to get emails from you." >> Got it. Okay. Yeah Okay. And then I want to go back to the cloud a little bit. So you mentioned S3 Buckets. So you're moving to the Amazon cloud, at least, I'm sure you're going to get a hybrid situation there. You mentioned snowflake. What was sort of the decision to move to the cloud? Obviously, snowflake is cloud only. There's not an on-prem, version there. So what precipitated that? >> Alright, so from I've been in the data IT information field for the last 35 years. I started in the US Air Force, and have moved on from since then. And my experience with Bob Graham, was with snowflake with working with GE Capital. And that's where I met up with the team from Io-Tahoe as well. And so it's a proven so there's a couple of things one is Informatica, is worldwide known to move data. They have two products, they have the on-prem and the off-prem. I've used the on-prem and off-prem, they're both great. And it's very stable, and I'm comfortable with it. Other people are very comfortable with it. So we picked that as our batch data movement. We're moving toward probably HVR. It's not a total decision yet. But we're moving to HVR for real time data, which is changed capture data, moves it into the cloud. And then, so you're envisioning this right now. In which is you're in the S3, and you have all the data that you could possibly want. And that's JSON, all that everything is sitting in the S3 to be able to move it through into snowflake. And snowflake has proven to have a stability. You only need to learn and train your team with one thing. AWS as is completely stable at this point too. So all these avenues if you think about it, is going through from, this is your data lake, which is I would consider your S3. And even though it's not a traditional data lake like, you can touch it like a Progressive or Hadoop. And then into snowflake and then from snowflake into sandbox and so your lines of business and your data scientists just dive right in. That makes a big win. And then using Io-Tahoe with the data automation, and also their search engine. I have the ability to give the data scientists and data analysts the way of they don't need to talk to IT to get accurate information or completely accurate information from the structure. And we'll be right back. >> Yeah, so talking about snowflake and getting up to speed quickly. I know from talking to customers you can get from zero to snowflake very fast and then it sounds like the Io-Tahoe is sort of the automation cloud for your data pipeline within the cloud. Is that the right way to think about it? >> I think so. Right now I have Io-Tahoe attached to my on-prem. And I want to attach it to my off-prem eventually. So I'm using Io-Tahoe data automation right now, to bring in the data, and to start analyzing the data flows to make sure that I'm not missing anything, and that I'm not bringing over redundant data. The data warehouse that I'm working of, it's an on-prem. It's an Oracle Database, and it's 15 years old. So it has extra data in it. It has things that we don't need anymore, and Io-Tahoe's helping me shake out that extra data that does not need to be moved into my S3. So it's saving me money, when I'm moving from off-prem to on-prem. >> And so that was a challenge prior, because you couldn't get the lines of business to agree what to delete, or what was the issue there? >> Oh, it was more than that. Each line of business had their own structure within the warehouse. And then they were copying data between each other, and duplicating the data and using that. So there could be possibly three tables that have the same data in it, but it's used for different lines of business. We have identified using Io-Tahoe identified over seven terabytes in the last two months on data that has just been repetitive. It's the same exact data just sitting in a different schema. And that's not easy to find, if you only understand one schema, that's reporting for that line of business. >> More bad news for the storage companies out there. (both laughs) So far. >> It's cheap. That's what we were telling people. >> And it's true, but you still would rather not waste it, you'd like to apply it to drive more revenue. And so, I guess, let's close on where you see this thing going. Again, I know you're sort of partway through the journey, maybe you could sort of describe, where you see the phase is going and really what you want to get out of this thing, down the road, mid-term, longer term, what's your vision or your data driven organization. >> I want for the bankers to be able to walk around with an iPad in their hand, and be able to access data for that customer, really fast and be able to give them the best deal that they can get. I want Webster to be right there on top with being able to add new customers, and to be able to serve our existing customers who had bank accounts since they were 12 years old there and now our multi whatever. I want them to be able to have the best experience with our bankers. >> That's awesome. That's really what I want as a banking customer. I want my bank to know who I am, anticipate my needs, and create a great experience for me. And then let me go on with my life. And so that follow. Great story. Love your experience, your background and your knowledge. I can't thank you enough for coming on theCube. >> Now, thank you very much. And you guys have a great day. >> All right, take care. And thank you for watching everybody. Keep right there. We'll take a short break and be right back. (gentle music)

Published Date : Jun 4 2020

SUMMARY :

to you by Io-Tahoe. And I'm really excited to of a regional I think and they want to move it relates to kind of transitioning And the only way to do But I want to ask you about Covid, and get the data moving And then finally, you got more clarity. and filled out the right amount. and really great support for the region, and being able to have the experience to be better. to be able to get the customers that know how to get it. and it's to optimize the banking process, and analyzing maybe the backlog was and get that data faster and the people involved, And everyone that who is working is the cultural aspects of this the ability to do search within the data. and you have this virtual cycle. and one of the questions And then I want to go back in the S3 to be able to move it Is that the right way to think about it? and to start analyzing the data flows and duplicating the data and using that. More bad news for the That's what we were telling people. and really what you want and to be able to serve And so that follow. And you guys have a great day. And thank you for watching everybody.

ENTITIES

Entity	Category	Confidence
Dave Vellante	PERSON	0.99+
Paula D'Amico	PERSON	0.99+
Paula	PERSON	0.99+
Connecticut	LOCATION	0.99+
Westchester	LOCATION	0.99+
Informatica	ORGANIZATION	0.99+
24 hours	QUANTITY	0.99+
one	QUANTITY	0.99+
13	QUANTITY	0.99+
thousands	QUANTITY	0.99+
100 hours	QUANTITY	0.99+
Bob Graham	PERSON	0.99+
iPad	COMMERCIAL_ITEM	0.99+
Webster Bank	ORGANIZATION	0.99+
GE Capital	ORGANIZATION	0.99+
first item	QUANTITY	0.99+
AWS	ORGANIZATION	0.99+
two products	QUANTITY	0.99+
seven	QUANTITY	0.99+
New York	LOCATION	0.99+
Boston	LOCATION	0.99+
three tables	QUANTITY	0.99+
Each line	QUANTITY	0.99+
first set	QUANTITY	0.99+
two days	QUANTITY	0.99+
DevOps	TITLE	0.99+
Webster bank	ORGANIZATION	0.99+
14 years	QUANTITY	0.99+
over 15 years	QUANTITY	0.99+
seven cars	QUANTITY	0.98+
each project	QUANTITY	0.98+
Friday night	DATE	0.98+
New England	LOCATION	0.98+
Io-Tahoe	ORGANIZATION	0.98+
today	DATE	0.98+
Webster's Bank	ORGANIZATION	0.98+
one schema	QUANTITY	0.97+
Fairfield County	LOCATION	0.97+
One	QUANTITY	0.97+
one customer	QUANTITY	0.97+
over seven terabytes	QUANTITY	0.97+
Salesforce	ORGANIZATION	0.96+
both	QUANTITY	0.95+
single source	QUANTITY	0.93+
one thing	QUANTITY	0.93+
US Air Force	ORGANIZATION	0.93+
Webster	ORGANIZATION	0.92+
S3	COMMERCIAL_ITEM	0.92+
Enterprise Data Architecture	ORGANIZATION	0.91+
Oracle	ORGANIZATION	0.9+
15 years old	QUANTITY	0.9+
Io-Tahoe	PERSON	0.89+
12 years old	QUANTITY	0.88+
Tableau	TITLE	0.87+
four weeks	QUANTITY	0.86+
S3 Buckets	COMMERCIAL_ITEM	0.84+
Covid	PERSON	0.81+
Data Architecture	ORGANIZATION	0.79+
JSON	TITLE	0.79+
Senior Vice President	PERSON	0.78+
24 seven development role	QUANTITY	0.77+
last 35 years	DATE	0.77+
both laughs	QUANTITY	0.75+
Io-Tahoe	TITLE	0.73+
each	QUANTITY	0.72+
loans	QUANTITY	0.71+
zero	QUANTITY	0.71+
Amazon cloud	ORGANIZATION	0.65+
last two months	DATE	0.65+

Recommend Videos

Sentiment Analysis

AWS Comprehend

Search Results for Data Architecture: