Rahul Pathak, AWS | AWS re:Invent 2020

>> From around the globe, it's theCUBE, with digital coverage of AWS re:Invent 2020, sponsored by Intel and AWS. Welcome back to theCUBE's ongoing coverage of AWS re:Invent virtual. theCUBE has gone virtual, along with most events these days (all events, really), and continues to bring you our digital coverage of re:Invent. With me is Rahul Pathak, who is the vice president of analytics at AWS. Rahul, it's great to see you again. Welcome, and thanks for joining the program.
>> Dave, great to see you too, and always a pleasure. Thanks for having me on.
>> You're very welcome. Before we get into your leadership discussion, I want to talk about some of the things that AWS has announced in the early parts of re:Invent. I want to start with Glue Elastic Views, a very notable announcement, allowing people to essentially share data across different data stores. Maybe tell us a little bit more about Glue Elastic Views, where the name came from and what the implication is.
>> Sure. So, yeah, we're really excited about Glue Elastic Views and, as you mentioned, the idea is to make it easy for customers to combine and use data from a variety of different sources and pull them together into one or many targets. And the reason for it is that we're really seeing customers adopt what we're calling a lake house architecture, which is, at its core, a data lake for making sense of data and integrating it across different silos, typically integrated with a data warehouse, and not just that, but also a range of other purpose-built stores like Aurora for relational workloads or DynamoDB for non-relational ones. And while customers typically get a lot of benefit from using purpose-built stores, because you get the best possible functionality, performance and scale for a given use case, you often want to combine data across them to get a holistic view of what's happening in your business or with your customers.
And before Glue Elastic Views, customers would have to either use ETL or data integration software, or they'd have to write custom code that could be complex to manage, and it could be error-prone and tough to change. And so, with Elastic Views, you can now use SQL to define a view across multiple data sources and pick one or many targets. And then the system will actually monitor the sources for changes and propagate them into the targets in near real time. And it manages the end-to-end pipeline and can notify operators if anything changes. And so the components of the name are pretty straightforward. Glue is our serverless ETL and data integration service, and Glue Elastic Views is about data integration. They're views because you can define these virtual tables using SQL, and elastic because it's serverless and will scale up and down to deal with the propagation of changes. So we're really excited about it, and customers are as well.
>> Okay, great. So my understanding is I'm gonna be able to take what's called, in the parlance, materialized views, which in my layperson's terms means I'm gonna run a query on the database and take that subset. And then I'm gonna be able to copy that and move it to another data store. And then you're gonna automatically keep track of the changes and keep everything up to date. Is that right?
>> Yes, that's exactly right. So you can imagine you had a product catalog, for example, that's being updated in DynamoDB, and you can create a view that will move that to Amazon Elasticsearch Service. You could search through a current version of your catalog, and we will monitor your DynamoDB tables for any changes and make sure those are all propagated in near real time. And all of that is taken care of for our customers as soon as they define the view, and they'll just be kept in sync as long as the view is in effect.
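The product catalog example above can be illustrated with a small sketch. This is not the Glue Elastic Views API; it is plain Python under stated assumptions, where the `Change` record, the `propagate` function, and the `in_stock_view` filter are all hypothetical names invented for illustration. It shows only the general idea of keeping a target in sync by applying source changes through a view definition, rather than recopying the whole source.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Change:
    """One change captured from the source (hypothetical record shape)."""
    key: str
    new_row: Optional[dict]  # None means the row was deleted in the source


def propagate(changes: list,
              view: Callable[[dict], Optional[dict]],
              target: dict) -> None:
    """Apply source changes to the target, touching only the affected keys."""
    for change in changes:
        projected = view(change.new_row) if change.new_row is not None else None
        if projected is None:
            # Row was deleted, or no longer matches the view: drop it from the target.
            target.pop(change.key, None)
        else:
            # Row was inserted or updated: upsert the projected columns.
            target[change.key] = projected


def in_stock_view(row: dict) -> Optional[dict]:
    """Hypothetical view: keep in-stock products and project two columns."""
    if row["stock"] <= 0:
        return None
    return {"name": row["name"], "price": row["price"]}


# The target stands in for a search index fed from a product catalog.
search_index: dict = {}

# Initial load: only the in-stock product reaches the target.
propagate([Change("p1", {"name": "Kettle", "price": 30, "stock": 5}),
           Change("p2", {"name": "Toaster", "price": 25, "stock": 0})],
          in_stock_view, search_index)

# Later changes: a price update and a restock.
propagate([Change("p1", {"name": "Kettle", "price": 28, "stock": 5}),
           Change("p2", {"name": "Toaster", "price": 25, "stock": 3})],
          in_stock_view, search_index)

# A delete in the source removes the row from the target.
propagate([Change("p1", None)], in_stock_view, search_index)
```

In this sketch the view is an ordinary function; in the service described in the interview, the view is defined in SQL and the monitoring of sources is handled by the system.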
>> I see this being really valuable for a person who's building what I like to think of as data services or data products that are gonna help me monetize my business. Maybe it's as simple as a dashboard, but maybe it's actually a product. It might be some content that I want to develop, and I've got transaction systems, I've got unstructured data, maybe in a NoSQL database, and I wanna actually combine those and build new products, and I want to do that quickly. So take me through what I would have to do. You sort of alluded to it with a lot of ETL, but take me through in a little bit more detail how I would do that before this innovation. And maybe you could give us a sense as to what the possibilities are with Glue Elastic Views.
>> Sure. So before we announced Elastic Views, a customer would typically have to think about using ETL software, so they'd have to write an ETL pipeline that would extract data periodically from a range of sources. They'd then have to write transformation code that would do things like match up types and make sure you didn't have any invalid values, and then you would combine it and periodically write that into a target. And so once you've got that pipeline set up, you've got to monitor it. If you see an unusual spike in data volume, you might have to add more resources to the pipeline to make it complete on time. And then, if anything changed in either the source or the destination that prevented that data from flowing in the way you would expect it, you'd have to manually figure that out and have data quality checks and all of that in place to make sure everything kept working. But with Elastic Views, it just gets much simpler. So instead of having to write custom transformation code, you write a view using SQL, and SQL is widely popular with data analysts and folks that work with data, as you well know.
And so you can define that view in SQL. The view will look across multiple sources, and then you pick your destination, and then Glue Elastic Views essentially monitors both the source for changes as well as the source and the destination for any issues like, for example, did the schema change, did the shape of the data change, is something briefly unavailable. And it can monitor all of that and can handle any errors that it can recover from automatically. Or if it can't, say someone dropped an important table in the source that was part of your view, you can actually get alerted and notified to take some action to prevent bad data from getting through your system or to prevent your pipeline from breaking without your knowledge. And then the final piece is the elasticity of it. It will automatically deal with adding more resources if, for example, say you had a spiky day in the markets, maybe you're building a financial services application, and you needed to add more resources to process those changes into your targets more quickly. The system would handle that for you. And then, if you're monetizing data services on the back end, you've got a range of options for folks subscribing to those targets. So we've got capabilities like our Amazon Data Exchange, where people can exchange and monetize data sets. So it allows this end-to-end flow in a much more straightforward way than was possible before.
>> Awesome. So a lot of automation, especially if something goes wrong. So if something goes wrong, you can automatically recover. And if, for whatever reason, you can't, what happens? You quiesce the system and let the operator know, hey, there's an issue, you gotta go fix it? How does that work?
>> Yes, exactly right. So if we can recover, say, for example, for a short period of time you can't read the target database, the system will keep trying until it can get through. But say someone dropped a column from your source.
That was a key part of your ultimate view and destination. You just can't proceed at that point. So the pipeline stops and then we notify, using APIs or an SMS alert, so that programmatic action can be taken. So this effectively provides a really great way to enforce the integrity of data that's going between the sources and the targets.
>> All right, make it kindergarten-proof. So let's talk about another innovation you guys announced, QuickSight Q, kind of speaking to the machine in my natural language. But give us some more detail there. What is QuickSight Q, and how do I interact with it? What kind of questions can I ask it?
>> So QuickSight Q is essentially a deep-learning-based semantic model of your data that allows you to ask natural language questions in your dashboard, so you'll get a search bar in your QuickSight dashboard. And QuickSight is our serverless BI service that makes it really easy to provide rich dashboards to whoever needs them in the organization. And what Q does is it's automatically developing relationships between the entities in your data, and it's able to actually reason about the questions you ask. So unlike earlier natural language systems, where you have to predefine your models and you have to predefine all the calculations that you might ask the system to do on your behalf, Q can actually figure it out. So you can say, "Show me the top five categories for sales in California," and it'll look in your data and figure out what that is, and it will present you with how it parsed that question, and then it will, inline, in seconds, pop up a dashboard of what you asked and actually automatically try and pick a chart or visualization for that data that makes sense. And you could then start to refine it further and say, "How does this compare to what happened in New York?" And it'll be able to figure out that you're trying to overlay those two data sets, and it'll add them.
And unlike other systems, it doesn't need to have all of those things predefined. It's able to reason about it because it's building a model of what your data means on the fly, and we pre-trained it across a variety of different domains, so you can ask a question about sales or HR or any of that. And another great part of Q is that when it presents to you what it's parsed, you're actually able to correct it if it needs it and provide feedback to the system. So, for example, if it got something slightly off, you could actually select from a drop-down, and then it will remember your selection for the next time, and it will get better as you use it.
>> I saw a demo in Swami's keynote on December 8. Basically, you were able to ask QuickSight Q the same question, but in different ways, like "compare California and New York," and then the data comes up, or "give me the top five," and then California, New York, the same exact data. So is that how I can check and see if the answer that I'm getting back is correct, by asking different questions? I don't have to know the schema, is what you're saying. I don't have to have knowledge of that as the user. I can triangulate from different angles and then look and see if that's correct. Is that how you verify, or are there other ways?
>> So that's one way to verify. You could definitely ask the same question a couple of different ways and ensure you're seeing the same results. I think another option would be to potentially click and drill and filter down into that data through the dashboard, and then the other step would be at data ingestion time. Typically, data pipelines will have some quality controls, but when you're interacting with Q, I think the ability to ask the question multiple ways and make sure that you're getting the same result is a perfectly reasonable way to validate.
>> You know what I like about that answer that you just gave, and I wonder if I could get your opinion on this, because you've been in this business for a while and you work with a lot of customers. If you think about our operational systems, things like sales or ERP systems, we've contextualized them. In other words, the business lines have injected context into the system. I mean, they kind of own it, if you will. They "own" the data, and I put that in quotes, but they do. They feel like they're responsible for it. There's not this constant argument, because it's their data. It seems to me that if you look back over the last 10 years, a lot of the data architecture has been sort of genericized. In other words, the experts, whether it's the data engineer or the quality engineer, don't really have the business context. But take the example that you just gave of the drill-down to verify that the answer is correct. It seems to me, just in listening again to Swami's keynote the other day, that you're really trying to put data in the hands of business users who have the context and the domain knowledge. And that seems to me to be a change in mindset that we're gonna see evolve over the next decade. I wonder if you could give me your thoughts on that change in the data architecture and data mindset.
>> David, I think you're absolutely right. I mean, we see this across all the customers that we speak with. There's an increasing desire to get data broadly distributed into the hands of the organization in a well-governed and controlled way. But customers want to give data to the folks that know what it means and know how they can take action on it to do something for the business, whether that's finding a new opportunity or looking for efficiencies.
And I think we're seeing that increasingly, especially given the unpredictability that we've all gone through in 2020. Customers are realizing that they need to get a lot more agile, and they need to get a lot more data about their business and their customers, because you've got to find ways to adapt quickly. And that's not gonna change anytime in the future.
>> And I've said many times on theCUBE that our industry, the technology industry, used to be all about the products, and in the last decade it was really platforms, whether it's SaaS platforms or AWS cloud platforms. And it seems like innovation in the coming years, in many respects, is gonna come from the ecosystem and the ability to share data. We've had some examples today. But you hit on one of the key challenges, which of course is security and governance. And can you automate that, if you will, and protect the users from doing things that violate, whether it's data access or corporate edicts for governance and compliance? How are you handling that challenge?
>> That's a great question, and it's something that I really emphasized in my leadership session. The notion of what customers are doing, and what we're seeing, is that there's the lake house architectural concept. So you've got a data lake and purpose-built stores, and customers are looking for easy data movement across those. And so we have things like Glue Elastic Views or some of the other Glue features we announced. But they're also looking for unified governance, and that's why we built AWS Lake Formation. And the idea here is that it can quickly discover and catalog customer data assets and then allow customers to define granular access policies centrally around that data. And once you have defined that, it then sets customers free to give broader access to the data, because they've put the guardrails in place. They've put the protections in place.
So you can tag columns as being private so nobody can see them, and we announced a couple of new capabilities where you can provide row-based control. So only a certain set of users can see certain rows in the data, whereas a different set of users might only be able to see a different set. And so, by creating this fine-grained but unified governance model, this actually sets customers free to give broader access to the data, because they know that their policies and compliance requirements are being met, and it gets them out of the way of the analyst or someone who can actually use the data to drive some value for the business.
>> Right. They can really focus on driving value. And I always talk about monetization. However, monetization can be a generic term. It could be saving lives, or the mission of the business or the organization. I meant to ask you about Q: can customers embed Q into their own apps?
>> Yes, absolutely. So one of QuickSight's key strengths is its embeddability. And then it's also serverless, so you can embed it at a really massive scale. And so we see customers, for example, like Blackboard, that's embedding QuickSight dashboards into information it's providing to thousands of educators, to provide data on the effectiveness of online learning, for example. And you could embed Q into that capability. So it's a really cool way to give a broad set of people the ability to ask questions of data without requiring them to be fluent in things like SQL.
>> Let me ask you a question. We've talked a little bit about data movement. I think at last year's re:Invent you guys announced RA3. I think it made general availability this year. And I remember Andy speaking about it, talking about the importance of having big enough pipes when you're moving data around. Of course, you're doing tiering.
You also announced AQUA, the Advanced Query Accelerator, which kind of brings the compute to the data, I guess, is how I would think about that, reducing that movement. But then we're talking about Glue Elastic Views, where you're copying and moving data. How are you ensuring and maintaining that maximum performance for your customers? I mean, I know it's an architectural question, but as an analytics professional, you have to be comfortable that that infrastructure is there. So what's AWS's general philosophy in that regard?
>> So there's a few ways that we think about this, and you're absolutely right. I think data volumes are going up, and we're seeing customers going from terabytes to petabytes, and even people heading into the exabyte range. There's really a need to deliver performance at scale. And the reality of customer architectures is that customers will use purpose-built systems for different best-in-class use cases. And if you're trying to do a one-size-fits-all thing, you're inevitably going to end up compromising somewhere. And so the reality is that customers will have more data. They're gonna want to get it to more people, and they're gonna want their analytics to be fast and cost-effective. And so we look at strategies to enable all of this. So, for example, Glue Elastic Views: it's about moving data, but it's about moving data efficiently. So what we do is we allow customers to define a view that represents the subset of their data they care about, and then we only look to move changes as efficiently as possible. So you're reducing the amount of data that needs to get moved and making sure it's focused on the essential. Similarly, with AQUA, what we've done, as you mentioned, is we've taken the compute down to the storage layer, and we're using our Nitro chips to help with things like compression and encryption. And then we have FPGAs inline to allow filtering and aggregation operations. So again, you're trying to quickly and effectively get through as much data as you can, so that you're only sending back what's relevant to the query that's being processed. And that again leads to more performance. If you can avoid reading a byte, you're going to speed up your queries. And that's what AQUA is trying to do. It's trying to push those operations down so that you're really reducing data as close to its origin as possible and focusing on what's essential. And that's what we're applying across our analytics portfolio. I would say one other piece we're focused on with performance is really about innovating across the stack. So you mentioned network performance. We've got 100 gigabits per second throughput now with the next-gen instances, and then with things like Graviton2 you're able to drive better price performance for customers for general-purpose workloads. So it's really innovating at all layers.
>> It's amazing to watch. I mean, it's an incredible engineering challenge as you've built this hyper-distributed system that's now, of course, going to the edge. I wanna come back to something you mentioned, and I wanna hit on your leadership session as well. But you mentioned the one-size-fits-all system. And I've asked Andy Jassy about this. I've had this discussion with many folks, and of course, you mentioned the challenges: you're gonna have to make trade-offs if it's one size fits all. The flip side of that is, okay, it's simple, the Swiss Army knife of databases, for example. But your philosophy at Amazon is you wanna have fine-grained access to the primitives, so in case the market changes, you're able to move quickly. So that puts more pressure on you to then simplify. You're not gonna build this big hairball abstraction layer. That's not what you're gonna do.
I think about layers and layers of paint. I live in a very old house. So that's not your approach. So it puts greater pressure on you to constantly listen to your customers, and they're always saying, "Hey, I want to simplify, simplify, simplify." We certainly heard that again in Swami's presentation the other day, all about minimizing complexity. So that really is your trade-off. It puts pressure on Amazon engineering to continue to raise the bar on simplification. Is that a fair statement?
>> Yeah, I think so. I mean, I think any time we can do work so our customers don't have to, that's a win for both of us, because I think we're delivering more value, and it makes it easier for our customers to get value from their data. We absolutely believe in using the right tool for the right job. And you talked about an old house. You're not gonna build or renovate a house with a Swiss Army knife. It's just the wrong tool. It might work for small projects, but you're going to need something more specialized to handle the things that matter. And that's really what we see with that set of capabilities. So we want to provide customers with the best of both worlds. We want to give them purpose-built tools so they don't have to compromise on performance or scale or functionality. And then we want to make it easy to use these together, whether it's about data movement or things like federated queries, where you can reach into each of them through a single query and through a unified governance model. So it's all about stitching those together.
>> Yeah, so far you've been on the right side of history. I think it serves you well and your customers well. I wanna come back to your leadership session. What else could you tell us about what you covered there?
>> So we've actually had a bunch of innovations on the analytics stack. So some of the highlights are in EMR, which is our managed Spark and Hadoop service. We've been able to achieve 1.7x better performance than open source with our Spark runtime. So we've invested heavily in performance, and now EMR is also available for customers who are running in a containerized environment. So we announced EMR on EKS, and then an integrated development environment and studio for EMR called EMR Studio. So we're making it easier both for people at the infrastructure layer to run EMR on their EKS environments and make it available within their organizations, but also simplifying life for data analysts and folks working with data, so they can operate in that studio and not have to mess with the details of the clusters underneath. And then a bunch of innovation in Redshift. We talked about AQUA already, but then we also announced data sharing for Redshift. So this makes it easy for Redshift clusters to share data with other clusters without putting any load on the central producer cluster. And this also speaks to the theme of simplifying getting data from point A to point B. So you could have central producer environments publishing data, which represents the source of truth, say, into other departments within the organization. And they can query the data and use it. It's always up to date, but it doesn't put any load on the producers, and that enables these really powerful data sharing and downstream data monetization capabilities like you've mentioned. In addition, like Swami mentioned in his keynote, there's Redshift ML, so you can now essentially train and run models that were built in SageMaker and optimized from within your Redshift clusters. And then we've also automated all of the performance tuning that's possible in Redshift.
So we really invested heavily in price performance, and now we've automated all of the things that make Redshift the best-in-class data warehouse service from a price performance perspective, up to three x better than others. But customers can just set Redshift to auto, and it'll handle workload management, data compression and data distribution. So we're making it easier to access all of that performance. And then the other big one was in Lake Formation. We announced three new capabilities. One is transactions, so enabling consistent ACID transactions on data lakes so you can do things like inserts and updates and deletes. We announced row-based filtering for fine-grained access control and that unified governance model, and then automated storage optimization for data lakes. So if customers are dealing with unoptimized small files that are coming off streaming systems, for example, Lake Formation can auto-compact those under the covers, and you can get a 7 to 8x performance boost. It's been a busy year for analytics.
>> I'll say. Great, great job, Rahul. Thanks so much for coming back on theCUBE and sharing the innovations, and great to see you again. And good luck in the coming year.
>> Well, thank you very much. Great to be here. Great to see you. And I hope we get to see each other in person again soon.
>> I hope so. All right. And thank you for watching, everybody. This is Dave Vellante for theCUBE. We'll be right back right after this short break.

Published Date : Dec 10 2020
