Wen Phan, Ahana & Satyam Krishna, Blinkit & Akshay Agarwal, Blinkit | AWS Startup Showcase S2 E2
(gentle music) >> Welcome everyone to theCUBE's presentation of the AWS Startup Showcase. The theme is Data as Code; The Future of Enterprise Data and Analytics. This is the season two, episode two of the ongoing series of covering the exciting startups in the AWS ecosystem around data analytics and cloud computing. I'm your host, John Furrier. Today we're joined by great guests here. Three guests. Wen Phan, who's a Director of Product Management at Ahana, Satyam Krishna, Engineering Manager at Blinkit, and we have Akshay Agarwal, Senior Engineer at Blinkit as well. We're going to get into the relationship there. Let's get into. We're going to talk about how Blinkit's using open data lake, data house with Presto on AWS. Gentlemen, thanks for joining us. >> Thanks for having us. >> So we're going to get into the deep dive on the open data lake, but I want to just quickly get your thoughts on what it is for the folks out there. Set the table. What is the open data lakehouse? Why it is important? What's in it for the customers? Why are we seeing adoption around this because this is a big story. >> Sure. Yeah, the open data lakehouse is really being able to run a gamut of analytics, whether it be BI, SQL, machine learning, data science, on top of the data lake, which is based on inexpensive, low cost, scalable storage. And more importantly, it's also on top of open formats. And this to the end customer really offers a tremendous range of flexibility. They can run a bunch of use cases on the same storage and great price performance. >> You guys have any other thoughts on what's your reaction to the lakehouse? What is your experience with it? What's going on with Blinkit? >> No, I think for us also, it has been the primary driver of how as a company we have shifted our completely delivery model from us delivering in one day to someone who is delivering in 10 minutes, right? And a lot of this was made possible by having this kind of architecture in place, which helps us to be more open-source, more... where the tools are open-source, we have an open table format which helps us be very modular in nature, meaning we can pick solutions which works best for us, right? And that is the kind of architecture that we want to be in. >> Awesome. Wen, you know last time we chat with Ahana, we had a great conversation around Presto, data. The theme of this episode is Data as Code, which is interesting because in all the conversations in these episodes all around developers, which administrators are turning into developers, there's a developer vibe with data. And with opensource, it's software. Now you've got data taking a similar trajectory as how software development was with code, but the people running data they're not developers, they're administrators, they're operators. Now they're turning into DataOps. So it's kind of a similar vibe going on with branches and taking stuff out of and putting it back in, and testing it. Datasets becoming much more stable, iterating on machine learning algorithm. This is a movement. What's your guys reaction before we get into the relationships here with you guys. But, what's your reaction to this Data as Code movement? >> Yeah, so I think the folks at Blinkit are doing a great job there. I mean, they have a pretty compact data engineering team and they have some pretty stringent SLAs, as well as in terms of time to value and reliability. And what that ultimately translates for them is not only flexibility but reliability. So they've done some very fantastic work on a lot of automation, a lot of integration with code, and their data pipelines. And I'm sure they can give the details on that. >> Yes. Satyam and Akshay, you guys are engineers' software, but this is becoming a whole another paradigm where the frontline coding and or work or engineer data engineering is implementing the operations as well. It's kind of like DevOps for data. >> For sure. Right. And I think whenever you're working, even as a software engineer, the understanding of business is equally important. You cannot be working on something and be away from business, right? And that's where, like I mentioned earlier, when we realized that we have to completely move our stack and start giving analytics at 10 minutes, right. Because when you're delivering in 10 minutes, your leaders want to take decisions in your real-time. That means you need to move with them. You need to move with business. And when you do that, the kind of flexibility these softwares give is what enables the businesses at the end of the day. >> Awesome. This is the really kind of like, is there going to be a book called agile data warehouses? I don't think so. >> I think so. (laughing) >> The agile cloud data. This is cool. So let's get into what you guys do. What is Blinkit up to? What do you guys do? Can you take a minute to explain the company and your product? >> Sure. I'll take that. So Blinkit is India's biggest 10 minute delivery platform. It pioneered the delivery model in the country with over 10 million Indian shopping on our platform, ranging from everything: grocery staples, vegetables, emergency services, electronics, and much more, right. It currently delivers over 200,000 orders every day, and is in a hurry to bring the future of farmers to everyone in India. >> What's the relationship with Ahana and Blinkit? Wen, what's the tie in? >> Yeah, so Blinkit had a pretty well formed stack. They needed a little bit more flexibility and control. They thought a managed service was the way to go. And here at Ahana, we provide a SaaS managed service for Presto. So they engaged us and they evaluated our offering. And more importantly, we're able to partner. As a early stage startup, we really rely on very strong partners with great use cases that are willing to collaborate. And the folks at Blinkit have been really great in helping us push our product, develop our product. And we've been very happy about the value that we've been able to deliver to them as well. >> Okay. So let's unpack the open data lakehouse. What is it? What's under the covers? Let's get into it. >> Sure. So if bring up a slide. Like I said before, it's really a paradigm on being able to run a gamut of analytics on top of the open data lake. So what does that mean? How did it come about? So on the left hand side of the slide, we are coming out of this world where for the last several decades, the primary workhorse for SQL based processing and reporting and dashboarding use cases was really the data warehouse. And what we're seeing is a shift due to the trends in inexpensive scalable storage, cloud storage. The proliferation of open formats to facilitate using this storage to get certain amounts of reliability and performance, and the adoption of frameworks that can operate on top of this cloud data lake. So while here at Ahana, we're primarily focused on SQL workloads and Presto, this architecture really allows for other types of frameworks. And you see the ML and AI side. And like to Satyam's point earlier, offers a great amount of flexibility modularity for many use cases in the cloud. So really, that's really the lakehouse, and people like it for the performance, the openness, and the price performance. >> How's the open-source open side of it playing in the open-source? It's kind of open formats. What is the open-source angle on this because there's a lot of different approaches. I'm hearing open formats. You know, you have data stores which are a big part of seeing that. You got SQL, you mentioned SQL. There's got a mishmash of opportunities. Is it all coexisting? Is it one tool to rule the world or is it interchangeable? What's the open-source angle? >> There's multiple angles and I'll let definitely Satyam add to what I'm saying. This was definitely a big piece for Blinkit. So on one hand, you have the open formats. And what really the open formats enable is multiple compute engines to work on that data. And that's very huge. 'Cause it's open, you're not locked in. I think the other part of open that is important and I think it was important to Blinkit was the governance around that. So in particular Presto is governed by the Linux Foundation. And so, as a customer of open-source technology, they want some assurances for things like how's it governed? Is the license going to change? So there's that aspect of openness that I think is very important. >> Yeah. Blinkit, what's the data strategy here with lakehouse and you guys? Why are you adopting this type of architecture? >> So adding to what... Yeah, I think adding to Wen said, right. When we are thinking in terms of all these OpenStacks, you have got these open table formats, everything which is deployed over cloud, the primary reason there is modularity. It's as simple as that, right. You can plug and play so many different table formats from one thing to another based on the use case that you're trying to serve, so that you get the most value out of data. Right? I'll give you a very simple example. So for us we use... not even use one single table format. It's not that one thing solves for everything, right? We use both Hudi and Iceberg to solve for different use cases. One is good for when you're working for a certain data site. Icebergs works well when you're in the SQL kind of interface, right. Hudi's still trying to reach there. It's going to go there very soon. So having the ability to plug and play different formats based on the use case helps you to grow faster, helps you to take decisions faster because you now you're not stuck on one thing. They will have to implement it. Right. So I think that's what it is great about this data lake strategy. Keeping yourself cost effective. Yeah, please. >> So the enablement is basically use case driven. You don't have to be rearchitecturing for use cases. You can simply plug can play based on what you need for the use case. >> Yeah. You can... and again, you can focus on your business use case. You can figure out what your business users need and not worry about these things because that's where Presto comes in, helps you stitch that data together with multiple data formats, give you the performance that you need and it works out the best there. And that's something that you don't get to with traditional warehouse these days. Right? The kind of thing that we need, you don't get that. >> I do want to add. This is just to riff on what Satyam said. I think it's pretty interesting. So, it really allowed him to take the best-of-breed of what he was seeing in the community, right? So in the case of table formats, you've got Delta, you've got Hudi, you've got Iceberg, and they all have got their own roadmap and it's kind of organic of how these different communities want to evolve, and I think that's great, but you have these end consumers like Blinkit who have different maybe use cases overlapping, and they're not forced to pick one. When you have an open architecture, they can really put together best-of-breed. And as these projects evolve, they can continue to monitor it and then make decisions and continue to remain agile based on the landscape and how it's evolving. >> So the agility is a key point. Flexibility and agility, and time to valuing with your data. >> Yeah. >> All right. Wen, I got to get in to why the Presto is important here. Where does that fit in? Why is Presto important? >> Yeah. For me, it all comes down to the use cases and the needs. And reporting and dashboarding is not going to go away anytime soon. It's a very common use case. Many of our customers like Blinkit come to us for that use case. The difference now is today, people want to do that particular use case on top of the modern data lake, on top of scalable, inexpensive, low cost storage. Right? In addition to that, there's a need for this low latency interactive ability to engage with the data. This is often arises when you need to do things in a ad hoc basis or you're in the developmental phase of building things up. So if that's what your need is. And latency's important and getting your arms around the problems, very important. You have a certain SLA, I need to deliver something. That puts some requirements in the technology. And Presto is a perfect for that ideal use case. It's ideal for that use case. It's distributed, it's scalable, it's in memory. And so it's able to really provide that. I think the other benefit for Presto and why we're bidding on Presto is it works well on the data lakes, but you have to think about how are these organizations maturing with this technology. So it's not necessarily an all or nothing. You have organizations that have maybe the data lake and it's augmented with other analytical data stores like Snowflake or Redshift. So Presto also... a core aspect is its ability to federate or connect and query across different data sources. So this can be a permanent thing. This could also be a transitionary thing. We have some customers that are moving and slowly shifting their data portfolio from maybe all data warehouse into 80% data lake. But it gives that optionality, it gives that ability to transition over a timeframe. But for all those reasons, the latency, the scalability, the federation, is why Presto for this particular use case. >> And you can connect with other databases. It can be purpose built database, could be whatever. Right? >> Sure. Yes, yes. Presto has a very pluggable architecture. >> Okay. Here's the question for the Blinkit team? Why did you choose Presto and what led you to Ahana? >> So I'll take this better, over this what Presto sits well in that reach is, is how it is designed. Like basically, Presto decouples your storage with the compute. Basically like, people can use any storage and Presto just works as a query engine for them. So basically, it has a constant connectors where you can connect with a real-time databases like Pinot or a Druid, along with your warehouses like Redshift, along with your data lake that's like based on Hudi or Iceberg. So it's like a very landscape that you can use with the Presto. And consumers like the analytics doesn't need to learn the SQL or different paradigms of the querying for different sources. They just need to learn a single source. And, they get a single place to consume from. They get a single consumer on their single destination to write on also. So, it's a homologous architecture, which allows you to put a central security like which Presto integrates. So it's also based on open architecture, that's Apache engine. And it has also certain innovative features that you can see based on caching, which reduces a lot of the cost. And since you have further decoupled your storage with the compute, you can further reduce your cost, because now the biggest part of our tradition warehouse is a storage. And the cost goes massively upwards with the amount of data that you've added. Like basically, each time that you add more data, you require more storage, and warehouses ask you to write the data in their own format. Over here since we have decoupled that, the storage cost have gone down. It's literally that your cost that you are writing, and you just pay for the compute, and you can scale in scale out based on the requirements. If you have high traffic, you scale out. If you have low traffic, you scale in. So all those. >> So huge cost savings. >> Yeah. >> Yeah. Cost effectiveness, for sure. >> Cost effectiveness and you get a very good price value out of it. Like for each query, you can estimate what's the cost for you based on that tracking and all those things. >> I mean, if you think about the other classic Iceberg and what's under the water you don't know, it's the hidden cost. You think about the tooling, right, and also, time it takes to do stuff. So if you have flexibility on choice, when we were riffing on this last time we chatted with you guys and you brought it up earlier around, you can have the open formats to have different use cases in different tools or different platforms to work on it. Redshift, you can use Redshift here, or use something over there. You don't have to get locking >> Absolutely. >> Satyam & Akshay: Yeah. >> Locking is a huge problem. How do you guys see that 'cause sounds like here there's not a lot of locking. You got the open formats, and you got choice. >> Yeah. So you get best of the both worlds. Like you get with Ahana or with the Presto, you can get the best of the both worlds. Since it's cloud native, you can easily deploy your clusters very easily within like five minutes. Your cluster is up, you can start working on it. You can deploy multiple clusters for multiple teams. You get also flexibility of adding new connectors since it's open and further it's also much more secure since it's based on cloud native. So basically, you can control your security endpoints very well. So all those things comes in together with this architecture. So you can definitely go more on the lakehouse architecture than warehousing when you want to deliver data value faster. And basically, you get the much more high value out of your data in a sorted template. >> So Satyam, it sounds like the old warehousing was like the application person, not a lot of usage, old, a lot of latency. Okay. Here and there. But now you got more speed to deploy clusters, scale up scale down. Application developers are as everyone. It's not one person. It's not one group. It's whenever you want. So, you got speed. You got more diversity in the data opportunities, and your coding. >> Yeah. I think data warehouses are a way to start for every organization who is getting into data. I don't think data warehousing is still a solution and will be a solution for a lot of teams which are still getting into data. But as soon as you start scaling, as you start seeing the cost going up, as you start seeing the number of use cases adding up, having an open format definitely helps. So, I would say that's where we are also heading into and that's how our journey as well started with Presto as well, why we even thought about Ahana, right. >> (John chuckles) >> So, like you mentioned, one of the things that happened was as we were moving to the lakehouse and the open table format, I think Ahana is one of the first ones in the market to have Hudi as a first class citizen completely supported with all the things which are not even present at the time of... even with Presto, right. So we see Ahana working behind the scenes, improving even some of the things already over the open-source ecosystem. And that's where we get the most value out of Ahana as well. >> This is the convergence of open-source magic and commercialization. Wen, because you think about Data as Code, reminds me, I hear, "Data warehouse, it's not going to go away." But you got cloud scale or scale. It reminds me of the old, "Oh yeah, I have a data center." Well, here comes the cloud. So, doesn't really kill the data center, although Amazon would say that the data center's going to be eliminated. No, you just use it for whatever you need it for. You use it for specific use cases, but everyone, all the action goes to the cloud for scale. The same things happen with data, and look at the open-source community. It's kind of coming together. Data as Code is coming together. >> Yeah, absolutely. >> Absolutely. >> I do want to again to connect on another dot in terms of cost and that. You know, we've been talking a little bit about price performance, but there's an implicit cost, and I think this was also very important to Blinkit, and also why we're offering a managed service. So one piece of it. And it really revolves around the people, right? So outside of the technology, the performance. One thing that Akshay brought up and it's another important piece that I should have highlighted a little bit more is, Presto exposes the ability to interact your data in a widely adopted way, which is basically ANSI SQL. So the ability for your practitioners to use this technology is huge. That's just regular Presto. In terms of a managed service, the guys at Blinkit are a great high performing team, but they have to be very efficient with their time and what they manage. And what we're trying to do is provide leverage for them. So take a lot of the heavy lifting away, but at the same time, figuring out the right things to expose so that they have that same flexibility. And that's been the balancing point that we've been trying to balance at Ahana, but that goes back to cost. How do I total cost of ownership? And that not doesn't include just the actual querying processing time, but the ability for the organization to go ahead and absorb the solution. And what does it cost in terms of the people involved? >> Yeah. Great conversation. I mean, this brings up the question of back in the data center, the cloud days, you had the concept of an SRE, which is now popular, site reliability engineer. One person does all the clusters and manages all the scale. Is the data engineer the new SRE for data? Are we seeing a similar trajectory? Just want to get your reaction. What do you guys think? >> Yes, so I would say, definitely. It depends on the teams and the sizes of that. We are high performing team so each automation takes bits on the pieces of the architecture, like where they want to invest in. And it comes out with the value of the engineer's time and basically like how much they can invest in, how much they need to configure the architecture, and how much time it'll take to time to market. So basically like, this is what I would also highlight as an engineer. I found Ahana like the... I would say as a Presto in a cloud native environment, or I think so there's the one in the market that seamlessly scales and then scales out. And further, with a team of us, I would say our team size like three to four engineers managing cluster day in day out, conferring, tuning and all those things takes a lot of time. And Ahana came in and takes it off our plate and the hands in a solution which works out of box. So that's where this comes in. Ahana it's also based on open-source community. >> So the time of the engineer's time is so valuable. >> Yeah. >> My take on it really in terms of the data engineering being the SRE. I think that can work, it depends on the actual person, and we definitely try to make the process as easy as possible. I think in Blinkit's case, you guys are... There are data platform owners, but they definitely are aware of the pipelines. >> John: Yeah. >> So they have very intimate knowledge of what data engineers do, but I think in their case, you guys, you're managing a ton of systems. So it's not just even Presto. They have a ton of systems and surfacing that interface so they can cater to all the data engineers across their data systems, I think is the big need for them. I know you guys you want to chime in. I mean, we've seen the architecture and things like that. I think you guys did an amazing job there. >> So, and to adding to Wen's point, right. Like I generally think what DevOps is to the tech team. I think, what is data engineer or the data teams are to the data organization, right? Like they play a very similar role that you have to act as a guardrail to ensure that everyone has access to the data so the democratizing and everything is there, but that has to also come with security, right? And when you do that, there are (indistinct) a lot of points where someone can interact with data. We have... And again, there's a mixed match of open-source tools that works well, as well. And there are some paid tools as well. So for us like for visualization, we use Redash for our ad hoc analysis. And we use Tableau as well whenever we want to give a very concise reporting. We have Jupyter notebooks in place and we have EMRs as well. So we always have a mixed batch of things where people can interact with data. And most of our time is spent in acting as that guardrail to ensure that everyone should have access to data, but it shouldn't be exploited, right. And I think that's where we spend most of our time in. >> Yeah. And I think the time is valuable, but that your point about the democratization aspect of it, there seems to be a bigger step function value that you're enabling and needs to be talked out. The 10x engineer, it's more like 50x, right? If you get it done right, the enablement downstream at the scale that we're seeing with this new trend is significant. It's not just, oh yeah, visualization and get some data quicker, there's actually real advantages on a multiple with that engineering. So, and we saw that with DevOps, right? Like, you do this right and then magic happens on the edges. So, yeah, it's interesting. You guys, congratulations. Great environment. Thanks for sharing the insight Blinkit. Wen, great to see you. Ahana again with Presto, congratulations. The open-source meets data engineering. Thanks so much. >> Thanks, John. >> Appreciate it. >> Okay. >> Thanks John. >> Thanks. >> Thanks for having us. >> This season two, episode two of our ongoing series. This one is Data as Code. This is theCUBE. I'm John furrier. Thanks for watching. (gentle music)
SUMMARY :
This is the season two, episode What is the open data lakehouse? And this to the end customer And that is the kind of into the relationships here with you guys. give the details on that. is implementing the operations as well. You need to move with business. This is the really kind of like, I think so. So let's get into what you guys do. and is in a hurry to bring And the folks at Blinkit the open data lakehouse. So on the left hand side of the slide, What is the open-source angle on this Is the license going to change? with lakehouse and you guys? So having the ability to plug So the enablement is and again, you can focus So in the case of table formats, So the agility is a key point. Wen, I got to get in and the needs. And you can connect Presto has a very pluggable architecture. and what led you to Ahana? And consumers like the analytics and you get a very good and also, time it takes to do stuff. and you got choice. best of the both worlds. like the old warehousing as you start seeing the cost going up, and the open table format, the data center's going to be eliminated. figuring out the right things to expose and manages all the scale. and the sizes of that. So the time of the it depends on the actual person, I think you guys did an amazing job there. So, and to adding Thanks for sharing the insight Blinkit. This is theCUBE.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
John Furrier | PERSON | 0.99+ |
Wen Phan | PERSON | 0.99+ |
Akshay Agarwal | PERSON | 0.99+ |
John | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Ahana | PERSON | 0.99+ |
India | LOCATION | 0.99+ |
Blinkit | ORGANIZATION | 0.99+ |
Satyam Krishna | PERSON | 0.99+ |
Linux Foundation | ORGANIZATION | 0.99+ |
Ahana | ORGANIZATION | 0.99+ |
five minutes | QUANTITY | 0.99+ |
Akshay | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
10 minutes | QUANTITY | 0.99+ |
Three guests | QUANTITY | 0.99+ |
Satyam | PERSON | 0.99+ |
Blinkit | PERSON | 0.99+ |
one day | QUANTITY | 0.99+ |
10 minute | QUANTITY | 0.99+ |
Redshift | TITLE | 0.99+ |
both worlds | QUANTITY | 0.99+ |
over 200,000 orders | QUANTITY | 0.99+ |
Presto | PERSON | 0.99+ |
over 10 million | QUANTITY | 0.99+ |
SQL | TITLE | 0.99+ |
10x | QUANTITY | 0.99+ |
Wen | PERSON | 0.98+ |
50x | QUANTITY | 0.98+ |
agile | TITLE | 0.98+ |
one piece | QUANTITY | 0.98+ |
both | QUANTITY | 0.98+ |
three | QUANTITY | 0.98+ |
today | DATE | 0.98+ |
one | QUANTITY | 0.98+ |
single destination | QUANTITY | 0.97+ |
One person | QUANTITY | 0.97+ |
each time | QUANTITY | 0.96+ |
each | QUANTITY | 0.96+ |
Presto | ORGANIZATION | 0.96+ |
one person | QUANTITY | 0.96+ |
single source | QUANTITY | 0.96+ |
Tableau | TITLE | 0.96+ |
one tool | QUANTITY | 0.96+ |
Icebergs | ORGANIZATION | 0.96+ |
Today | DATE | 0.95+ |
One | QUANTITY | 0.95+ |
one thing | QUANTITY | 0.95+ |