Wen Phan, Ahana & Satyam Krishna, Blinkit & Akshay Agarwal, Blinkit | AWS Startup Showcase S2 E2

(gentle music) >> Welcome everyone to theCUBE's presentation of the AWS Startup Showcase. The theme is Data as Code; The Future of Enterprise Data and Analytics. This is the season two, episode two of the ongoing series of covering the exciting startups in the AWS ecosystem around data analytics and cloud computing. I'm your host, John Furrier. Today we're joined by great guests here. Three guests. Wen Phan, who's a Director of Product Management at Ahana, Satyam Krishna, Engineering Manager at Blinkit, and we have Akshay Agarwal, Senior Engineer at Blinkit as well. We're going to get into the relationship there. Let's get into. We're going to talk about how Blinkit's using open data lake, data house with Presto on AWS. Gentlemen, thanks for joining us. >> Thanks for having us. >> So we're going to get into the deep dive on the open data lake, but I want to just quickly get your thoughts on what it is for the folks out there. Set the table. What is the open data lakehouse? Why it is important? What's in it for the customers? Why are we seeing adoption around this because this is a big story. >> Sure. Yeah, the open data lakehouse is really being able to run a gamut of analytics, whether it be BI, SQL, machine learning, data science, on top of the data lake, which is based on inexpensive, low cost, scalable storage. And more importantly, it's also on top of open formats. And this to the end customer really offers a tremendous range of flexibility. They can run a bunch of use cases on the same storage and great price performance. >> You guys have any other thoughts on what's your reaction to the lakehouse? What is your experience with it? What's going on with Blinkit? >> No, I think for us also, it has been the primary driver of how as a company we have shifted our completely delivery model from us delivering in one day to someone who is delivering in 10 minutes, right? And a lot of this was made possible by having this kind of architecture in place, which helps us to be more open-source, more... where the tools are open-source, we have an open table format which helps us be very modular in nature, meaning we can pick solutions which works best for us, right? And that is the kind of architecture that we want to be in. >> Awesome. Wen, you know last time we chat with Ahana, we had a great conversation around Presto, data. The theme of this episode is Data as Code, which is interesting because in all the conversations in these episodes all around developers, which administrators are turning into developers, there's a developer vibe with data. And with opensource, it's software. Now you've got data taking a similar trajectory as how software development was with code, but the people running data they're not developers, they're administrators, they're operators. Now they're turning into DataOps. So it's kind of a similar vibe going on with branches and taking stuff out of and putting it back in, and testing it. Datasets becoming much more stable, iterating on machine learning algorithm. This is a movement. What's your guys reaction before we get into the relationships here with you guys. But, what's your reaction to this Data as Code movement? >> Yeah, so I think the folks at Blinkit are doing a great job there. I mean, they have a pretty compact data engineering team and they have some pretty stringent SLAs, as well as in terms of time to value and reliability. And what that ultimately translates for them is not only flexibility but reliability. So they've done some very fantastic work on a lot of automation, a lot of integration with code, and their data pipelines. And I'm sure they can give the details on that. >> Yes. Satyam and Akshay, you guys are engineers' software, but this is becoming a whole another paradigm where the frontline coding and or work or engineer data engineering is implementing the operations as well. It's kind of like DevOps for data. >> For sure. Right. And I think whenever you're working, even as a software engineer, the understanding of business is equally important. You cannot be working on something and be away from business, right? And that's where, like I mentioned earlier, when we realized that we have to completely move our stack and start giving analytics at 10 minutes, right. Because when you're delivering in 10 minutes, your leaders want to take decisions in your real-time. That means you need to move with them. You need to move with business. And when you do that, the kind of flexibility these softwares give is what enables the businesses at the end of the day. >> Awesome. This is the really kind of like, is there going to be a book called agile data warehouses? I don't think so. >> I think so. (laughing) >> The agile cloud data. This is cool. So let's get into what you guys do. What is Blinkit up to? What do you guys do? Can you take a minute to explain the company and your product? >> Sure. I'll take that. So Blinkit is India's biggest 10 minute delivery platform. It pioneered the delivery model in the country with over 10 million Indian shopping on our platform, ranging from everything: grocery staples, vegetables, emergency services, electronics, and much more, right. It currently delivers over 200,000 orders every day, and is in a hurry to bring the future of farmers to everyone in India. >> What's the relationship with Ahana and Blinkit? Wen, what's the tie in? >> Yeah, so Blinkit had a pretty well formed stack. They needed a little bit more flexibility and control. They thought a managed service was the way to go. And here at Ahana, we provide a SaaS managed service for Presto. So they engaged us and they evaluated our offering. And more importantly, we're able to partner. As a early stage startup, we really rely on very strong partners with great use cases that are willing to collaborate. And the folks at Blinkit have been really great in helping us push our product, develop our product. And we've been very happy about the value that we've been able to deliver to them as well. >> Okay. So let's unpack the open data lakehouse. What is it? What's under the covers? Let's get into it. >> Sure. So if bring up a slide. Like I said before, it's really a paradigm on being able to run a gamut of analytics on top of the open data lake. So what does that mean? How did it come about? So on the left hand side of the slide, we are coming out of this world where for the last several decades, the primary workhorse for SQL based processing and reporting and dashboarding use cases was really the data warehouse. And what we're seeing is a shift due to the trends in inexpensive scalable storage, cloud storage. The proliferation of open formats to facilitate using this storage to get certain amounts of reliability and performance, and the adoption of frameworks that can operate on top of this cloud data lake. So while here at Ahana, we're primarily focused on SQL workloads and Presto, this architecture really allows for other types of frameworks. And you see the ML and AI side. And like to Satyam's point earlier, offers a great amount of flexibility modularity for many use cases in the cloud. So really, that's really the lakehouse, and people like it for the performance, the openness, and the price performance. >> How's the open-source open side of it playing in the open-source? It's kind of open formats. What is the open-source angle on this because there's a lot of different approaches. I'm hearing open formats. You know, you have data stores which are a big part of seeing that. You got SQL, you mentioned SQL. There's got a mishmash of opportunities. Is it all coexisting? Is it one tool to rule the world or is it interchangeable? What's the open-source angle? >> There's multiple angles and I'll let definitely Satyam add to what I'm saying. This was definitely a big piece for Blinkit. So on one hand, you have the open formats. And what really the open formats enable is multiple compute engines to work on that data. And that's very huge. 'Cause it's open, you're not locked in. I think the other part of open that is important and I think it was important to Blinkit was the governance around that. So in particular Presto is governed by the Linux Foundation. And so, as a customer of open-source technology, they want some assurances for things like how's it governed? Is the license going to change? So there's that aspect of openness that I think is very important. >> Yeah. Blinkit, what's the data strategy here with lakehouse and you guys? Why are you adopting this type of architecture? >> So adding to what... Yeah, I think adding to Wen said, right. When we are thinking in terms of all these OpenStacks, you have got these open table formats, everything which is deployed over cloud, the primary reason there is modularity. It's as simple as that, right. You can plug and play so many different table formats from one thing to another based on the use case that you're trying to serve, so that you get the most value out of data. Right? I'll give you a very simple example. So for us we use... not even use one single table format. It's not that one thing solves for everything, right? We use both Hudi and Iceberg to solve for different use cases. One is good for when you're working for a certain data site. Icebergs works well when you're in the SQL kind of interface, right. Hudi's still trying to reach there. It's going to go there very soon. So having the ability to plug and play different formats based on the use case helps you to grow faster, helps you to take decisions faster because you now you're not stuck on one thing. They will have to implement it. Right. So I think that's what it is great about this data lake strategy. Keeping yourself cost effective. Yeah, please. >> So the enablement is basically use case driven. You don't have to be rearchitecturing for use cases. You can simply plug can play based on what you need for the use case. >> Yeah. You can... and again, you can focus on your business use case. You can figure out what your business users need and not worry about these things because that's where Presto comes in, helps you stitch that data together with multiple data formats, give you the performance that you need and it works out the best there. And that's something that you don't get to with traditional warehouse these days. Right? The kind of thing that we need, you don't get that. >> I do want to add. This is just to riff on what Satyam said. I think it's pretty interesting. So, it really allowed him to take the best-of-breed of what he was seeing in the community, right? So in the case of table formats, you've got Delta, you've got Hudi, you've got Iceberg, and they all have got their own roadmap and it's kind of organic of how these different communities want to evolve, and I think that's great, but you have these end consumers like Blinkit who have different maybe use cases overlapping, and they're not forced to pick one. When you have an open architecture, they can really put together best-of-breed. And as these projects evolve, they can continue to monitor it and then make decisions and continue to remain agile based on the landscape and how it's evolving. >> So the agility is a key point. Flexibility and agility, and time to valuing with your data. >> Yeah. >> All right. Wen, I got to get in to why the Presto is important here. Where does that fit in? Why is Presto important? >> Yeah. For me, it all comes down to the use cases and the needs. And reporting and dashboarding is not going to go away anytime soon. It's a very common use case. Many of our customers like Blinkit come to us for that use case. The difference now is today, people want to do that particular use case on top of the modern data lake, on top of scalable, inexpensive, low cost storage. Right? In addition to that, there's a need for this low latency interactive ability to engage with the data. This is often arises when you need to do things in a ad hoc basis or you're in the developmental phase of building things up. So if that's what your need is. And latency's important and getting your arms around the problems, very important. You have a certain SLA, I need to deliver something. That puts some requirements in the technology. And Presto is a perfect for that ideal use case. It's ideal for that use case. It's distributed, it's scalable, it's in memory. And so it's able to really provide that. I think the other benefit for Presto and why we're bidding on Presto is it works well on the data lakes, but you have to think about how are these organizations maturing with this technology. So it's not necessarily an all or nothing. You have organizations that have maybe the data lake and it's augmented with other analytical data stores like Snowflake or Redshift. So Presto also... a core aspect is its ability to federate or connect and query across different data sources. So this can be a permanent thing. This could also be a transitionary thing. We have some customers that are moving and slowly shifting their data portfolio from maybe all data warehouse into 80% data lake. But it gives that optionality, it gives that ability to transition over a timeframe. But for all those reasons, the latency, the scalability, the federation, is why Presto for this particular use case. >> And you can connect with other databases. It can be purpose built database, could be whatever. Right? >> Sure. Yes, yes. Presto has a very pluggable architecture. >> Okay. Here's the question for the Blinkit team? Why did you choose Presto and what led you to Ahana? >> So I'll take this better, over this what Presto sits well in that reach is, is how it is designed. Like basically, Presto decouples your storage with the compute. Basically like, people can use any storage and Presto just works as a query engine for them. So basically, it has a constant connectors where you can connect with a real-time databases like Pinot or a Druid, along with your warehouses like Redshift, along with your data lake that's like based on Hudi or Iceberg. So it's like a very landscape that you can use with the Presto. And consumers like the analytics doesn't need to learn the SQL or different paradigms of the querying for different sources. They just need to learn a single source. And, they get a single place to consume from. They get a single consumer on their single destination to write on also. So, it's a homologous architecture, which allows you to put a central security like which Presto integrates. So it's also based on open architecture, that's Apache engine. And it has also certain innovative features that you can see based on caching, which reduces a lot of the cost. And since you have further decoupled your storage with the compute, you can further reduce your cost, because now the biggest part of our tradition warehouse is a storage. And the cost goes massively upwards with the amount of data that you've added. Like basically, each time that you add more data, you require more storage, and warehouses ask you to write the data in their own format. Over here since we have decoupled that, the storage cost have gone down. It's literally that your cost that you are writing, and you just pay for the compute, and you can scale in scale out based on the requirements. If you have high traffic, you scale out. If you have low traffic, you scale in. So all those. >> So huge cost savings. >> Yeah. >> Yeah. Cost effectiveness, for sure. >> Cost effectiveness and you get a very good price value out of it. Like for each query, you can estimate what's the cost for you based on that tracking and all those things. >> I mean, if you think about the other classic Iceberg and what's under the water you don't know, it's the hidden cost. You think about the tooling, right, and also, time it takes to do stuff. So if you have flexibility on choice, when we were riffing on this last time we chatted with you guys and you brought it up earlier around, you can have the open formats to have different use cases in different tools or different platforms to work on it. Redshift, you can use Redshift here, or use something over there. You don't have to get locking >> Absolutely. >> Satyam & Akshay: Yeah. >> Locking is a huge problem. How do you guys see that 'cause sounds like here there's not a lot of locking. You got the open formats, and you got choice. >> Yeah. So you get best of the both worlds. Like you get with Ahana or with the Presto, you can get the best of the both worlds. Since it's cloud native, you can easily deploy your clusters very easily within like five minutes. Your cluster is up, you can start working on it. You can deploy multiple clusters for multiple teams. You get also flexibility of adding new connectors since it's open and further it's also much more secure since it's based on cloud native. So basically, you can control your security endpoints very well. So all those things comes in together with this architecture. So you can definitely go more on the lakehouse architecture than warehousing when you want to deliver data value faster. And basically, you get the much more high value out of your data in a sorted template. >> So Satyam, it sounds like the old warehousing was like the application person, not a lot of usage, old, a lot of latency. Okay. Here and there. But now you got more speed to deploy clusters, scale up scale down. Application developers are as everyone. It's not one person. It's not one group. It's whenever you want. So, you got speed. You got more diversity in the data opportunities, and your coding. >> Yeah. I think data warehouses are a way to start for every organization who is getting into data. I don't think data warehousing is still a solution and will be a solution for a lot of teams which are still getting into data. But as soon as you start scaling, as you start seeing the cost going up, as you start seeing the number of use cases adding up, having an open format definitely helps. So, I would say that's where we are also heading into and that's how our journey as well started with Presto as well, why we even thought about Ahana, right. >> (John chuckles) >> So, like you mentioned, one of the things that happened was as we were moving to the lakehouse and the open table format, I think Ahana is one of the first ones in the market to have Hudi as a first class citizen completely supported with all the things which are not even present at the time of... even with Presto, right. So we see Ahana working behind the scenes, improving even some of the things already over the open-source ecosystem. And that's where we get the most value out of Ahana as well. >> This is the convergence of open-source magic and commercialization. Wen, because you think about Data as Code, reminds me, I hear, "Data warehouse, it's not going to go away." But you got cloud scale or scale. It reminds me of the old, "Oh yeah, I have a data center." Well, here comes the cloud. So, doesn't really kill the data center, although Amazon would say that the data center's going to be eliminated. No, you just use it for whatever you need it for. You use it for specific use cases, but everyone, all the action goes to the cloud for scale. The same things happen with data, and look at the open-source community. It's kind of coming together. Data as Code is coming together. >> Yeah, absolutely. >> Absolutely. >> I do want to again to connect on another dot in terms of cost and that. You know, we've been talking a little bit about price performance, but there's an implicit cost, and I think this was also very important to Blinkit, and also why we're offering a managed service. So one piece of it. And it really revolves around the people, right? So outside of the technology, the performance. One thing that Akshay brought up and it's another important piece that I should have highlighted a little bit more is, Presto exposes the ability to interact your data in a widely adopted way, which is basically ANSI SQL. So the ability for your practitioners to use this technology is huge. That's just regular Presto. In terms of a managed service, the guys at Blinkit are a great high performing team, but they have to be very efficient with their time and what they manage. And what we're trying to do is provide leverage for them. So take a lot of the heavy lifting away, but at the same time, figuring out the right things to expose so that they have that same flexibility. And that's been the balancing point that we've been trying to balance at Ahana, but that goes back to cost. How do I total cost of ownership? And that not doesn't include just the actual querying processing time, but the ability for the organization to go ahead and absorb the solution. And what does it cost in terms of the people involved? >> Yeah. Great conversation. I mean, this brings up the question of back in the data center, the cloud days, you had the concept of an SRE, which is now popular, site reliability engineer. One person does all the clusters and manages all the scale. Is the data engineer the new SRE for data? Are we seeing a similar trajectory? Just want to get your reaction. What do you guys think? >> Yes, so I would say, definitely. It depends on the teams and the sizes of that. We are high performing team so each automation takes bits on the pieces of the architecture, like where they want to invest in. And it comes out with the value of the engineer's time and basically like how much they can invest in, how much they need to configure the architecture, and how much time it'll take to time to market. So basically like, this is what I would also highlight as an engineer. I found Ahana like the... I would say as a Presto in a cloud native environment, or I think so there's the one in the market that seamlessly scales and then scales out. And further, with a team of us, I would say our team size like three to four engineers managing cluster day in day out, conferring, tuning and all those things takes a lot of time. And Ahana came in and takes it off our plate and the hands in a solution which works out of box. So that's where this comes in. Ahana it's also based on open-source community. >> So the time of the engineer's time is so valuable. >> Yeah. >> My take on it really in terms of the data engineering being the SRE. I think that can work, it depends on the actual person, and we definitely try to make the process as easy as possible. I think in Blinkit's case, you guys are... There are data platform owners, but they definitely are aware of the pipelines. >> John: Yeah. >> So they have very intimate knowledge of what data engineers do, but I think in their case, you guys, you're managing a ton of systems. So it's not just even Presto. They have a ton of systems and surfacing that interface so they can cater to all the data engineers across their data systems, I think is the big need for them. I know you guys you want to chime in. I mean, we've seen the architecture and things like that. I think you guys did an amazing job there. >> So, and to adding to Wen's point, right. Like I generally think what DevOps is to the tech team. I think, what is data engineer or the data teams are to the data organization, right? Like they play a very similar role that you have to act as a guardrail to ensure that everyone has access to the data so the democratizing and everything is there, but that has to also come with security, right? And when you do that, there are (indistinct) a lot of points where someone can interact with data. We have... And again, there's a mixed match of open-source tools that works well, as well. And there are some paid tools as well. So for us like for visualization, we use Redash for our ad hoc analysis. And we use Tableau as well whenever we want to give a very concise reporting. We have Jupyter notebooks in place and we have EMRs as well. So we always have a mixed batch of things where people can interact with data. And most of our time is spent in acting as that guardrail to ensure that everyone should have access to data, but it shouldn't be exploited, right. And I think that's where we spend most of our time in. >> Yeah. And I think the time is valuable, but that your point about the democratization aspect of it, there seems to be a bigger step function value that you're enabling and needs to be talked out. The 10x engineer, it's more like 50x, right? If you get it done right, the enablement downstream at the scale that we're seeing with this new trend is significant. It's not just, oh yeah, visualization and get some data quicker, there's actually real advantages on a multiple with that engineering. So, and we saw that with DevOps, right? Like, you do this right and then magic happens on the edges. So, yeah, it's interesting. You guys, congratulations. Great environment. Thanks for sharing the insight Blinkit. Wen, great to see you. Ahana again with Presto, congratulations. The open-source meets data engineering. Thanks so much. >> Thanks, John. >> Appreciate it. >> Okay. >> Thanks John. >> Thanks. >> Thanks for having us. >> This season two, episode two of our ongoing series. This one is Data as Code. This is theCUBE. I'm John furrier. Thanks for watching. (gentle music)

Published Date : Apr 1 2022

SUMMARY :

This is the season two, episode What is the open data lakehouse? And this to the end customer And that is the kind of into the relationships here with you guys. give the details on that. is implementing the operations as well. You need to move with business. This is the really kind of like, I think so. So let's get into what you guys do. and is in a hurry to bring And the folks at Blinkit the open data lakehouse. So on the left hand side of the slide, What is the open-source angle on this Is the license going to change? with lakehouse and you guys? So having the ability to plug So the enablement is and again, you can focus So in the case of table formats, So the agility is a key point. Wen, I got to get in and the needs. And you can connect Presto has a very pluggable architecture. and what led you to Ahana? And consumers like the analytics and you get a very good and also, time it takes to do stuff. and you got choice. best of the both worlds. like the old warehousing as you start seeing the cost going up, and the open table format, the data center's going to be eliminated. figuring out the right things to expose and manages all the scale. and the sizes of that. So the time of the it depends on the actual person, I think you guys did an amazing job there. So, and to adding Thanks for sharing the insight Blinkit. This is theCUBE.

ENTITIES

Entity	Category	Confidence
John Furrier	PERSON	0.99+
Wen Phan	PERSON	0.99+
Akshay Agarwal	PERSON	0.99+
John	PERSON	0.99+
Amazon	ORGANIZATION	0.99+
Ahana	PERSON	0.99+
India	LOCATION	0.99+
Blinkit	ORGANIZATION	0.99+
Satyam Krishna	PERSON	0.99+
Linux Foundation	ORGANIZATION	0.99+
Ahana	ORGANIZATION	0.99+
five minutes	QUANTITY	0.99+
Akshay	PERSON	0.99+
AWS	ORGANIZATION	0.99+
10 minutes	QUANTITY	0.99+
Three guests	QUANTITY	0.99+
Satyam	PERSON	0.99+
Blinkit	PERSON	0.99+
one day	QUANTITY	0.99+
10 minute	QUANTITY	0.99+
Redshift	TITLE	0.99+
both worlds	QUANTITY	0.99+
over 200,000 orders	QUANTITY	0.99+
Presto	PERSON	0.99+
over 10 million	QUANTITY	0.99+
SQL	TITLE	0.99+
10x	QUANTITY	0.99+
Wen	PERSON	0.98+
50x	QUANTITY	0.98+
agile	TITLE	0.98+
one piece	QUANTITY	0.98+
both	QUANTITY	0.98+
three	QUANTITY	0.98+
today	DATE	0.98+
one	QUANTITY	0.98+
single destination	QUANTITY	0.97+
One person	QUANTITY	0.97+
each time	QUANTITY	0.96+
each	QUANTITY	0.96+
Presto	ORGANIZATION	0.96+
one person	QUANTITY	0.96+
single source	QUANTITY	0.96+
Tableau	TITLE	0.96+
one tool	QUANTITY	0.96+
Icebergs	ORGANIZATION	0.96+
Today	DATE	0.95+
One	QUANTITY	0.95+
one thing	QUANTITY	0.95+

Steven Mih, Ahana & Girish Baliga, Uber | CUBE Conversation

(bright music) >> Hey everyone, welcome to this CUBE conversation featuring Ahana, I'm your host Lisa Martin. I've got two guests here with me today. Steven Mih joins us, the Presto Foundation governing board member, co-founder and CEO of Ahana, and Girish Baliga Presto Foundation governing board chair and senior engineering manager at Uber. Guys thanks for joining us. >> Thanks for having us. >> Thanks for having us. >> So Steven we're going to dig into and unpack Presto in the next few minutes or so, but Steven let's go ahead and start with you. Talk to us about some of the challenges with the open data lake house market. What are some of those key challenges that organizations are facing? >> Yeah, just pulling up the slide you know, what we see is that many organizations are dealing with a lot more data and very different data types and putting that all into, traditionally as the data warehouse, which has been the workhorse for BI and analytics traditionally, it becomes very, very expensive, and there's a lot of lock in associated with that. And so what's happening is that people are putting the data semistructured and unstructured data for example, in cloud data lakes or other data lakes, and they find that they can query directly with a SQL query engine like Presto. And that lets you have a much more approach to dealing with getting insights out of your data. And that's what this is all about, and that's why companies are moving to a modern architecture. Girish maybe you can share some of your thoughts on how Uber uses Presto for this. >> Yeah, at Uber we use Presto in our internal deployments. So at Uber we have our own data centers, we store data locally in our data centers, but we have made the conscious choice to go with an open data stack. Our entire data stack is built around open source technologies like Hadoop, Hive, Spark and Presto. And so Presto is an invaluable engine that is able to connect to all these different storage and data formats and allow us to have a single entry point for our users, to run their SQL engines and get insights rather quickly compared to some of the other engines that we have at Uber. >> So let's talk a little bit about Presto so that the audience gets a good overview of that. Steven starting with you, you talked about the challenges of the traditional data warehouse application. Talk to us about why Presto was founded the open, the project, give us that background information if you will. >> Absolutely, so Presto was originally developed out of the biggest hyperscaler out there which is Facebook now known as Meta. And they donated that project to the, and open sourced it and donated it to the Linux Foundation. And so Presto is a SQL query engine, it's a storage SQL query engine, that runs directly on open data lakes, so you can put your data into open formats like 4K or C, and get insights directly from that at a very good price performance ratio. The Presto Foundation of which Girish and I are part of, we're all working together as a consortium of companies that all want to see Presto continue to get bigger and bigger. Kind of like Kubernetes has a, has an organization called CNCF, Presto has Presto Foundation all under the umbrella of the Linux Foundation. And so there's a lot of exciting things that are coming on the roadmap that make Presto very unique. You know, RaptorX is a multilevel caching system that it's been fantastic, Aria optimizations are another area, we Ahana have developed some security features with donating the integrations with Apache Ranger and that's the type of things that we do to help the community. But maybe Girish can talk about some of the exciting items on the roadmap that you're looking forward to. >> Absolutely, I think from Uber's point of view just a sheer scale of data and our volume of query traffic. So we run about half a million Presto queries a day, right? And we have thousands of machines in our Presto deployments. So at that scale in addition to functionality you really want a system that can handle traffic reliably, that can scale, and that is backed by a strong community which guarantees that if you pull in the new version of Presto, you won't break anything, right? So all of those things are very important to us. So I think that's where we are relying on our partners particularly folks like Facebook and Twitter and Ahana to build and maintain this ecosystem that gives us those guarantees. So that is on the reliability front, but on the roadmap side we are also excited to see where Presto is extending. So in addition to the projects that Steven talked about, we are also looking at things like Presto and Spark, right? So take the Presto SQL and run it as a Spark job for instance, or running Presto on real-time analytics applications something that we built and contributed from Uber side. So we are all taking it in very different directions, we all have different use cases to support, and that's the exciting thing about the foundation. That it allows us all to work together to get Presto to a bigger and better and more flexible engine. >> You guys mentioned Facebook and I saw on the slide I think Twitter as well. Talk to me about some of the organizations that are leveraging the Presto engine and some of the business benefits. I think Steve you talked about insights, Steven obviously being able to get insights from data is critical for every business these days. >> Yeah, a major, major use case is finding the ad hoc and interactive queries, and being able to drive insights from doing so. And so, as I mentioned there's so much data that's being generated and stored, and to be able to query that data in place, at a, with very, very high performance, meaning that you can get answers back in seconds of time. That lets you have the interactive ability to drill into data and innovate your business. And so this is fantastic because it's been developed at hyperscalers like Uber that allow you to have open source technology, pick that up, and just download it right from prestodb.io, and then start to run with this and join the community. I think from an open source perspective this project under the governance of Linux Foundation gives you the confidence that it's fully transparent and you'll never see any licensing changes by the Linux Foundation charter. And therefore that means the technology remains free forever without later on limitations occurring, which then would perhaps favor commercialization of any one vendor. That's not the case. So maybe Girish your thoughts on how we've been able to attract industry giants to collaborate, to innovate further, and your thoughts on that. >> Yeah, so of the interesting I've seen in the space is that there is a bifurcation of companies in this ecosystem. So there are these large internet scale companies like Facebook, and Uber, and Twitter, which basically want to use something like Presto for their internal use cases. And then there is the second set of companies, enterprise companies like Ahana which basically wanted to take Presto and provide it as a service for other companies to use as an alternative to things like Snowflake and other systems right? So, and the foundation is a great place for both sets of companies to come together and work. The internet scale companies bring in the scale, the reliability, the different kind of ways in which you can challenge the system, optimize it, and so forth, and then companies like Ahana bring in the flexibility and the extensibility. So you can work with different clouds, different storage formats, different engines, and I think it's a great partnership that we can see happening primarily through the foundational spaces. Which you would be hard pressed to find in a single vendor or a, you know, a single-source system that is there on the market today. >> How long ago was the Presto Foundation initiated? >> It's been over three years now and it's been going strong, we're over a dozen members and it's open to everyone. And it's all governed like the Linux Foundation so we use best practices from that and you can just check it out at prestodb.io where you can get the software, or you can hear about how to join the foundation. So it includes members like Intel, and HPE as well, and we're really excited for new members to come, and contribute in and participate. >> Sounds like you've got good momentum there in the foundation. Steven talk a little bit about the last two years. Have you seen the acceleration in use cases in the number of users as we've been in such an interesting environment where the need for real-time insights is essential for every business initially a few couple of years ago to survive but now to be, to really thrive, is it, have you seen the acceleration in Presto in that timeframe? >> Absolutely, we see there's acceleration of being more data-driven and especially moving to cloud and having more data in the cloud, we think that innovation is happening, digital innovation is happening very fast and Presto is a major enabler of that, again, being able to get, drive insights from the data this is not just your typical business data, it's now getting into really clickstream data, knowing about how customers are operating today, Uber is a great example of all the different types of innovations they can drive, whether it be, you know, knowing in real time what's happening with rides, or offering you a subscription for special deals to use the service more. So, you know, Ahana we really love Presto, and we provide a SaaS manage service of the open source and provide free trials, and help people get up to speed that may not have the same type of skills as Uber or Facebook does. And we work with all companies in that way. >> Think about the consumers these days, we're very demanding, right? When I think one of the things that was in short supply during the last two years was patience. And if I think of Uber as a great example, I want to know if I'm asking for a ride I want to know exactly in real time what's coming for me? Where is it now? How many more minutes is it going to take? I mean, that need to fulfill real-time insights is critical across every industry but have you seen anything in the last couple years that's been more leading edge, like e-commerce or retail for example? I'm just curious. >> Girish you want to take that one or? >> Yeah, sure. So I can speak from the Uber point of view. So real-time insights has really exploded as an area, particularly as you mentioned with this just-in-time economy, right? Just to talk about it a little bit from Uber side, so some of the insights that you mentioned about when is your ride coming, and things of that nature, right? Look at it from the driver's point of view who are, now we have Uber Eats, so look at it from the restaurant manager's point of view, right? They also want to know how is their business coming? How many customer orders are coming for instance? what is the conversion rate? And so forth, right? And today these are all insights that are powered by a system which has a Presto as an front-end interface at Uber. And these queries run like, you have like tens of thousands of queries every single second, and the queries run in like a second and so forth. So you are really talking about production systems running on top of Presto, production serving systems. So coming to other use cases like eCommerce, we definitely have seen some of that uptake happen as well, so in the broader community for instance, we have companies like Stripe, and other folks who are also using this hashtag which is very similar to us based on another open source technology called Pino, using Presto as an interface. And so we are seeing this whole open data lakehouse more from just being, you know, about interactive analytics to driving all different kinds of analytics. Having anything to do with data and insights in this space. >> Yeah, sounds like the evolution has been kind of on a rocket ship the last couple years. Steven, one more time we're out of time, but can you mention that URL where folks can go to learn more? >> Yeah, prestodb.io and that's the Presto Foundation. And you know, just want to say that we'll be sharing the use case at the Startup Showcase coming up with theCUBE. We're excited about that and really welcome everyone to join the community, it's a real vibrant, expanding community and look forward to seeing you online. >> Sounds great guys. Thank you so much for sharing with us what Presto Foundation is doing, all of the things that it is catalyzing, great stuff, we look forward to hearing that customer use case, thanks for your time. >> Thank you. >> Thanks Lisa, thank you. >> Thanks everyone. >> For Steven and Girish, I'm Lisa Martin, you're watching theCUBE the leader in live tech coverage. (bright music)

Published Date : Mar 24 2022

SUMMARY :

and Girish Baliga Presto in the next few minutes or so, And that lets you have that is able to connect to so that the audience gets and that's the type of things that we do So that is on the reliability front, and some of the business benefits. and then start to run with So, and the foundation is a great place and it's open to everyone. in the number of users as we've been and having more data in the cloud, I mean, that need to fulfill so some of the insights that you mentioned Yeah, sounds like the evolution and look forward to seeing you online. all of the things that it For Steven and Girish, I'm Lisa Martin,

ENTITIES

Entity	Category	Confidence
Lisa Martin	PERSON	0.99+
Steven	PERSON	0.99+
Steve	PERSON	0.99+
Girish	PERSON	0.99+
Lisa	PERSON	0.99+
Uber	ORGANIZATION	0.99+
Steven Mih	PERSON	0.99+
Presto Foundation	ORGANIZATION	0.99+
Facebook	ORGANIZATION	0.99+
Ahana	ORGANIZATION	0.99+
Linux Foundation	ORGANIZATION	0.99+
CNCF	ORGANIZATION	0.99+
Twitter	ORGANIZATION	0.99+
Intel	ORGANIZATION	0.99+
two guests	QUANTITY	0.99+
HPE	ORGANIZATION	0.99+
Presto	ORGANIZATION	0.99+
second set	QUANTITY	0.99+
both sets	QUANTITY	0.99+
over three years	QUANTITY	0.99+
Ahana	PERSON	0.98+
Kubernetes	ORGANIZATION	0.98+
Spark	TITLE	0.97+
Girish Baliga	PERSON	0.97+
about half a million	QUANTITY	0.97+
today	DATE	0.97+
over a dozen members	QUANTITY	0.96+
one	QUANTITY	0.96+
Presto	TITLE	0.96+
SQL	TITLE	0.95+
single	QUANTITY	0.95+
thousands of machines	QUANTITY	0.94+
every single second	QUANTITY	0.93+
Girish Baliga Presto Foundation	ORGANIZATION	0.92+
prestodb.io	OTHER	0.91+
last couple years	DATE	0.9+
4K	OTHER	0.89+
Startup Showcase	EVENT	0.88+
one vendor	QUANTITY	0.88+

Dipti Borkar, Ahana, and Derrick Harcey, Securonix | CUBE Conversation, July 2021

(upbeat music) >> Welcome to theCUBE Conversation. I'm John Furrier, host of theCUBE here in Palo Alto, California, in our studios. We've got a great conversation around open data link analytics on AWS, two great companies, Ahana and Securonix. Dipti Borkar, Co-founder and Chief Product Officer at Ahana's here. Great to see you, and Derrick Harcey, Chief Architect at Securonix. Thanks for coming on, really appreciate you guys spending the time. >> Yeah, thanks so much, John. Thank you for having us and Derrick, hello again. (laughing) >> Hello, Dipti. >> We had a great conversation around our startup showcase, which you guys were featured last month this year, 2021. The conversation continues and a lot of people are interested in this idea of open systems, open source. Obviously open data lakes is really driving a lot of value, especially with machine learning and whatnot. So this is a key, key point. So can you guys just take a step back before we get under the hood and set the table on Securonix and Ahana? What's the big play here? What is the value proposition? >> Why sure, I'll give a quick update. Securonix has been in the security business. First, a user and entity, behavioral analytics, and then the next generation SIEM platform for 10 years now. And we really need to take advantage of some cutting edge technologies in the open source community and drive adoption and momentum that we can not only bring in data from our customers, that they can find security threats, but also store in a way that they can use for other purposes within their organization. That's where the open data lake is very critical. >> Yeah and to add on to that, John, what we've seen, you know, traditionally we've had data warehouses, right? We've had operational systems move all of their data into the warehouse and those, you know, while these systems are really good, built for good use cases, the amount of data is exploding, the types of data is exploding, different types, semi-structured, structured and so when, as companies like Securonix in the security space, as well as other verticals, look for getting more insights out of their data, there's a new approach that's emerging where you have a data lake, which AWS has revolutionized with S3 and commoditized and there's analytics that's built on top of it. And so we're seeing a lot of good advantages that come out of this new approach. >> Well, it's interesting EC2 and S3 are having their 15th birthday, as they say in Amazon's interesting teenage years, but while I got you guys here, I want to just ask you, can you define the SIEM thing because the SIEM market is exploding, it just changed a little bit. Obviously it's data, event management, but again, as data becomes more proliferating, and it's not stopping anytime soon, as cloud native applications emerge, why is this important? What is this SIEM category? What's it about? >> Yeah, thanks. I'll take that. So obviously SIEM traditionally has been around for about a couple of decades and it really started with first log collection and management and rule-based threat detection. Now what we call next generation SIEM is really the modernization of a security platform that includes streaming threat detection and behavioral analysis and data analytics. We literally look for thousands of different threat detection techniques, and we chained together sequences of events and we stream everything in real time and it's very important to find threats as quickly as possible. But the momentum that we see in the industry as we see massive sizes of customers, we have made a transition from on-premise to the cloud and we literally are processing tens of petabytes of data for our customers. And it's critical that we can adjust data quickly, find threats quickly and allow customers to have the tools to respond to those security incidents quickly and really get the handle on their security posture. >> Derrick, if I ask you what's different about this next gen SIEM, what would you say and what's the big a-ha? What's the moment there? What's the key thing? >> The real key is taking the off the boundaries of scale. We want to be able to ingest massive quantities of data. We want to be able to do instant threat detection, and we want to be able to search on the entire forensic data set across all of the history of our customer base. In the past, we had to make sacrifices, either on the amount of data we ingest or the amount of time that we stored that data. And the really the next generation SIEM platform is offering advanced capabilities on top of that data set because those boundaries are no longer barriers for us. >> Dipti, any comment before I jump into the question for you? >> Yeah, you know, absolutely. It is about scale and like I mentioned earlier, the amount of data is only increasing and it's also the types of information. So the systems that were built to process this information in the past are, you know, support maybe terabytes of data, right? And that's where new technologies open source engines like Presto come in, which were built to handle internet scale. Presto was kind of created at Facebook to handle these petabytes that Derrick is talking about that every industry is now seeing where we're are moving from gigs to terabytes to petabytes. And that's where the analytic stack is moving. >> That's a great segue. I want to ask you while I got you here 'cause this is again, the definitions, 'cause people love to hear the experts weigh in. What is open data lake analytics? How would you define that? And then talk about where Presto fits in. >> Yeah, that's a great question. So the way I define open data lake analytics is you have a data lake on the core, which is, let's say S3, it's the most popular one, but on top of it, there are open aspects, it is open format. Open formats play a very important role because you can have different types of processing. It could be SQL processing, it could be machine learning, it could be other types of workloads, all work on these open formats versus a proprietary format where it's locked and it's open interfaces. Open interfaces that are like SQL, JDBC, ODBC is widely accessible to a range of tools. And so it's everywhere. Open source is a very important part of it. As companies like Securonix pick these technologies for their mission critical systems, they want to know that this is going to be available and open for them for a long period of time. And that's why open source becomes important. And then finally, I would say open cloud because at the end of the day, you know, while AWS is where a lot of the innovations happening, a lot of the market is, there are other clouds and open cloud is something that these engines were built for, right? So that's how I define open data lake analytics. It's analytics with query engines built on top of these open formats, open source, open interfaces and open cloud. Now Presto comes in where you want to find the needle in the haystack, right? And so when you have these deep questions about where did the threat come from or who was it, right? You have to ask these questions of your data. And Presto is an open source distributed SQL engine that allows data platform teams to run queries on their data lakes in a high-performance ways, in memory and on these petabytes of data. So that's where Presto fits in. It's one of the defacto query engines for SQL analysis on the data lake. So hopefully that answers the question, gives more context. >> Yeah, I mean, the joke about data lakes has been you don't want to be a data swamp, right? That's what people don't want. >> That's right. >> But at the same time, the needle in the haystack, it's like big data is like a needle in a haystack of needles. So there's a constant struggle to getting that data, the right data at the right time. And what I learned in the last presentation, you guys both presented, your teams presented at the conference was the managed service approach. Could you guys talk about why that approach works well together with you guys? Because I think when people get to the cloud, they replatform, then they start refactoring and data becomes a real big part of that. Why is the managed service the best approach to solving these problems? >> Yeah and interestingly, both Securonix and Ahana have a managed service approach so maybe Derrick can go first and I can go after. >> Yeah, yeah. I'll be happy to go first. You know, we really have found making the transition over the last decade from off premise to the cloud for the majority of our customers that running a large open data lake requires a lot of different skillsets and there's hundreds of technologies in the open source community to choose from and to be able to choose the right blend of skillsets and technologies to produce a comprehensive service is something that customers can do, many customers did do, and it takes a lot of resources and effort. So what we really want to be able to do is take and package up our security service, our next generation SIEM platform to our customers where they don't need to become experts in every aspect of it. Now, an underlying component of that for us is how we store data in an open standards way and how we access that data in an open standards way. So just like we want our customers to get immediate value from the security services that we provide, we also want to be able take advantage of a search service that is offered to us and supported by a vendor like Ahana where we can very quickly take advantage of that value within our core underlying platform. So we really want to be able to make a frictionless effort to allow our customers achieve value as quick as possible. >> That's great stuff. And on the Ahana side, open data lakes, really the ease of use there, it sounds easy to me, but we know it's not easy just to put data in a data lake. At the end of the day, a lot of customers want simplicity 'cause they don't have the staffing. This comes up a lot. How do you leverage their open source participation and/or getting stood up quickly so they can get some value? Because that seems to be the number one thing people want right now. Dipti, how does that work? How do people get value quickly? >> Yeah, absolutely. When you talk about these open source press engines like Presto and others, right? They came out of these large internet companies that have a lot of distributed systems, engineers, PhDs, very kind of advanced level teams. And they can manage these distributed systems building onto them, add features at large scale, but not every company can and these engines are extremely powerful. So when you combine the power of Presto with the cloud and a managed service, that's where value for everyone comes in. And that's what I did with Ahana is looked at Presto, which is a great engine, but converted it into a great user experience so that whether it's a three person platform team or a five person platform team, they still get the same benefit of Presto that a Facebook gets, but at much, much a less operational complexity cost, as well as the ability to depend on a vendor who can then drive the innovation and make it even better. And so that's where managed services really com in. There's thousands of credit parameters that need to be tuned. With Ahana, you get it out of the box. So you have the best practices that are followed at these larger companies. Our team comes from Facebook, HuBERT and others, and you get that out of the box, with a few clicks you can get up and running. And so you see value immediately, in 30 minutes you're up and running and you can create your data lake versus with Hadoop and these prior systems, it would take months to receive real value from some of these systems. >> Yeah, we saw the Hadoop scar tissue is all great and all good now, but it takes too much resource, standing up clusters, managing it, you can't hire enough people. I got to ask you while you're on that topic, do you guys ship templates? How do you solve the problem of out of the box? You mentioned some out of the box capability. Do you guys think of as recipes, templates? What's your thoughts around what you're providing customers to get up and running? >> Yeah so in the case of Securonix, right, let's say they want to create a Presto cluster. They go into our SAS console. You essentially put in the number of nodes that you want. Number of workers you want. There's a lot of additional value that we built in like caching capabilities if you want more performance, built in cataloging that's again, another single click. And there isn't really as much of a template. Everybody gets the best tuned Presto for their workloads. Now there are certain workloads where you might have interactive in some cases, or you might have transformation batch ETL, and what we're doing next is actually giving you the knobs so that it comes pre tuned for the type of workload that you want to run versus you figuring it out. And so that's what I mean by out of the box, where you don't have to worry about these configuration parameters. You get the performance. And maybe Derrick can you talk a little bit about the benefits of the managed service and the usage as well. >> Yeah, absolutely. So, I'll answer the same question and then I'll tie back to what Dipti asked. Really, you know, our customers, we want it to be very easy for them to ingest security event logs. And there's really hundreds of types of a security event logs that we support natively out of the box, but the key for us is a standard that we call the open event format. And that is a normalized schema. We take any data source in it's normalized format, be a collector device a customer uses on-premise, they send the data up to our cloud, we do streaming analysis and data analytics to determine where the threats are. And once we do that, then we send the data off to a long-term storage format in a standards-based Parquet file. And that Parquet file is natively read by the Ahana service. So we simply deploy an Ahana cluster that uses the Presto engine that natively supports our open standard file format. And we have a normalized schema that our application can immediately start to see value from. So we handle the collection and streaming ingest, and we simply leverage the engine in Ahana to give us the appropriate scale. We can size up and down and control the cost to give the users the experience that they're paying for. >> I really love this topic because one, not only is it cutting edge, but it's very relevant for modern applications. You mentioned next gen SIEMs, SIEM, security information event management, not SIM as memory card, which I think of all the time because I always want to add more, but this brings up the idea of streaming data real-time, but as more services go to the cloud, Derrick, if you don't mind sharing more on this. Share the journey that you guys gone through, because I think a lot of people are looking at the cloud and saying, and I've been in a lot of these conversations about repatriation versus cloud. People aren't going that way. They're going more innovation with his net new revenue models emerging from the value that they're getting out of understanding events that are happening within the network and the apps, even when they're being stood up and torn down. So there's a lot of cloud native action going on where just controlling and understanding is way beyond the, just put stuff into an event log. It's a whole nother animal. >> Well, there's a couple of paradigm shifts that we've seen major patterns for in the last five or six years. Like I said, we started with the safe streaming ingest platform on premise. We use some different open source technologies. What we've done when we moved to the cloud is we've adopted cloud native services as part of our underlying platform to modernize and make our service cloud native. But what we're seeing as many customers either want to focus on on-premise deployments and especially financial institutions and government institute things, because they are very risk averse. Now we're seeing even those customers are realizing that it's very difficult to maintain the hundreds or thousands of servers that it requires on premise and have the large skilled staff required to keep it running. So what we're seeing now is a lot of those customers deployed some packaged products like our own, and even our own customers are doing a mass migration to the cloud because everything is handled for them as a service. And we have a team of experts that we maintain to support all of our global customers, rather than every one of our global customers having their own teams that we then support on the back end. So it's a much more efficient model. And then the other major approach that many of our customers also went down the path of is, is building their own security data lake. And many customers were somewhat successful in building their own security data lake but in order to keep up with the innovation, if you look at the analyst groups, the Gartner Magic Quadrant on the SIEM space, the feature set that is provided by a packaged product is a very large feature set. And even if somebody was put together all of the open source technologies to meet 20% of those features, just maintaining that over time is very expensive and very difficult. So we want to provide a service that has all of the best in class features, but also leverages the ability to innovate on the backend without the customer knowing. So we can do a technology shift to Ahana and Presto from our previous technology set. The customer doesn't know the difference, but they see the value add within the service that we're offering. >> So if I get this right, Derrick, Presto's enabling you guys to do threat detection at a level that you're super happy with as well as giving you the option for give self-service. Is that right for the, is that a kind of a- >> Well, let me clarify our definition. So we do streaming threat detection. So we do a machine learning based behavioral analysis and threat detection on rule-based correlation as well. So we do threat detection during the streaming process, but as part of the process of managing cybersecurity, the customer has a team of security analysts that do threat hunting. And the threat hunting is where Ahana comes in. So a human gets involved and starts searches for the forensic logs to determine what happened over time that might be suspicious and they start to investigate through a series of queries to give them the information that's relevant. And once they find information that's relevant, then they package it up into an algorithm that will do a analysis on an ongoing basis as part of the stream processing. So it's really part of the life cycle of hunting a real time threat detection. >> It's kind of like old adage hunters and farmers, you're farming through the streaming and hunting with the detection. I got to ask you, what would it be the alternative if you go back, I mean, I know cloud's so great because you have cutting edge applications and technologies. Without Presto, where would you be? I mean, what would be life like without these capabilities? What would have to happen? >> Well, the issue is not that we had the same feature set before we moved to Presto, but the challenge was on scale. The cost profile to continue to grow from 100 terabytes to one petabyte, to tens of petabytes, not only was it expensive, but it just, the scaling factors were not linear. So not only did we have a problem with the costs, but we also had a problem with the performance tailing off and keeping the service running. A large Hadoop cluster, for example, our first incarnation of this use, the hive service, in order to query data in a MapReduce cluster. So it's a completely different technology that uses a distributed Hadoop compute cluster to do the query. It does work, but then we start to see resource contention with that, and all the other things in the Hadoop platform. The Presto engine has the beauty of it, not only was it designed for scale, but it's feature built just for a query engine and that's the providing the right tool for the job, as opposed to a general purpose tool. >> Derrick, you've got a very busy job as chief architect. What are you excited about going forward when you look at the cloud technologies? What are you looking at? What are you watching? What are you getting excited about or what worries you? >> Well, that's a good question. What we're really doing, I'm leading up a group called the Securonix Innovation Labs, and we're looking at next generation technologies. We go through and analyze both open source technologies, technologies that are proprietary as well as building own technologies. And that's where we came across Ahana as part of a comprehensive analysis of different search engines, because we wanted to go through another round of search engine modernization, and we worked together in a partnership, and we're going to market together as part of our modernization efforts that we're continuously going through. So I'm looking forward to iterative continuous improvement over time. And this next journey, what we're seeing because of the growth in cybersecurity, really requires new and innovative technologies to work together holistically. >> Dipti, you got a great company that you co-founded. I got to ask you as the co-founder and chief product officer, you both the lead entrepreneur also, got the keys to the kingdom with the products. You got to balance that 20 miles stare out in the future while driving product excellence. You've got open source as a tailwind. What's on your mind as you go forward with your venture? >> Yeah. Great question. It's been super exciting to have found the Ahana in this space, cloud data and open source. That's where the action is happening these days, but there's two parts to it. One is making our customers successful and continuously delivering capabilities, features, continuing on our ease of use theme and a foundation to get customers like Securonix and others to get most value out of their data and as fast as possible, right? So that's a continuum. In terms of the longer term innovation, the way I see the space, there is a lot more innovation to be done and Presto itself can be made even better and there's a next gen Presto that we're working on. And given that Presto is a part of the foundation, the Linux Foundation, a lot of this innovation is happening together collaboratively with Facebook, with Uber who are members of the foundation with us. Securonix, we look forward to making a part of that foundation. And that innovation together can then benefit the entire community as well as the customer base. This includes better performance with more capabilities built in, caching and many other different types of database innovations, as well as scaling, auto scaling and keeping up with this ease of use theme that we're building on. So very exciting to work together with all these companies, as well as Securonix who's been a fantastic partner. We work together, build features together, and I look at delivering those features and functionalities to be used by these analysts, data scientists and threat hunters as Derrick called them. >> Great success, great partnership. And I love the open innovation, open co-creation you guys are doing together and open data lakes, great concept, open data analytics as well. This is the future. Insights coming from the open and sharing and actually having some standards. I love this topic, so Dipti, thank you very much, and Derrick, thanks for coming on and sharing on this Cube Conversation. Thanks for coming on. >> Thank you so much, John. >> Thanks for having us. >> Thanks. Take care. Bye-bye. >> Okay, it's theCube Conversation here in Palo Alto, California. I'm John furrier, your host of theCube. Thanks for watching. (upbeat music)

Published Date : Jul 30 2021

SUMMARY :

guys spending the time. and Derrick, hello again. and set the table on Securonix and Ahana? and momentum that we can into the warehouse and those, you know, because the SIEM market is exploding, and really get the handle either on the amount of data we ingest and it's also the types of information. hear the experts weigh in. So hopefully that answers the Yeah, I mean, the joke Why is the managed Yeah and interestingly, a search service that is offered to us And on the Ahana side, open data lakes, and you get that out of the box, I got to ask you while and the usage as well. and control the cost from the value that they're getting and have the large skilled staff as well as giving you the for the forensic logs to and hunting with the detection. and that's the providing when you look at the cloud technologies? because of the growth in cybersecurity, got the keys to the and a foundation to get And I love the open here in Palo Alto, California.

ENTITIES

Entity	Category	Confidence
Securonix	ORGANIZATION	0.99+
John	PERSON	0.99+
Derrick Harcey	PERSON	0.99+
Derrick	PERSON	0.99+
Facebook	ORGANIZATION	0.99+
Ahana	ORGANIZATION	0.99+
Ahana	PERSON	0.99+
John Furrier	PERSON	0.99+
20%	QUANTITY	0.99+
July 2021	DATE	0.99+
Uber	ORGANIZATION	0.99+
Dipti	PERSON	0.99+
100 terabytes	QUANTITY	0.99+
Amazon	ORGANIZATION	0.99+
10 years	QUANTITY	0.99+
AWS	ORGANIZATION	0.99+
hundreds	QUANTITY	0.99+
Linux Foundation	ORGANIZATION	0.99+
two parts	QUANTITY	0.99+
thousands	QUANTITY	0.99+
Securonix Innovation Labs	ORGANIZATION	0.99+
tens of petabytes	QUANTITY	0.99+
30 minutes	QUANTITY	0.99+
one petabyte	QUANTITY	0.99+
Dipti Borkar	PERSON	0.99+
20 miles	QUANTITY	0.99+
Palo Alto, California	LOCATION	0.99+
five person	QUANTITY	0.99+
First	QUANTITY	0.99+
SQL	TITLE	0.99+
last month	DATE	0.99+
both	QUANTITY	0.99+
One	QUANTITY	0.98+
15th birthday	QUANTITY	0.97+
two great companies	QUANTITY	0.96+
HuBERT	ORGANIZATION	0.96+
Hadoop	TITLE	0.96+
S3	TITLE	0.96+
hundreds of technologies	QUANTITY	0.96+
three person	QUANTITY	0.95+
Parquet	TITLE	0.94+
first incarnation	QUANTITY	0.94+
first	QUANTITY	0.94+
Presto	ORGANIZATION	0.93+
Gartner	ORGANIZATION	0.93+
last decade	DATE	0.92+
terabytes of data	QUANTITY	0.92+
first log	QUANTITY	0.91+
single click	QUANTITY	0.9+
Presto	PERSON	0.9+
theCUBE	ORGANIZATION	0.88+

Steven Mih, Ahana and Sachin Nayyar, Securonix | AWS Startup Showcase

>> Voiceover: From theCUBE's Studios in Palo Alto in Boston, connecting with thought leaders all around the world, this is theCUBE Conversation. >> Welcome back to theCUBE's coverage of the AWS Startup Showcase. Next Big Thing in AI, Security and Life Sciences featuring Ahana for the AI Trek. I'm your host, John Furrier. Today, we're joined by two great guests, Steven Mih, Ahana CEO, and Sachin Nayyar, Securonix CEO. Gentlemen, thanks for coming on theCUBE. We're talking about the Next-Gen technologies on AI, Open Data Lakes, et cetera. Thanks for coming on. >> Thanks for having us, John. >> Thanks, John. >> What a great line up here. >> Sachin: Thanks, Steven. >> Great, great stuff. Sachin, let's get in and talk about your company, Securonix. What do you guys do? Take us through, I know you've got a slide to help us through this, I want to introduce your stuff first then jump in with Steven. >> Absolutely. Thanks again, Steven. Ahana team for having us on the show. So Securonix, we started the company in 2010. We are the leader in security analytics and response capability for the cybermarket. So basically, this is a category of solutions called SIEM, Security Incident and Event Management. We are the quadrant leaders in Gartner, we now have about 500 customers today and have been plugging away since 2010. Started the company just really focused on analytics using machine learning and an advanced analytics to really find the needle in the haystack, then moved from there to needle in the needle stack using more algorithms, analysis of analysis. And then kind of, I evolved the company to run on cloud and become sort of the biggest security data lake on cloud and provide all the analytics to help companies with their insider threat, cyber threat, cloud solutions, application threats, emerging internally and externally, and then response and have a great partnership with Ahana as well as with AWS. So looking forward to this session, thank you. >> Awesome. I can't wait to hear the news on that Next-Gen SIEM leadership. Steven, Ahana, talk about what's going on with you guys, give us the update, a lot of stuff happening. >> Yeah. Great to be here and thanks for that such, and we appreciate the partnership as well with both Securonix and AWS. Ahana is the open source company based on PrestoDB, which is a project that came out of Facebook and is widely used, one of the fastest growing projects in data analytics today. And we make a managed service for Presto easily on AWS, all cloud native. And we'll be talking about that more during the show. Really excited to be here. We believe in open source. We believe in all the challenges of having data in the cloud and making it easy to use. So thanks for having us again. >> And looking forward to digging into that managed service and why that's been so successful. Looking forward to that. Let's get into the Securonix Next-Gen SIEM leadership first. Let's share the journey towards what you guys are doing here. As the Open Data Lakes on AWS has been a hot topic, the success of data in the cloud, no doubt is on everyone's mind especially with the edge coming. It's just, I mean, just incredible growth. Take us through Sachin, what do you guys got going on? >> Absolutely. Thanks, John. We are hearing about cyber threats every day. No question about it. So in the past, what was happening is companies, what we have done as enterprise is put all of our eggs in the basket of solutions that were evaluating the network data. With cloud, obviously there is no more network data. Now we have moved into focusing on EDR, right thing to do on endpoint detection. But with that, we also need security analytics across on-premise and cloud. And your other solutions like your OT, IOT, your mobile, bringing it all together into a security data lake and then running purpose built analytics on top of that, and then having a response so we can prevent some of these things from happening or detect them in real time versus innovating for hours or weeks and months, which is is obviously too late. So with some of the recent events happening around colonial and others, we all know cybersecurity is on top of everybody's mind. First and foremost, I also want to. >> Steven: (indistinct) slide one and that's all based off on top of the data lake, right? >> Sachin: Yes, absolutely. Absolutely. So before we go into on Securonix, I also want to congratulate everything going on with the new cyber initiatives with our government and just really excited to see some of the things that the government is also doing in this space to bring, to have stronger regulation and bring together the government and the private sector. From a Securonix perspective, today, we have one third of the fortune 500 companies using our technology. In addition, there are hundreds of small and medium sized companies that rely on Securonix for their cyber protection. So what we do is, again, we are running the solution on cloud, and that is very important. It is not just important for hosting, but in the space of cybersecurity, you need to have a solution, which is not, so where we can update the threat models and we can use the intelligence or the Intel that we gather from our customers, partners, and industry experts and roll it out to our customers within seconds and minutes, because the game is real time in cybersecurity. And that you can only do in cloud where you have the complete telemetry and access to these environments. When we go on-premise traditionally, what you will see is customers are even thinking about pushing the threat models through their standard Dev test life cycle management, and which is just completely defeating the purpose. So in any event, Securonix on the cloud brings together all the data, then runs purpose-built analytics on it. Helps you find very few, we are today pulling in several million events per second from our customers, and we provide just a very small handful of events and reduce the false positives so that people can focus on them. Their security command center can focus on that and then configure response actions on top of that. So we can take action for known issues and have intelligence in all the layers. So that's kind of what the Securonix is focused on. >> Steven, he just brought up, probably the most important story in technology right now. That's ransomware more than, first of all, cybersecurity in general, but ransomware, he mentioned some of the government efforts. Some are saying that the ransomware marketplace is bigger than some governments, nation state governments. There's a business model behind it. It's highly active. It's dominating the scene and it's a real threat. This is the new world we're living in, cloud creates the refactoring capabilities. We're hearing that story here with Securonix. How does Presto and Securonix work together? Because I'm connecting the dots here in real time. I think you're going to go there. So take us through because this is like the most important topic happening. >> Yeah. So as Sachin said, there's all this data that needs to go into the cloud and it's all moving to the cloud. And there's a massive amounts of data and hundreds of terabytes, petabytes of data that's moving into the data lakes and that's the S3-based data lakes, which are the easiest, cheapest, commodified place to put all this data. But in order to deliver the results that Sachin's company is driving, which is intelligence on when there's a ransomware or possibility, you need to have analytics on them. And so Presto is the open source project that is a open source SQL query engine for data lakes and other data sources. It was created by Facebook as part of the Linux foundation, something called Presto foundation. And it was built to replace the complicated Hadoop stack in order to then drive analytics at very lightning fast queries on large, large sets of data. And so Presto fits in with this Open Data Lake analytics movement, which has made Presto one of the fastest growing projects out there. >> What is an Open Data Lake? Real quick for the audience who wants to learn on what it means. Does is it means it's open source in the Linux foundation or open meaning it's open to multiple applications? What does that even mean? >> Yeah. Open Data Lake analytics means that you're, first of all, your data lake has open formats. So it is made up of say something called the ORC or Parquet. And these are formats that any engine can be used against. That's really great, instead of having locked in data types. Data lakes can have all different types of data. It can have unstructured, semi-structured data. It's not just the structured data, which is typically in your data warehouses. There's a lot more data going into the Open Data Lake. And then you can, based on what workload you're looking to get benefit from, the insights come from that, and actually slide two covers this pictorially. If you look on the left here on slide two, the Open Data Lake is where all the data is pulling. And Presto is the layer in between that and the insights which are driven by the visualization, reporting, dashboarding, BI tools or applications like in Securonix case. And so analytics are now being driven by every company for not just industries of security, but it's also for every industry out there, retail, e-commerce, you name it. There's a healthcare, financials, all are looking at driving more analytics for their SaaSified applications as well as for their own internal analysts, data scientists, and folks that are trying to be more data-driven. >> All right. Let's talk about the relationship now with where Presto fits in with Securonix because I get the open data layer. I see value in that. I get also what we're talking about the cloud and being faster with the datasets. So how does, Sachin' Securonix and Ahana fit in together? >> Yeah. Great question. So I'll tell you, we have two customers. I'll give you an example. We have two fortune 10 customers. One has moved most of their operations to the cloud and another customer which is in the process, early stage. The data, the amount of data that we are getting from the customer who's moved fully to the cloud is 20 times, 20 times more than the customer who's in the early stages of moving to the cloud. That is because the ability to add this level of telemetry in the cloud, in this case, it happens to be AWS, Office 365, Salesforce and several other rescalers across several other cloud technologies. But the level of logging that we are able to get the telemetry is unbelievable. So what it does is it allows us to analyze more, protect the customers better, protect them in real time, but there is a cost and scale factor to that. So like I said, when you are trying to pull in billions of events per day from a customer billions of events per day, what the customers are looking for is all of that data goes in, all of data gets enriched so that it makes sense to a normal analyst and all of that data is available for search, sometimes 90 days, sometimes 12 months. And then all of that data is available to be brought back into a searchable format for up to seven years. So think about the amount of data we are dealing with here and we have to provide a solution for this problem at a price that is affordable to the customer and that a medium-sized company as well as a large organization can afford. So after a lot of our analysis on this and again, Securonix is focused on cyber, bringing in the data, analyzing it, so after a lot of our analysis, we zeroed in on S3 as the core bucket where this data needs to be stored because the price point, the reliability, and all the other functions available on top of that. And with that, with S3, we've created a great partnership with AWS as well as with Snowflake that is providing this, from a data lake perspective, a bigger data lake, enterprise data lake perspective. So now for us to be able to provide customers the ability to search that data. So data comes in, we are enriching it. We are putting it in S3 in real time. Now, this is where Presto comes in. In our research, Presto came out as the best search engine to sit on top of S3. The engine is supported by companies like Facebook and Uber, and it is open source. So open source, like you asked the question. So for companies like us, we cannot depend on a very small technology company to offer mission critical capabilities because what if that company gets acquired, et cetera. In the case of open source, we are able to adopt it. We know there is a community behind it and it will be kind of available for us to use and we will be able to contribute in it for the longterm. Number two, from an open source perspective, we have a strong belief that customers own their own data. Traditionally, like Steven used the word locked in, it's a key term, customers have been locked in into proprietary formats in the past and those days are over. You should be, you own the data and you should be able to use it with us and with other systems of choice. So now you get into a data search engine like Presto, which scales independently of the storage. And then when we start looking at Presto, we came across Ahana. So for every open source system, you definitely need a sort of a for-profit company that invests in the community and then that takes the community forward. Because without a company like this, the community will die. So we are very excited about the partnership with Presto and Ahana. And Ahana provides us the ability to take Presto and cloudify it, or make the cloud operations work plus be our conduit to the Ahana community. Help us speed up certain items on the roadmap, help our team contribute to the community as well. And then you have to take a solution like Presto, you have to put it in the cloud, you have to make it scale, you have to put it on Kubernetes. Standard thing that you need to do in today's world to offer it as sort of a micro service into our architecture. So in all of those areas, that's where our partnership is with Ahana and Presto and S3 and we think, this is the search solution for the future. And with something like this, very soon, we will be able to offer our customers 12 months of data, searchable at extremely fast speeds at very reasonable price points and you will own your own data. So it has very significant business benefits for our customers with the technology partnership that we have set up here. So very excited about this. >> Sachin, it's very inspiring, a couple things there. One, decentralize on your own data, having a democratized, that piece is killer. Open source, great point. >> Absolutely. >> Company goes out of business, you don't want to lose the source code or get acquired or whatever. That's a key enabler. And then three, a fast managed service that has a commercial backing behind it. So, a great, and by the way, Snowflake wasn't around a couple of years ago. So like, so this is what we're talking about. This is the cloud scale. Steven, take us home with this point because this is what innovation looks like. Could you share why it's working? What's some of the things that people could walk away with and learn from as the new architecture for the new NextGen cloud is here, so this is a big part of and share how this works? >> That's right. As you heard from Sachin, every company is becoming data-driven and analytics are central to their business. There's more data and it needs to be analyzed at lower cost without the locked in and people want that flexibility. And so a slide three talks about what Ahana cloud for Presto does. It's the best Presto out of the box. It gives you very easy to use for your operations team. So it can be one or two people just managing this and they can get up to speed very quickly in 30 minutes, be up and running. And that jump starts their movement into an Open Data Lake analytics architecture. That architecture is going to be, it is the one that is at Facebook, Uber, Twitter, other large web scale, internet scale companies. And with the amount of data that's occurring, that's now becoming the standard architecture for everyone else in the future. And so just to wrap, we're really excited about making that easy, giving an open source solution because the open source data stack based off of data lake analytics is really happening. >> I got to ask you, you've seen many waves on the industry. Certainly, you've been through the big data waves, Steven. Sachin, you're on the cutting edge and just the cutting edge billions of signals from one client alone is pretty amazing scale and refactoring that value proposition is super important. What's different from 10 years ago when the Hadoop, you mentioned Hadoop earlier, which is RIP, obviously the cloud killed it. We all know that. Everyone kind of knows that. But like, what's different now? I mean, skeptics might say, I don't believe you, but it's just crazy. There's no way it works. S3 costs way too much. Why is this now so much more of an attractive proposition? What do you say the naysayers out there? With Steve, we'll start with you and then Sachin, I want you to like weigh in too. >> Yeah. Well, if you think about the Hadoop era and if you look at slide three, it was a very complicated system that was done mainly on-prem. And you'd have to go and set up a big data team and a rack and stack a bunch of servers and then try to put all this stuff together and candidly, the results and the outcomes of that were very hard to get unless you had the best possible teams and invested a lot of money in this. What you saw in this slide was that, that right hand side which shows the stack. Now you have a separate compute, which is based off of Intel based instances in the cloud. We run the best in that and they're part of the Presto foundation. And that's now data lakes. Now the distributed compute engines are the ones that have become very much easier. So the big difference in what I see is no longer called big data. It's just called data analytics because it's now become commodified as being easy and the bar is much, much lower, so everyone can get the benefit of this across industries, across organizations. I mean, that's good for the world, reduces the security threats, the ransomware, in the case of Securonix and Sachin here. But every company can benefit from this. >> Sachin, this is really as an example in my mind and you can comment too on if you'd believe or not, but replatform with the cloud, that's a no brainer. People do that. They did it. But the value is refactoring in the cloud. It's thinking differently with the assets you have and making sure you're using the right pieces. I mean, there's no brainer, you know it's good. If it costs more money to stand up something than to like get value out of something that's operating at scale, much easier equation. What's your thoughts on this? Go back 10 years and where we are now, what's different? I mean, replatforming, refactoring, all kinds of happening. What's your take on all this? >> Agreed, John. So we have been in business now for about 10 to 11 years. And when we started my hair was all black. Okay. >> John: You're so silly. >> Okay. So this, everything has happened here is the transition from Hadoop to cloud. Okay. This is what the result has been. So people can see it for themselves. So when we started off with deep partnerships with the Hadoop providers and again, Hadoop is the foundation, which has now become EMR and everything else that AWS and other companies have picked up. But when you start with some basic premise, first, the racking and stacking of hardware, companies having to project their entire data volume upfront, bringing the servers and have 50, 100, 500 servers sitting in their data centers. And then when there are spikes in data, or like I said, as you move to the cloud, your data volume will increase between five to 20x and projecting for that. And then think about the agility that it will take you three to six months to bring in new servers and then bring them into the architecture. So big issue. Number two big issue is that the backend of that was built for HDFS. So Hadoop in my mind was built to ingest large amounts of data in batches and then perform some spark jobs on it, some analytics. But we are talking in security about real time, high velocity, high variety data, which has to be available in real time. It wasn't built for that, to be honest. So what was happening is, again, even if you look at the Hadoop companies today as they have kind of figured, kind of define their next generation, they have moved from HDFS to now kind of a cloud based platform capability and have discarded the traditional HDFS architecture because it just wasn't scaling, wasn't searching fast enough, wasn't searching fast enough for hundreds of analysts at the same time. And then obviously, the servers, et cetera wasn't working. Then when we worked with the Hadoop companies, they were always two to three versions behind for the individual services that they had brought together. And again, when you're talking about this kind of a volume, you need to be on the cutting edge always of the technologies underneath that. So even while we were working with them, we had to support our own versions of Kafka, Solr, Zookeeper, et cetera to really bring it together and provide our customers this capability. So now when we have moved to the cloud with solutions like EMR behind us, AWS has invested in in solutions like EMR to make them scalable, to have scale and then scale out, which traditional Hadoop did not provide because they missed the cloud wave. And then on top of that, again, rather than throwing data in that traditional older HDFS format, we are now taking the same format, the parquet format that it supports, putting it in S3 and now making it available and using all the capabilities like you said, the refactoring of that is critical. That rather than on-prem having servers and redundancies with S3, we get built in redundancy. We get built in life cycle management, high degree of confidence data reliability. And then we get all this innovation from companies like, from groups like Presto, companies like Ahana sitting on double that S3. And the last item I would say is in the cloud we are now able to offer multiple, have multiple resilient options on our side. So for example, with us, we still have some premium searching going on with solutions like Solr and Elasticsearch, then you have Presto and Ahana providing majority of our searching, but we still have Athena as a backup in case something goes down in the architecture. Our queries will spin back up to Athena, AWS service on Presto and customers will still get served. So all of these options, but what it doesn't cost us anything, Athena, if we don't use it, but all of these options are not available on-prem. So in my mind, I mean, it's a whole new world we are living in. It is a world where now we have made it possible for companies to even enterprises to even think about having true security data lakes, which are useful and having real-time analytics. From my perspective, I don't even sign up today for a large enterprise that wants to build a data lake on-prem because I know that is not, that is going to be a very difficult project to make it successful. So we've come a long way and there are several details around this that we've kind of endured through the process, but very excited where we are today. >> Well, we certainly follow up with theCUBE on all your your endeavors. Quickly on Ahana, why them, why their solution? In your words, what would be the advice you'd give me if I'm like, okay, I'm looking at this, why do I want to use it, and what's your experience? >> Right. So the standard SQL query engine for data lake analytics, more and more people have more data, want to have something that's based on open source, based on open formats, gives you that flexibility, pay as you go. You only pay for what you use. And so it proved to be the best option for Securonix to create a self-service system that has all the speed and performance and scalability that they need, which is based off of the innovation from the large companies like Facebook, Uber, Twitter. They've all invested heavily. We contribute to the open source project. It's a vibrant community. We encourage people to join the community and even Securonix, we'll be having engineers that are contributing to the project as well. I think, is that right Sachin? Maybe you could share a little bit about your thoughts on being part of the community. >> Yeah. So also why we chose Ahana, like John said. The first reason is you see Steven is always smiling. Okay. >> That's for sure. >> That is very important. I mean, jokes apart, you need a great partner. You need a great partner. You need a partner with a great attitude because this is not a sprint, this is a marathon. So the Ahana founders, Steven, the whole team, they're world-class, they're world-class. The depth that the CTO has, his experience, the depth that Dipti has, who's running the cloud solution. These guys are world-class. They are very involved in the community. We evaluated them from a community perspective. They are very involved. They have the depth of really commercializing an open source solution without making it too commercial. The right balance, where the founding companies like Facebook and Uber, and hopefully Securonix in the future as we contribute more and more will have our say and they act like the right stewards in this journey and then contribute as well. So and then they have chosen the right niche rather than taking portions of the product and making it proprietary. They have put in the effort towards the cloud infrastructure of making that product available easily on the cloud. So I think it's sort of a no-brainer from our side. Once we chose Presto, Ahana was the no-brainer and just the partnership so far has been very exciting and I'm looking forward to great things together. >> Likewise Sachin, thanks so much for that. And we've only found your team, you're world-class as well, and working together and we look forward to working in the community also in the Presto foundation. So thanks for that. >> Guys, great partnership. Great insight and really, this is a great example of cloud scale, cloud value proposition as it unlocks new benefits. Open source, managed services, refactoring the opportunities to create more value. Stephen, Sachin, thank you so much for sharing your story here on open data lakes. Can open always wins in my mind. This is theCUBE we're always open and we're showcasing all the hot startups coming out of the AWS ecosystem for the AWS Startup Showcase. I'm John Furrier, your host. Thanks for watching. (bright music)

Published Date : Jun 24 2021

SUMMARY :

leaders all around the world, of the AWS Startup Showcase. to help us through this, and provide all the what's going on with you guys, in the cloud and making it easy to use. Let's get into the Securonix So in the past, what was So in any event, Securonix on the cloud Some are saying that the and that's the S3-based data in the Linux foundation or open meaning And Presto is the layer in because I get the open data layer. and all the other functions that piece is killer. and learn from as the new architecture for everyone else in the future. obviously the cloud killed it. and the bar is much, much lower, But the value is refactoring in the cloud. So we have been in business and again, Hadoop is the foundation, be the advice you'd give me system that has all the speed The first reason is you see and just the partnership so in the community also in for the AWS Startup Showcase.

ENTITIES

Entity	Category	Confidence
Steven	PERSON	0.99+
Sachin	PERSON	0.99+
John	PERSON	0.99+
Steve	PERSON	0.99+
Securonix	ORGANIZATION	0.99+
AWS	ORGANIZATION	0.99+
John Furrier	PERSON	0.99+
Steven Mih	PERSON	0.99+
50	QUANTITY	0.99+
Uber	ORGANIZATION	0.99+
2010	DATE	0.99+
Stephen	PERSON	0.99+
Sachin Nayyar	PERSON	0.99+
Facebook	ORGANIZATION	0.99+
20 times	QUANTITY	0.99+
one	QUANTITY	0.99+
12 months	QUANTITY	0.99+
three	QUANTITY	0.99+
Twitter	ORGANIZATION	0.99+
Ahana	PERSON	0.99+
two customers	QUANTITY	0.99+
90 days	QUANTITY	0.99+
Ahana	ORGANIZATION	0.99+
Palo Alto	LOCATION	0.99+
100	QUANTITY	0.99+
30 minutes	QUANTITY	0.99+
Presto	ORGANIZATION	0.99+
hundreds of terabytes	QUANTITY	0.99+
five	QUANTITY	0.99+
First	QUANTITY	0.99+
One	QUANTITY	0.99+
two	QUANTITY	0.99+
hundreds	QUANTITY	0.99+
six months	QUANTITY	0.99+
S3	TITLE	0.99+
Zookeeper	TITLE	0.99+

Data Power Panel V3

(upbeat music) >> The stampede to cloud and massive VC investments has led to the emergence of a new generation of object store based data lakes. And with them two important trends, actually three important trends. First, a new category that combines data lakes and data warehouses aka the lakehouse is emerged as a leading contender to be the data platform of the future. And this novelty touts the ability to address data engineering, data science, and data warehouse workloads on a single shared data platform. The other major trend we've seen is query engines and broader data fabric virtualization platforms have embraced NextGen data lakes as platforms for SQL centric business intelligence workloads, reducing, or somebody even claim eliminating the need for separate data warehouses. Pretty bold. However, cloud data warehouses have added complimentary technologies to bridge the gaps with lakehouses. And the third is many, if not most customers that are embracing the so-called data fabric or data mesh architectures. They're looking at data lakes as a fundamental component of their strategies, and they're trying to evolve them to be more capable, hence the interest in lakehouse, but at the same time, they don't want to, or can't abandon their data warehouse estate. As such we see a battle royale is brewing between cloud data warehouses and cloud lakehouses. Is it possible to do it all with one cloud center analytical data platform? Well, we're going to find out. My name is Dave Vellante and welcome to the data platform's power panel on theCUBE. Our next episode in a series where we gather some of the industry's top analysts to talk about one of our favorite topics, data. In today's session, we'll discuss trends, emerging options, and the trade offs of various approaches and we'll name names. Joining us today are Sanjeev Mohan, who's the principal at SanjMo, Tony Baers, principal at dbInsight. And Doug Henschen is the vice president and principal analyst at Constellation Research. Guys, welcome back to theCUBE. Great to see you again. >> Thank guys. Thank you. >> Thank you. >> So it's early June and we're gearing up with two major conferences, there's several database conferences, but two in particular that were very interested in, Snowflake Summit and Databricks Data and AI Summit. Doug let's start off with you and then Tony and Sanjeev, if you could kindly weigh in. Where did this all start, Doug? The notion of lakehouse. And let's talk about what exactly we mean by lakehouse. Go ahead. >> Yeah, well you nailed it in your intro. One platform to address BI data science, data engineering, fewer platforms, less cost, less complexity, very compelling. You can credit Databricks for coining the term lakehouse back in 2020, but it's really a much older idea. You can go back to Cloudera introducing their Impala database in 2012. That was a database on top of Hadoop. And indeed in that last decade, by the middle of that last decade, there were several SQL on Hadoop products, open standards like Apache Drill. And at the same time, the database vendors were trying to respond to this interest in machine learning and the data science. So they were adding SQL extensions, the likes Hudi and Vertical we're adding SQL extensions to support the data science. But then later in that decade with the shift to cloud and object storage, you saw the vendor shift to this whole cloud, and object storage idea. So you have in the database camp Snowflake introduce Snowpark to try to address the data science needs. They introduced that in 2020 and last year they announced support for Python. You also had Oracle, SAP jumped on this lakehouse idea last year, supporting both the lake and warehouse single vendor, not necessarily quite single platform. Google very recently also jumped on the bandwagon. And then you also mentioned, the SQL engine camp, the Dremios, the Ahanas, the Starbursts, really doing two things, a fabric for distributed access to many data sources, but also very firmly planning that idea that you can just have the lake and we'll help you do the BI workloads on that. And then of course, the data lake camp with the Databricks and Clouderas providing a warehouse style deployments on top of their lake platforms. >> Okay, thanks, Doug. I'd be remiss those of you who me know that I typically write my own intros. This time my colleagues fed me a lot of that material. So thank you. You guys make it easy. But Tony, give us your thoughts on this intro. >> Right. Well, I very much agree with both of you, which may not make for the most exciting television in terms of that it has been an evolution just like Doug said. I mean, for instance, just to give an example when Teradata bought AfterData was initially seen as a hardware platform play. In the end, it was basically, it was all those after functions that made a lot of sort of big data analytics accessible to SQL. (clears throat) And so what I really see just in a more simpler definition or functional definition, the data lakehouse is really an attempt by the data lake folks to make the data lake friendlier territory to the SQL folks, and also to get into friendly territory, to all the data stewards, who are basically concerned about the sprawl and the lack of control in governance in the data lake. So it's really kind of a continuing of an ongoing trend that being said, there's no action without counter action. And of course, at the other end of the spectrum, we also see a lot of the data warehouses starting to edit things like in database machine learning. So they're certainly not surrendering without a fight. Again, as Doug was mentioning, this has been part of a continual blending of platforms that we've seen over the years that we first saw in the Hadoop years with SQL on Hadoop and data warehouses starting to reach out to cloud storage or should say the HDFS and then with the cloud then going cloud native and therefore trying to break the silos down even further. >> Now, thank you. And Sanjeev, data lakes, when we first heard about them, there were such a compelling name, and then we realized all the problems associated with them. So pick it up from there. What would you add to Doug and Tony? >> I would say, these are excellent points that Doug and Tony have brought to light. The concept of lakehouse was going on to your point, Dave, a long time ago, long before the tone was invented. For example, in Uber, Uber was trying to do a mix of Hadoop and Vertical because what they really needed were transactional capabilities that Hadoop did not have. So they weren't calling it the lakehouse, they were using multiple technologies, but now they're able to collapse it into a single data store that we call lakehouse. Data lakes, excellent at batch processing large volumes of data, but they don't have the real time capabilities such as change data capture, doing inserts and updates. So this is why lakehouse has become so important because they give us these transactional capabilities. >> Great. So I'm interested, the name is great, lakehouse. The concept is powerful, but I get concerned that it's a lot of marketing hype behind it. So I want to examine that a bit deeper. How mature is the concept of lakehouse? Are there practical examples that really exist in the real world that are driving business results for practitioners? Tony, maybe you could kick that off. >> Well, put it this way. I think what's interesting is that both data lakes and data warehouse that each had to extend themselves. To believe the Databricks hype it's that this was just a natural extension of the data lake. In point of fact, Databricks had to go outside its core technology of Spark to make the lakehouse possible. And it's a very similar type of thing on the part with data warehouse folks, in terms of that they've had to go beyond SQL, In the case of Databricks. There have been a number of incremental improvements to Delta lake, to basically make the table format more performative, for instance. But the other thing, I think the most dramatic change in all that is in their SQL engine and they had to essentially pretty much abandon Spark SQL because it really, in off itself Spark SQL is essentially stop gap solution. And if they wanted to really address that crowd, they had to totally reinvent SQL or at least their SQL engine. And so Databricks SQL is not Spark SQL, it is not Spark, it's basically SQL that it's adapted to run in a Spark environment, but the underlying engine is C++, it's not scale or anything like that. So Databricks had to take a major detour outside of its core platform to do this. So to answer your question, this is not mature because these are all basically kind of, even though the idea of blending platforms has been going on for well over a decade, I would say that the current iteration is still fairly immature. And in the cloud, I could see a further evolution of this because if you think through cloud native architecture where you're essentially abstracting compute from data, there is no reason why, if let's say you are dealing with say, the same basically data targets say cloud storage, cloud object storage that you might not apportion the task to different compute engines. And so therefore you could have, for instance, let's say you're Google, you could have BigQuery, perform basically the types of the analytics, the SQL analytics that would be associated with the data warehouse and you could have BigQuery ML that does some in database machine learning, but at the same time for another part of the query, which might involve, let's say some deep learning, just for example, you might go out to let's say the serverless spark service or the data proc. And there's no reason why Google could not blend all those into a coherent offering that's basically all triggered through microservices. And I just gave Google as an example, if you could generalize that with all the other cloud or all the other third party vendors. So I think we're still very early in the game in terms of maturity of data lakehouses. >> Thanks, Tony. So Sanjeev, is this all hype? What are your thoughts? >> It's not hype, but completely agree. It's not mature yet. Lakehouses have still a lot of work to do, so what I'm now starting to see is that the world is dividing into two camps. On one hand, there are people who don't want to deal with the operational aspects of vast amounts of data. They are the ones who are going for BigQuery, Redshift, Snowflake, Synapse, and so on because they want the platform to handle all the data modeling, access control, performance enhancements, but these are trade off. If you go with these platforms, then you are giving up on vendor neutrality. On the other side are those who have engineering skills. They want the independence. In other words, they don't want vendor lock in. They want to transform their data into any number of use cases, especially data science, machine learning use case. What they want is agility via open file formats using any compute engine. So why do I say lakehouses are not mature? Well, cloud data warehouses they provide you an excellent user experience. That is the main reason why Snowflake took off. If you have thousands of cables, it takes minutes to get them started, uploaded into your warehouse and start experimentation. Table formats are far more resonating with the community than file formats. But once the cost goes up of cloud data warehouse, then the organization start exploring lakehouses. But the problem is lakehouses still need to do a lot of work on metadata. Apache Hive was a fantastic first attempt at it. Even today Apache Hive is still very strong, but it's all technical metadata and it has so many different restrictions. That's why we see Databricks is investing into something called Unity Catalog. Hopefully we'll hear more about Unity Catalog at the end of the month. But there's a second problem. I just want to mention, and that is lack of standards. All these open source vendors, they're running, what I call ego projects. You see on LinkedIn, they're constantly battling with each other, but end user doesn't care. End user wants a problem to be solved. They want to use Trino, Dremio, Spark from EMR, Databricks, Ahana, DaaS, Frink, Athena. But the problem is that we don't have common standards. >> Right. Thanks. So Doug, I worry sometimes. I mean, I look at the space, we've debated for years, best of breed versus the full suite. You see AWS with whatever, 12 different plus data stores and different APIs and primitives. You got Oracle putting everything into its database. It's actually done some interesting things with MySQL HeatWave, so maybe there's proof points there, but Snowflake really good at data warehouse, simplifying data warehouse. Databricks, really good at making lakehouses actually more functional. Can one platform do it all? >> Well in a word, I can't be best at breed at all things. I think the upshot of and cogen analysis from Sanjeev there, the database, the vendors coming out of the database tradition, they excel at the SQL. They're extending it into data science, but when it comes to unstructured data, data science, ML AI often a compromise, the data lake crowd, the Databricks and such. They've struggled to completely displace the data warehouse when it really gets to the tough SLAs, they acknowledge that there's still a role for the warehouse. Maybe you can size down the warehouse and offload some of the BI workloads and maybe and some of these SQL engines, good for ad hoc, minimize data movement. But really when you get to the deep service level, a requirement, the high concurrency, the high query workloads, you end up creating something that's warehouse like. >> Where do you guys think this market is headed? What's going to take hold? Which projects are going to fade away? You got some things in Apache projects like Hudi and Iceberg, where do they fit Sanjeev? Do you have any thoughts on that? >> So thank you, Dave. So I feel that table formats are starting to mature. There is a lot of work that's being done. We will not have a single product or single platform. We'll have a mixture. So I see a lot of Apache Iceberg in the news. Apache Iceberg is really innovating. Their focus is on a table format, but then Delta and Apache Hudi are doing a lot of deep engineering work. For example, how do you handle high concurrency when there are multiple rights going on? Do you version your Parquet files or how do you do your upcerts basically? So different focus, at the end of the day, the end user will decide what is the right platform, but we are going to have multiple formats living with us for a long time. >> Doug is Iceberg in your view, something that's going to address some of those gaps in standards that Sanjeev was talking about earlier? >> Yeah, Delta lake, Hudi, Iceberg, they all address this need for consistency and scalability, Delta lake open technically, but open for access. I don't hear about Delta lakes in any worlds, but Databricks, hearing a lot of buzz about Apache Iceberg. End users want an open performance standard. And most recently Google embraced Iceberg for its recent a big lake, their stab at having supporting both lakes and warehouses on one conjoined platform. >> And Tony, of course, you remember the early days of the sort of big data movement you had MapR was the most closed. You had Horton works the most open. You had Cloudera in between. There was always this kind of contest as to who's the most open. Does that matter? Are we going to see a repeat of that here? >> I think it's spheres of influence, I think, and Doug very much was kind of referring to this. I would call it kind of like the MongoDB syndrome, which is that you have... and I'm talking about MongoDB before they changed their license, open source project, but very much associated with MongoDB, which basically, pretty much controlled most of the contributions made decisions. And I think Databricks has the same iron cloud hold on Delta lake, but still the market is pretty much associated Delta lake as the Databricks, open source project. I mean, Iceberg is probably further advanced than Hudi in terms of mind share. And so what I see that's breaking down to is essentially, basically the Databricks open source versus the everything else open source, the community open source. So I see it's a very similar type of breakdown that I see repeating itself here. >> So by the way, Mongo has a conference next week, another data platform is kind of not really relevant to this discussion totally. But in the sense it is because there's a lot of discussion on earnings calls these last couple of weeks about consumption and who's exposed, obviously people are concerned about Snowflake's consumption model. Mongo is maybe less exposed because Atlas is prominent in the portfolio, blah, blah, blah. But I wanted to bring up the little bit of controversy that we saw come out of the Snowflake earnings call, where the ever core analyst asked Frank Klutman about discretionary spend. And Frank basically said, look, we're not discretionary. We are deeply operationalized. Whereas he kind of poo-pooed the lakehouse or the data lake, et cetera, saying, oh yeah, data scientists will pull files out and play with them. That's really not our business. Do any of you have comments on that? Help us swing through that controversy. Who wants to take that one? >> Let's put it this way. The SQL folks are from Venus and the data scientists are from Mars. So it means it really comes down to it, sort that type of perception. The fact is, is that, traditionally with analytics, it was very SQL oriented and that basically the quants were kind of off in their corner, where they're using SaaS or where they're using Teradata. It's really a great leveler today, which is that, I mean basic Python it's become arguably one of the most popular programming languages, depending on what month you're looking at, at the title index. And of course, obviously SQL is, as I tell the MongoDB folks, SQL is not going away. You have a large skills base out there. And so basically I see this breaking down to essentially, you're going to have each group that's going to have its own natural preferences for its home turf. And the fact that basically, let's say the Python and scale of folks are using Databricks does not make them any less operational or machine critical than the SQL folks. >> Anybody else want to chime in on that one? >> Yeah, I totally agree with that. Python support in Snowflake is very nascent with all of Snowpark, all of the things outside of SQL, they're very much relying on partners too and make things possible and make data science possible. And it's very early days. I think the bottom line, what we're going to see is each of these camps is going to keep working on doing better at the thing that they don't do today, or they're new to, but they're not going to nail it. They're not going to be best of breed on both sides. So the SQL centric companies and shops are going to do more data science on their database centric platform. That data science driven companies might be doing more BI on their leagues with those vendors and the companies that have highly distributed data, they're going to add fabrics, and maybe offload more of their BI onto those engines, like Dremio and Starburst. >> So I've asked you this before, but I'll ask you Sanjeev. 'Cause Snowflake and Databricks are such great examples 'cause you have the data engineering crowd trying to go into data warehousing and you have the data warehousing guys trying to go into the lake territory. Snowflake has $5 billion in the balance sheet and I've asked you before, I ask you again, doesn't there has to be a semantic layer between these two worlds? Does Snowflake go out and do M&A and maybe buy ad scale or a data mirror? Or is that just sort of a bandaid? What are your thoughts on that Sanjeev? >> I think semantic layer is the metadata. The business metadata is extremely important. At the end of the day, the business folks, they'd rather go to the business metadata than have to figure out, for example, like let's say, I want to update somebody's email address and we have a lot of overhead with data residency laws and all that. I want my platform to give me the business metadata so I can write my business logic without having to worry about which database, which location. So having that semantic layer is extremely important. In fact, now we are taking it to the next level. Now we are saying that it's not just a semantic layer, it's all my KPIs, all my calculations. So how can I make those calculations independent of the compute engine, independent of the BI tool and make them fungible. So more disaggregation of the stack, but it gives us more best of breed products that the customers have to worry about. >> So I want to ask you about the stack, the modern data stack, if you will. And we always talk about injecting machine intelligence, AI into applications, making them more data driven. But when you look at the application development stack, it's separate, the database is tends to be separate from the data and analytics stack. Do those two worlds have to come together in the modern data world? And what does that look like organizationally? >> So organizationally even technically I think it is starting to happen. Microservices architecture was a first attempt to bring the application and the data world together, but they are fundamentally different things. For example, if an application crashes, that's horrible, but Kubernetes will self heal and it'll bring the application back up. But if a database crashes and corrupts your data, we have a huge problem. So that's why they have traditionally been two different stacks. They are starting to come together, especially with data ops, for instance, versioning of the way we write business logic. It used to be, a business logic was highly embedded into our database of choice, but now we are disaggregating that using GitHub, CICD the whole DevOps tool chain. So data is catching up to the way applications are. >> We also have databases, that trans analytical databases that's a little bit of what the story is with MongoDB next week with adding more analytical capabilities. But I think companies that talk about that are always careful to couch it as operational analytics, not the warehouse level workloads. So we're making progress, but I think there's always going to be, or there will long be a separate analytical data platform. >> Until data mesh takes over. (all laughing) Not opening a can of worms. >> Well, but wait, I know it's out of scope here, but wouldn't data mesh say, hey, do take your best of breed to Doug's earlier point. You can't be best of breed at everything, wouldn't data mesh advocate, data lakes do your data lake thing, data warehouse, do your data lake, then you're just a node on the mesh. (Tony laughs) Now you need separate data stores and you need separate teams. >> To my point. >> I think, I mean, put it this way. (laughs) Data mesh itself is a logical view of the world. The data mesh is not necessarily on the lake or on the warehouse. I think for me, the fear there is more in terms of, the silos of governance that could happen and the silo views of the world, how we redefine. And that's why and I want to go back to something what Sanjeev said, which is that it's going to be raising the importance of the semantic layer. Now does Snowflake that opens a couple of Pandora's boxes here, which is one, does Snowflake dare go into that space or do they risk basically alienating basically their partner ecosystem, which is a key part of their whole appeal, which is best of breed. They're kind of the same situation that Informatica was where in the early 2000s, when Informatica briefly flirted with analytic applications and realized that was not a good idea, need to redouble down on their core, which was data integration. The other thing though, that raises the importance of and this is where the best of breed comes in, is the data fabric. My contention is that and whether you use employee data mesh practice or not, if you do employee data mesh, you need data fabric. If you deploy data fabric, you don't necessarily need to practice data mesh. But data fabric at its core and admittedly it's a category that's still very poorly defined and evolving, but at its core, we're talking about a common meta data back plane, something that we used to talk about with master data management, this would be something that would be more what I would say basically, mutable, that would be more evolving, basically using, let's say, machine learning to kind of, so that we don't have to predefine rules or predefine what the world looks like. But so I think in the long run, what this really means is that whichever way we implement on whichever physical platform we implement, we need to all be speaking the same metadata language. And I think at the end of the day, regardless of whether it's a lake, warehouse or a lakehouse, we need common metadata. >> Doug, can I come back to something you pointed out? That those talking about bringing analytic and transaction databases together, you had talked about operationalizing those and the caution there. Educate me on MySQL HeatWave. I was surprised when Oracle put so much effort in that, and you may or may not be familiar with it, but a lot of folks have talked about that. Now it's got nowhere in the market, that no market share, but a lot of we've seen these benchmarks from Oracle. How real is that bringing together those two worlds and eliminating ETL? >> Yeah, I have to defer on that one. That's my colleague, Holger Mueller. He wrote the report on that. He's way deep on it and I'm not going to mock him. >> I wonder if that is something, how real that is or if it's just Oracle marketing, anybody have any thoughts on that? >> I'm pretty familiar with HeatWave. It's essentially Oracle doing what, I mean, there's kind of a parallel with what Google's doing with AlloyDB. It's an operational database that will have some embedded analytics. And it's also something which I expect to start seeing with MongoDB. And I think basically, Doug and Sanjeev were kind of referring to this before about basically kind of like the operational analytics, that are basically embedded within an operational database. The idea here is that the last thing you want to do with an operational database is slow it down. So you're not going to be doing very complex deep learning or anything like that, but you might be doing things like classification, you might be doing some predictives. In other words, we've just concluded a transaction with this customer, but was it less than what we were expecting? What does that mean in terms of, is this customer likely to turn? I think we're going to be seeing a lot of that. And I think that's what a lot of what MySQL HeatWave is all about. Whether Oracle has any presence in the market now it's still a pretty new announcement, but the other thing that kind of goes against Oracle, (laughs) that they had to battle against is that even though they own MySQL and run the open source project, everybody else, in terms of the actual commercial implementation it's associated with everybody else. And the popular perception has been that MySQL has been basically kind of like a sidelight for Oracle. And so it's on Oracles shoulders to prove that they're damn serious about it. >> There's no coincidence that MariaDB was launched the day that Oracle acquired Sun. Sanjeev, I wonder if we could come back to a topic that we discussed earlier, which is this notion of consumption, obviously Wall Street's very concerned about it. Snowflake dropped prices last week. I've always felt like, hey, the consumption model is the right model. I can dial it down in when I need to, of course, the street freaks out. What are your thoughts on just pricing, the consumption model? What's the right model for companies, for customers? >> Consumption model is here to stay. What I would like to see, and I think is an ideal situation and actually plays into the lakehouse concept is that, I have my data in some open format, maybe it's Parquet or CSV or JSON, Avro, and I can bring whatever engine is the best engine for my workloads, bring it on, pay for consumption, and then shut it down. And by the way, that could be Cloudera. We don't talk about Cloudera very much, but it could be one business unit wants to use Athena. Another business unit wants to use some other Trino let's say or Dremio. So every business unit is working on the same data set, see that's critical, but that data set is maybe in their VPC and they bring any compute engine, you pay for the use, shut it down. That then you're getting value and you're only paying for consumption. It's not like, I left a cluster running by mistake, so there have to be guardrails. The reason FinOps is so big is because it's very easy for me to run a Cartesian joint in the cloud and get a $10,000 bill. >> This looks like it's been a sort of a victim of its own success in some ways, they made it so easy to spin up single note instances, multi note instances. And back in the day when compute was scarce and costly, those database engines optimized every last bit so they could get as much workload as possible out of every instance. Today, it's really easy to spin up a new node, a new multi node cluster. So that freedom has meant many more nodes that aren't necessarily getting that utilization. So Snowflake has been doing a lot to add reporting, monitoring, dashboards around the utilization of all the nodes and multi node instances that have spun up. And meanwhile, we're seeing some of the traditional on-prem databases that are moving into the cloud, trying to offer that freedom. And I think they're going to have that same discovery that the cost surprises are going to follow as they make it easy to spin up new instances. >> Yeah, a lot of money went into this market over the last decade, separating compute from storage, moving to the cloud. I'm glad you mentioned Cloudera Sanjeev, 'cause they got it all started, the kind of big data movement. We don't talk about them that much. Sometimes I wonder if it's because when they merged Hortonworks and Cloudera, they dead ended both platforms, but then they did invest in a more modern platform. But what's the future of Cloudera? What are you seeing out there? >> Cloudera has a good product. I have to say the problem in our space is that there're way too many companies, there's way too much noise. We are expecting the end users to parse it out or we expecting analyst firms to boil it down. So I think marketing becomes a big problem. As far as technology is concerned, I think Cloudera did turn their selves around and Tony, I know you, you talked to them quite frequently. I think they have quite a comprehensive offering for a long time actually. They've created Kudu, so they got operational, they have Hadoop, they have an operational data warehouse, they're migrated to the cloud. They are in hybrid multi-cloud environment. Lot of cloud data warehouses are not hybrid. They're only in the cloud. >> Right. I think what Cloudera has done the most successful has been in the transition to the cloud and the fact that they're giving their customers more OnRamps to it, more hybrid OnRamps. So I give them a lot of credit there. They're also have been trying to position themselves as being the most price friendly in terms of that we will put more guardrails and governors on it. I mean, part of that could be spin. But on the other hand, they don't have the same vested interest in compute cycles as say, AWS would have with EMR. That being said, yes, Cloudera does it, I think its most powerful appeal so of that, it almost sounds in a way, I don't want to cast them as a legacy system. But the fact is they do have a huge landed legacy on-prem and still significant potential to land and expand that to the cloud. That being said, even though Cloudera is multifunction, I think it certainly has its strengths and weaknesses. And the fact this is that yes, Cloudera has an operational database or an operational data store with a kind of like the outgrowth of age base, but Cloudera is still based, primarily known for the deep analytics, the operational database nobody's going to buy Cloudera or Cloudera data platform strictly for the operational database. They may use it as an add-on, just in the same way that a lot of customers have used let's say Teradata basically to do some machine learning or let's say, Snowflake to parse through JSON. Again, it's not an indictment or anything like that, but the fact is obviously they do have their strengths and their weaknesses. I think their greatest opportunity is with their existing base because that base has a lot invested and vested. And the fact is they do have a hybrid path that a lot of the others lack. >> And of course being on the quarterly shock clock was not a good place to be under the microscope for Cloudera and now they at least can refactor the business accordingly. I'm glad you mentioned hybrid too. We saw Snowflake last month, did a deal with Dell whereby non-native Snowflake data could access on-prem object store from Dell. They announced a similar thing with pure storage. What do you guys make of that? Is that just... How significant will that be? Will customers actually do that? I think they're using either materialized views or extended tables. >> There are data rated and residency requirements. There are desires to have these platforms in your own data center. And finally they capitulated, I mean, Frank Klutman is famous for saying to be very focused and earlier, not many months ago, they called the going on-prem as a distraction, but clearly there's enough demand and certainly government contracts any company that has data residency requirements, it's a real need. So they finally addressed it. >> Yeah, I'll bet dollars to donuts, there was an EBC session and some big customer said, if you don't do this, we ain't doing business with you. And that was like, okay, we'll do it. >> So Dave, I have to say, earlier on you had brought this point, how Frank Klutman was poo-pooing data science workloads. On your show, about a year or so ago, he said, we are never going to on-prem. He burnt that bridge. (Tony laughs) That was on your show. >> I remember exactly the statement because it was interesting. He said, we're never going to do the halfway house. And I think what he meant is we're not going to bring the Snowflake architecture to run on-prem because it defeats the elasticity of the cloud. So this was kind of a capitulation in a way. But I think it still preserves his original intent sort of, I don't know. >> The point here is that every vendor will poo-poo whatever they don't have until they do have it. >> Yes. >> And then it'd be like, oh, we are all in, we've always been doing this. We have always supported this and now we are doing it better than others. >> Look, it was the same type of shock wave that we felt basically when AWS at the last moment at one of their reinvents, oh, by the way, we're going to introduce outposts. And the analyst group is typically pre briefed about a week or two ahead under NDA and that was not part of it. And when they dropped, they just casually dropped that in the analyst session. It's like, you could have heard the sound of lots of analysts changing their diapers at that point. >> (laughs) I remember that. And a props to Andy Jassy who once, many times actually told us, never say never when it comes to AWS. So guys, I know we got to run. We got some hard stops. Maybe you could each give us your final thoughts, Doug start us off and then-- >> Sure. Well, we've got the Snowflake Summit coming up. I'll be looking for customers that are really doing data science, that are really employing Python through Snowflake, through Snowpark. And then a couple weeks later, we've got Databricks with their Data and AI Summit in San Francisco. I'll be looking for customers that are really doing considerable BI workloads. Last year I did a market overview of this analytical data platform space, 14 vendors, eight of them claim to support lakehouse, both sides of the camp, Databricks customer had 32, their top customer that they could site was unnamed. It had 32 concurrent users doing 15,000 queries per hour. That's good but it's not up to the most demanding BI SQL workloads. And they acknowledged that and said, they need to keep working that. Snowflake asked for their biggest data science customer, they cited Kabura, 400 terabytes, 8,500 users, 400,000 data engineering jobs per day. I took the data engineering job to be probably SQL centric, ETL style transformation work. So I want to see the real use of the Python, how much Snowpark has grown as a way to support data science. >> Great. Tony. >> Actually of all things. And certainly, I'll also be looking for similar things in what Doug is saying, but I think sort of like, kind of out of left field, I'm interested to see what MongoDB is going to start to say about operational analytics, 'cause I mean, they're into this conquer the world strategy. We can be all things to all people. Okay, if that's the case, what's going to be a case with basically, putting in some inline analytics, what are you going to be doing with your query engine? So that's actually kind of an interesting thing we're looking for next week. >> Great. Sanjeev. >> So I'll be at MongoDB world, Snowflake and Databricks and very interested in seeing, but since Tony brought up MongoDB, I see that even the databases are shifting tremendously. They are addressing both the hashtag use case online, transactional and analytical. I'm also seeing that these databases started in, let's say in case of MySQL HeatWave, as relational or in MongoDB as document, but now they've added graph, they've added time series, they've added geospatial and they just keep adding more and more data structures and really making these databases multifunctional. So very interesting. >> It gets back to our discussion of best of breed, versus all in one. And it's likely Mongo's path or part of their strategy of course, is through developers. They're very developer focused. So we'll be looking for that. And guys, I'll be there as well. I'm hoping that we maybe have some extra time on theCUBE, so please stop by and we can maybe chat a little bit. Guys as always, fantastic. Thank you so much, Doug, Tony, Sanjeev, and let's do this again. >> It's been a pleasure. >> All right and thank you for watching. This is Dave Vellante for theCUBE and the excellent analyst. We'll see you next time. (upbeat music)

Published Date : Jun 2 2022

SUMMARY :

And Doug Henschen is the vice president Thank you. Doug let's start off with you And at the same time, me a lot of that material. And of course, at the and then we realized all the and Tony have brought to light. So I'm interested, the And in the cloud, So Sanjeev, is this all hype? But the problem is that we I mean, I look at the space, and offload some of the So different focus, at the end of the day, and warehouses on one conjoined platform. of the sort of big data movement most of the contributions made decisions. Whereas he kind of poo-pooed the lakehouse and the data scientists are from Mars. and the companies that have in the balance sheet that the customers have to worry about. the modern data stack, if you will. and the data world together, the story is with MongoDB Until data mesh takes over. and you need separate teams. that raises the importance of and the caution there. Yeah, I have to defer on that one. The idea here is that the of course, the street freaks out. and actually plays into the And back in the day when the kind of big data movement. We are expecting the end And the fact is they do have a hybrid path refactor the business accordingly. saying to be very focused And that was like, okay, we'll do it. So Dave, I have to say, the Snowflake architecture to run on-prem The point here is that and now we are doing that in the analyst session. And a props to Andy Jassy and said, they need to keep working that. Great. Okay, if that's the case, Great. I see that even the databases I'm hoping that we maybe have and the excellent analyst.

ENTITIES

Entity	Category	Confidence
Doug	PERSON	0.99+
Dave Vellante	PERSON	0.99+
Dave	PERSON	0.99+
Tony	PERSON	0.99+
Uber	ORGANIZATION	0.99+
Frank	PERSON	0.99+
Frank Klutman	PERSON	0.99+
Tony Baers	PERSON	0.99+
Mars	LOCATION	0.99+
Doug Henschen	PERSON	0.99+
2020	DATE	0.99+
AWS	ORGANIZATION	0.99+
Venus	LOCATION	0.99+
Oracle	ORGANIZATION	0.99+
2012	DATE	0.99+
Databricks	ORGANIZATION	0.99+
Dell	ORGANIZATION	0.99+
Hortonworks	ORGANIZATION	0.99+
Holger Mueller	PERSON	0.99+
Andy Jassy	PERSON	0.99+
last year	DATE	0.99+
$5 billion	QUANTITY	0.99+
$10,000	QUANTITY	0.99+
14 vendors	QUANTITY	0.99+
Last year	DATE	0.99+
last week	DATE	0.99+
San Francisco	LOCATION	0.99+
SanjMo	ORGANIZATION	0.99+
Google	ORGANIZATION	0.99+
8,500 users	QUANTITY	0.99+
Sanjeev	PERSON	0.99+
Informatica	ORGANIZATION	0.99+
32 concurrent users	QUANTITY	0.99+
two	QUANTITY	0.99+
Constellation Research	ORGANIZATION	0.99+
Mongo	ORGANIZATION	0.99+
Sanjeev Mohan	PERSON	0.99+
Ahana	ORGANIZATION	0.99+
DaaS	ORGANIZATION	0.99+
EMR	ORGANIZATION	0.99+
32	QUANTITY	0.99+
Atlas	ORGANIZATION	0.99+
Delta	ORGANIZATION	0.99+
Snowflake	ORGANIZATION	0.99+
Python	TITLE	0.99+
each	QUANTITY	0.99+
Athena	ORGANIZATION	0.99+
next week	DATE	0.99+

Recommend Videos

Sentiment Analysis

AWS Comprehend

Search Results for Ahana: