Image Title

Search Results for Airflow:

Steven Hillion & Jeff Fletcher, Astronomer | AWS Startup Showcase S3E1


 

(upbeat music) >> Welcome everyone to theCUBE's presentation of the AWS Startup Showcase AI/ML Top Startups Building Foundation Model Infrastructure. This is season three, episode one of our ongoing series covering exciting startups from the AWS ecosystem to talk about data and analytics. I'm your host, Lisa Martin and today we're excited to be joined by two guests from Astronomer. Steven Hillion joins us, it's Chief Data Officer and Jeff Fletcher, it's director of ML. They're here to talk about machine learning and data orchestration. Guys, thank you so much for joining us today. >> Thank you. >> It's great to be here. >> Before we get into machine learning let's give the audience an overview of Astronomer. Talk about what that is, Steven. Talk about what you mean by data orchestration. >> Yeah, let's start with Astronomer. We're the Airflow company basically. The commercial developer behind the open-source project, Apache Airflow. I don't know if you've heard of Airflow. It's sort of de-facto standard these days for orchestrating data pipelines, data engineering pipelines, and as we'll talk about later, machine learning pipelines. It's really is the de-facto standard. I think we're up to about 12 million downloads a month. That's actually as a open-source project. I think at this point it's more popular by some measures than Slack. Airflow was created by Airbnb some years ago to manage all of their data pipelines and manage all of their workflows and now it powers the data ecosystem for organizations as diverse as Electronic Arts, Conde Nast is one of our big customers, a big user of Airflow. And also not to mention the biggest banks on Wall Street use Airflow and Astronomer to power the flow of data throughout their organizations. >> Talk about that a little bit more, Steven, in terms of the business impact. You mentioned some great customer names there. What is the business impact or outcomes that a data orchestration strategy enables businesses to achieve? >> Yeah, I mean, at the heart of it is quite simply, scheduling and managing data pipelines. And so if you have some enormous retailer who's managing the flow of information throughout their organization they may literally have thousands or even tens of thousands of data pipelines that need to execute every day to do things as simple as delivering metrics for the executives to consume at the end of the day, to producing on a weekly basis new machine learning models that can be used to drive product recommendations. One of our customers, for example, is a British food delivery service. And you get those recommendations in your application that says, "Well, maybe you want to have samosas with your curry." That sort of thing is powered by machine learning models that they train on a regular basis to reflect changing conditions in the market. And those are produced through Airflow and through the Astronomer platform, which is essentially a managed platform for running airflow. So at its simplest it really is just scheduling and managing those workflows. But that's easier said than done of course. I mean if you have 10 thousands of those things then you need to make sure that they all run that they all have sufficient compute resources. If things fail, how do you track those down across those 10,000 workflows? How easy is it for an average data scientist or data engineer to contribute their code, their Python notebooks or their SQL code into a production environment? And then you've got reproducibility, governance, auditing, like managing data flows across an organization which we think of as orchestrating them is much more than just scheduling. It becomes really complicated pretty quickly. >> I imagine there's a fair amount of complexity there. Jeff, let's bring you into the conversation. Talk a little bit about Astronomer through your lens, data orchestration and how it applies to MLOps. >> So I come from a machine learning background and for me the interesting part is that machine learning requires the expansion into orchestration. A lot of the same things that you're using to go and develop and build pipelines in a standard data orchestration space applies equally well in a machine learning orchestration space. What you're doing is you're moving data between different locations, between different tools, and then tasking different types of tools to act on that data. So extending it made logical sense from a implementation perspective. And a lot of my focus at Astronomer is really to explain how Airflow can be used well in a machine learning context. It is being used well, it is being used a lot by the customers that we have and also by users of the open source version. But it's really being able to explain to people why it's a natural extension for it and how well it fits into that. And a lot of it is also extending some of the infrastructure capabilities that Astronomer provides to those customers for them to be able to run some of the more platform specific requirements that come with doing machine learning pipelines. >> Let's get into some of the things that make Astronomer unique. Jeff, sticking with you, when you're in customer conversations, what are some of the key differentiators that you articulate to customers? >> So a lot of it is that we are not specific to one cloud provider. So we have the ability to operate across all of the big cloud providers. I know, I'm certain we have the best developers that understand how best practices implementations for data orchestration works. So we spend a lot of time talking to not just the business outcomes and the business users of the product, but also also for the technical people, how to help them better implement things that they may have come across on a Stack Overflow article or not necessarily just grown with how the product has migrated. So it's the ability to run it wherever you need to run it and also our ability to help you, the customer, better implement and understand those workflows that I think are two of the primary differentiators that we have. >> Lisa: Got it. >> I'll add another one if you don't mind. >> You can go ahead, Steven. >> Is lineage and dependencies between workflows. One thing we've done is to augment core Airflow with Lineage services. So using the Open Lineage framework, another open source framework for tracking datasets as they move from one workflow to another one, team to another, one data source to another is a really key component of what we do and we bundle that within the service so that as a developer or as a production engineer, you really don't have to worry about lineage, it just happens. Jeff, may show us some of this later that you can actually see as data flows from source through to a data warehouse out through a Python notebook to produce a predictive model or a dashboard. Can you see how those data products relate to each other? And when something goes wrong, figure out what upstream maybe caused the problem, or if you're about to change something, figure out what the impact is going to be on the rest of the organization. So Lineage is a big deal for us. >> Got it. >> And just to add on to that, the other thing to think about is that traditional Airflow is actually a complicated implementation. It required quite a lot of time spent understanding or was almost a bespoke language that you needed to be able to develop in two write these DAGs, which is like fundamental pipelines. So part of what we are focusing on is tooling that makes it more accessible to say a data analyst or a data scientist who doesn't have or really needs to gain the necessary background in how the semantics of Airflow DAGs works to still be able to get the benefit of what Airflow can do. So there is new features and capabilities built into the astronomer cloud platform that effectively obfuscates and removes the need to understand some of the deep work that goes on. But you can still do it, you still have that capability, but we are expanding it to be able to have orchestrated and repeatable processes accessible to more teams within the business. >> In terms of accessibility to more teams in the business. You talked about data scientists, data analysts, developers. Steven, I want to talk to you, as the chief data officer, are you having more and more conversations with that role and how is it emerging and evolving within your customer base? >> Hmm. That's a good question, and it is evolving because I think if you look historically at the way that Airflow has been used it's often from the ground up. You have individual data engineers or maybe single data engineering teams who adopt Airflow 'cause it's very popular. Lots of people know how to use it and they bring it into an organization and say, "Hey, let's use this to run our data pipelines." But then increasingly as you turn from pure workflow management and job scheduling to the larger topic of orchestration you realize it gets pretty complicated, you want to have coordination across teams, and you want to have standardization for the way that you manage your data pipelines. And so having a managed service for Airflow that exists in the cloud is easy to spin up as you expand usage across the organization. And thinking long term about that in the context of orchestration that's where I think the chief data officer or the head of analytics tends to get involved because they really want to think of this as a strategic investment that they're making. Not just per team individual Airflow deployments, but a network of data orchestrators. >> That network is key. Every company these days has to be a data company. We talk about companies being data driven. It's a common word, but it's true. It's whether it is a grocer or a bank or a hospital, they've got to be data companies. So talk to me a little bit about Astronomer's business model. How is this available? How do customers get their hands on it? >> Jeff, go ahead. >> Yeah, yeah. So we have a managed cloud service and we have two modes of operation. One, you can bring your own cloud infrastructure. So you can say here is an account in say, AWS or Azure and we can go and deploy the necessary infrastructure into that, or alternatively we can host everything for you. So it becomes a full SaaS offering. But we then provide a platform that connects at the backend to your internal IDP process. So however you are authenticating users to make sure that the correct people are accessing the services that they need with role-based access control. From there we are deploying through Kubernetes, the different services and capabilities into either your cloud account or into an account that we host. And from there Airflow does what Airflow does, which is its ability to then reach to different data systems and data platforms and to then run the orchestration. We make sure we do it securely, we have all the necessary compliance certifications required for GDPR in Europe and HIPAA based out of the US, and a whole bunch host of others. So it is a secure platform that can run in a place that you need it to run, but it is a managed Airflow that includes a lot of the extra capabilities like the cloud developer environment and the open lineage services to enhance the overall airflow experience. >> Enhance the overall experience. So Steven, going back to you, if I'm a Conde Nast or another organization, what are some of the key business outcomes that I can expect? As one of the things I think we've learned during the pandemic is access to realtime data is no longer a nice to have for organizations. It's really an imperative. It's that demanding consumer that wants to have that personalized, customized, instant access to a product or a service. So if I'm a Conde Nast or I'm one of your customers, what can I expect my business to be able to achieve as a result of data orchestration? >> Yeah, I think in a nutshell it's about providing a reliable, scalable, and easy to use service for developing and running data workflows. And talking of demanding customers, I mean, I'm actually a customer myself, as you mentioned, I'm the head of data for Astronomer. You won't be surprised to hear that we actually use Astronomer and Airflow to run all of our data pipelines. And so I can actually talk about my experience. When I started I was of course familiar with Airflow, but it always seemed a little bit unapproachable to me if I was introducing that to a new team of data scientists. They don't necessarily want to have to think about learning something new. But I think because of the layers that Astronomer has provided with our Astro service around Airflow it was pretty easy for me to get up and running. Of course I've got an incentive for doing that. I work for the Airflow company, but we went from about, at the beginning of last year, about 500 data tasks that we were running on a daily basis to about 15,000 every day. We run something like a million data operations every month within my team. And so as one outcome, just the ability to spin up new production workflows essentially in a single day you go from an idea in the morning to a new dashboard or a new model in the afternoon, that's really the business outcome is just removing that friction to operationalizing your machine learning and data workflows. >> And I imagine too, oh, go ahead, Jeff. >> Yeah, I think to add to that, one of the things that becomes part of the business cycle is a repeatable capabilities for things like reporting, for things like new machine learning models. And the impediment that has existed is that it's difficult to take that from a team that's an analyst team who then provide that or a data science team that then provide that to the data engineering team who have to work the workflow all the way through. What we're trying to unlock is the ability for those teams to directly get access to scheduling and orchestrating capabilities so that a business analyst can have a new report for C-suite execs that needs to be done once a week, but the time to repeatability for that report is much shorter. So it is then immediately in the hands of the person that needs to see it. It doesn't have to go into a long list of to-dos for a data engineering team that's already overworked that they eventually get it to it in a month's time. So that is also a part of it is that the realizing, orchestration I think is fairly well and a lot of people get the benefit of being able to orchestrate things within a business, but it's having more people be able to do it and shorten the time that that repeatability is there is one of the main benefits from good managed orchestration. >> So a lot of workforce productivity improvements in what you're doing to simplify things, giving more people access to data to be able to make those faster decisions, which ultimately helps the end user on the other end to get that product or the service that they're expecting like that. Jeff, I understand you have a demo that you can share so we can kind of dig into this. >> Yeah, let me take you through a quick look of how the whole thing works. So our starting point is our cloud infrastructure. This is the login. You go to the portal. You can see there's a a bunch of workspaces that are available. Workspaces are like individual places for people to operate in. I'm not going to delve into all the deep technical details here, but starting point for a lot of our data science customers is we have what we call our Cloud IDE, which is a web-based development environment for writing and building out DAGs without actually having to know how the underpinnings of Airflow work. This is an internal one, something that we use. You have a notebook-like interface that lets you write python code and SQL code and a bunch of specific bespoke type of blocks if you want. They all get pulled together and create a workflow. So this is a workflow, which gets compiled to something that looks like a complicated set of Python code, which is the DAG. I then have a CICD process pipeline where I commit this through to my GitHub repo. So this comes to a repo here, which is where these DAGs that I created in the previous step exist. I can then go and say, all right, I want to see how those particular DAGs have been running. We then get to the actual Airflow part. So this is the managed Airflow component. So we add the ability for teams to fairly easily bring up an Airflow instance and write code inside our notebook-like environment to get it into that instance. So you can see it's been running. That same process that we built here that graph ends up here inside this, but you don't need to know how the fundamentals of Airflow work in order to get this going. Then we can run one of these, it runs in the background and we can manage how it goes. And from there, every time this runs, it's emitting to a process underneath, which is the open lineage service, which is the lineage integration that allows me to come in here and have a look and see this was that actual, that same graph that we built, but now it's the historic version. So I know where things started, where things are going, and how it ran. And then I can also do a comparison. So if I want to see how this particular run worked compared to one historically, I can grab one from a previous date and it will show me the comparison between the two. So that combination of managed Airflow, getting Airflow up and running very quickly, but the Cloud IDE that lets you write code and know how to get something into a repeatable format get that into Airflow and have that attached to the lineage process adds what is a complete end-to-end orchestration process for any business looking to get the benefit from orchestration. >> Outstanding. Thank you so much Jeff for digging into that. So one of my last questions, Steven is for you. This is exciting. There's a lot that you guys are enabling organizations to achieve here to really become data-driven companies. So where can folks go to get their hands on this? >> Yeah, just go to astronomer.io and we have plenty of resources. If you're new to Airflow, you can read our documentation, our guides to getting started. We have a CLI that you can download that is really I think the easiest way to get started with Airflow. But you can actually sign up for a trial. You can sign up for a guided trial where our teams, we have a team of experts, really the world experts on getting Airflow up and running. And they'll take you through that trial and allow you to actually kick the tires and see how this works with your data. And I think you'll see pretty quickly that it's very easy to get started with Airflow, whether you're doing that from the command line or doing that in our cloud service. And all of that is available on our website >> astronomer.io. Jeff, last question for you. What are you excited about? There's so much going on here. What are some of the things, maybe you can give us a sneak peek coming down the road here that prospects and existing customers should be excited about? >> I think a lot of the development around the data awareness components, so one of the things that's traditionally been complicated with orchestration is you leave your data in the place that you're operating on and we're starting to have more data processing capability being built into Airflow. And from a Astronomer perspective, we are adding more capabilities around working with larger datasets, doing bigger data manipulation with inside the Airflow process itself. And that lends itself to better machine learning implementation. So as we start to grow and as we start to get better in the machine learning context, well, in the data awareness context, it unlocks a lot more capability to do and implement proper machine learning pipelines. >> Awesome guys. Exciting stuff. Thank you so much for talking to me about Astronomer, machine learning, data orchestration, and really the value in it for your customers. Steve and Jeff, we appreciate your time. >> Thank you. >> My pleasure, thanks. >> And we thank you for watching. This is season three, episode one of our ongoing series covering exciting startups from the AWS ecosystem. I'm your host, Lisa Martin. You're watching theCUBE, the leader in live tech coverage. (upbeat music)

Published Date : Mar 9 2023

SUMMARY :

of the AWS Startup Showcase let's give the audience and now it powers the data ecosystem What is the business impact or outcomes for the executives to consume how it applies to MLOps. and for me the interesting that you articulate to customers? So it's the ability to run it if you don't mind. that you can actually see as data flows the other thing to think about to more teams in the business. about that in the context of orchestration So talk to me a little bit at the backend to your So Steven, going back to you, just the ability to spin up but the time to repeatability a demo that you can share that allows me to come There's a lot that you guys We have a CLI that you can download What are some of the things, in the place that you're operating on and really the value in And we thank you for watching.

SENTIMENT ANALYSIS :

ENTITIES

EntityCategoryConfidence
JeffPERSON

0.99+

Lisa MartinPERSON

0.99+

Jeff FletcherPERSON

0.99+

StevenPERSON

0.99+

StevePERSON

0.99+

Steven HillionPERSON

0.99+

LisaPERSON

0.99+

EuropeLOCATION

0.99+

Conde NastORGANIZATION

0.99+

USLOCATION

0.99+

thousandsQUANTITY

0.99+

twoQUANTITY

0.99+

HIPAATITLE

0.99+

AWSORGANIZATION

0.99+

two guestsQUANTITY

0.99+

AirflowORGANIZATION

0.99+

AirbnbORGANIZATION

0.99+

10 thousandsQUANTITY

0.99+

OneQUANTITY

0.99+

Electronic ArtsORGANIZATION

0.99+

oneQUANTITY

0.99+

PythonTITLE

0.99+

two modesQUANTITY

0.99+

AirflowTITLE

0.98+

10,000 workflowsQUANTITY

0.98+

about 500 data tasksQUANTITY

0.98+

todayDATE

0.98+

one outcomeQUANTITY

0.98+

tens of thousandsQUANTITY

0.98+

GDPRTITLE

0.97+

SQLTITLE

0.97+

GitHubORGANIZATION

0.96+

astronomer.ioOTHER

0.94+

SlackORGANIZATION

0.94+

AstronomerORGANIZATION

0.94+

some years agoDATE

0.92+

once a weekQUANTITY

0.92+

AstronomerTITLE

0.92+

theCUBEORGANIZATION

0.92+

last yearDATE

0.91+

KubernetesTITLE

0.88+

single dayQUANTITY

0.87+

about 15,000 every dayQUANTITY

0.87+

one cloudQUANTITY

0.86+

IDETITLE

0.86+

Paola Peraza Calderon & Viraj Parekh, Astronomer | Cube Conversation


 

(soft electronic music) >> Hey everyone, welcome to this CUBE conversation as part of the AWS Startup Showcase, season three, episode one, featuring Astronomer. I'm your host, Lisa Martin. I'm in the CUBE's Palo Alto Studios, and today excited to be joined by a couple of guests, a couple of co-founders from Astronomer. Viraj Parekh is with us, as is Paola Peraza-Calderon. Thanks guys so much for joining us. Excited to dig into Astronomer. >> Thank you so much for having us. >> Yeah, thanks for having us. >> Yeah, and we're going to be talking about the role of data orchestration. Paola, let's go ahead and start with you. Give the audience that understanding, that context about Astronomer and what it is that you guys do. >> Mm-hmm. Yeah, absolutely. So, Astronomer is a, you know, we're a technology and software company for modern data orchestration, as you said, and we're the driving force behind Apache Airflow. The Open Source Workflow Management tool that's since been adopted by thousands and thousands of users, and we'll dig into this a little bit more. But, by data orchestration, we mean data pipeline, so generally speaking, getting data from one place to another, transforming it, running it on a schedule, and overall just building a central system that tangibly connects your entire ecosystem of data services, right. So what, that's Redshift, Snowflake, DVT, et cetera. And so tangibly, we build, we at Astronomer here build products powered by Apache Airflow for data teams and for data practitioners, so that they don't have to. So, we sell to data engineers, data scientists, data admins, and we really spend our time doing three things. So, the first is that we build Astro, our flagship cloud service that we'll talk more on. But here, we're really building experiences that make it easier for data practitioners to author, run, and scale their data pipeline footprint on the cloud. And then, we also contribute to Apache Airflow as an open source project and community. So, we cultivate the community of humans, and we also put out open source developer tools that actually make it easier for individual data practitioners to be productive in their day-to-day jobs, whether or not they actually use our product and and pay us money or not. And then of course, we also have professional services and education and all of these things around our commercial products that enable folks to use our products and use Airflow as effectively as possible. So yeah, super, super happy with everything we've done and hopefully that gives you an idea of where we're starting. >> Awesome, so when you're talking with those, Paola, those data engineers, those data scientists, how do you define data orchestration and what does it mean to them? >> Yeah, yeah, it's a good question. So, you know, if you Google data orchestration you're going to get something about an automated process for organizing silo data and making it accessible for processing and analysis. But, to your question, what does that actually mean, you know? So, if you look at it from a customer's perspective, we can share a little bit about how we at Astronomer actually do data orchestration ourselves and the problems that it solves for us. So, as many other companies out in the world do, we at Astronomer need to monitor how our own customers use our products, right? And so, we have a weekly meeting, for example, that goes through a dashboard and a dashboarding tool called Sigma where we see the number of monthly customers and how they're engaging with our product. But, to actually do that, you know, we have to use data from our application database, for example, that has behavioral data on what they're actually doing in our product. We also have data from third party API tools, like Salesforce and HubSpot, and other ways in which our customer, we actually engage with our customers and their behavior. And so, our data team internally at Astronomer uses a bunch of tools to transform and use that data, right? So, we use FiveTran, for example, to ingest. We use Snowflake as our data warehouse. We use other tools for data transformations. And even, if we at Astronomer don't do this, you can imagine a data team also using tools like, Monte Carlo for data quality, or Hightouch for Reverse ETL, or things like that. And, I think the point here is that data teams, you know, that are building data-driven organizations have a plethora of tooling to both ingest the right data and come up with the right interfaces to transform and actually, interact with that data. And so, that movement and sort of synchronization of data across your ecosystem is exactly what data orchestration is responsible for. Historically, I think, and Raj will talk more about this, historically, schedulers like KRON and Oozie or Control-M have taken a role here, but we think that Apache Airflow has sort of risen over the past few years as the defacto industry standard for writing data pipelines that do tasks, that do data jobs that interact with that ecosystem of tools in your organization. And so, beyond that sort of data pipeline unit, I think where we see it is that data acquisition is not only writing those data pipelines that move your data, but it's also all the things around it, right, so, CI/CD tool and Secrets Management, et cetera. So, a long-winded answer here, but I think that's how we talk about it here at Astronomer and how we're building our products. >> Excellent. Great context, Paola. Thank you. Viraj, let's bring you into the conversation. Every company these days has to be a data company, right? They've got to be a software company- >> Mm-hmm. >> whether it's my bank or my grocery store. So, how are companies actually doing data orchestration today, Viraj? >> Yeah, it's a great question. So, I think one thing to think about is like, on one hand, you know, data orchestration is kind of a new category that we're helping define, but on the other hand, it's something that companies have been doing forever, right? You need to get data moving to use it, you know. You've got it all in place, aggregate it, cleaning it, et cetera. So, when you look at what companies out there are doing, right. Sometimes, if you're a more kind of born in the cloud company, as we say, you'll adopt all these cloud native tooling things your cloud provider gives you. If you're a bank or another sort of institution like that, you know, you're probably juggling an even wider variety of tools. You're thinking about a cloud migration. You might have things like Kron running in one place, Uzi running somewhere else, Informatics running somewhere else, while you're also trying to move all your workloads to the cloud. So, there's quite a large spectrum of what the current state is for companies. And then, kind of like Paola was saying, Apache Airflow started in 2014, and it was actually started by Airbnb, and they put out this blog post that was like, "Hey here's how we use Apache Airflow to orchestrate our data across all their sources." And really since then, right, it's almost been a decade since then, Airflow emerged as the open source standard, and there's companies of all sorts using it. And, it's really used to tie all these tools together, especially as that number of tools increases, companies move to hybrid cloud, hybrid multi-cloud strategies, and so on and so forth. But you know, what we found is that if you go to any company, especially a larger one and you say like, "Hey, how are you doing data orchestration?" They'll probably say something like, "Well, I have five data teams, so I have eight different ways I do data orchestration." Right. This idea of data orchestration's been there but the right way to do it, kind of all the abstractions you need, the way your teams need to work together, and so on and so forth, hasn't really emerged just yet, right? It's such a quick moving space that companies have to combine what they were doing before with what their new business initiatives are today. So, you know, what we really believe here at Astronomer is Airflow is the core of how you solve data orchestration for any sort of use case, but it's not everything. You know, it needs a little more. And, that's really where our commercial product, Astro comes in, where we've built, not only the most tried and tested airflow experience out there. We do employ a majority of the Airflow Core Committers, right? So, we're kind of really deep in the project. We've also built the right things around developer tooling, observability, and reliability for customers to really rely on Astro as the heart of the way they do data orchestration, and kind of think of it as the foundational layer that helps tie together all the different tools, practices and teams large companies have to do today. >> That foundational layer is absolutely critical. You've both mentioned open source software. Paola, I want to go back to you, and just give the audience an understanding of how open source really plays into Astronomer's mission as a company, and into the technologies like Astro. >> Mm-hmm. Yeah, absolutely. I mean, we, so we at Astronomers started using Airflow and actually building our products because Airflow is open source and we were our own customers at the beginning of our company journey. And, I think the open source community is at the core of everything we do. You know, without that open source community and culture, I think, you know, we have less of a business, and so, we're super invested in continuing to cultivate and grow that. And, I think there's a couple sort of concrete ways in which we do this that personally make me really excited to do my own job. You know, for one, we do things like we organize meetups and we sponsor the Airflow Summit and there's these sort of baseline community efforts that I think are really important and that reminds you, hey, there just humans trying to do their jobs and learn and use both our technology and things that are out there and contribute to it. So, making it easier to contribute to Airflow, for example, is another one of our efforts. As Viraj mentioned, we also employ, you know, engineers internally who are on our team whose full-time job is to make the open source project better. Again, regardless of whether or not you're a customer of ours or not, we want to make sure that we continue to cultivate the Airflow project in and of itself. And, we're also building developer tooling that might not be a part of the Apache Open Source project, but is still open source. So, we have repositories in our own sort of GitHub organization, for example, with tools that individual data practitioners, again customers are not, can use to make them be more productive in their day-to-day jobs with Airflow writing Dags for the most common use cases out there. The last thing I'll say is how important I think we've found it to build sort of educational resources and documentation and best practices. Airflow can be complex. It's been around for a long time. There's a lot of really, really rich feature sets. And so, how do we enable folks to actually use those? And that comes in, you know, things like webinars, and best practices, and courses and curriculum that are free and accessible and open to the community are just some of the ways in which I think we're continuing to invest in that open source community over the next year and beyond. >> That's awesome. It sounds like open source is really core, not only to the mission, but really to the heart of the organization. Viraj, I want to go back to you and really try to understand how does Astronomer fit into the wider modern data stack and ecosystem? Like what does that look like for customers? >> Yeah, yeah. So, both in the open source and with our commercial customers, right? Folks everywhere are trying to tie together a huge variety of tools in order to start making sense of their data. And you know, I kind of think of it almost like as like a pyramid, right? At the base level, you need things like data reliability, data, sorry, data freshness, data availability, and so on and so forth, right? You just need your data to be there. (coughs) I'm sorry. You just need your data to be there, and you need to make it predictable when it's going to be there. You need to make sure it's kind of correct at the highest level, some quality checks, and so on and so forth. And oftentimes, that kind of takes the case of ELT or ETL use cases, right? Taking data from somewhere and moving it somewhere else, usually into some sort of analytics destination. And, that's really what businesses can do to just power the core parts of getting insights into how their business is going, right? How much revenue did I had? What's in my pipeline, salesforce, and so on and so forth. Once that kind of base foundation is there and people can get the data they need, how they need it, it really opens up a lot for what customers can do. You know, I think one of the trendier things out there right now is MLOps, and how do companies actually put machine learning into production? Well, when you think about it you kind of have to squint at it, right? Like, machine learning pipelines are really just any other data pipeline. They just have a certain set of needs that might not not be applicable to ELT pipelines. And, when you kind of have a common layer to tie together all the ways data can move through your organization, that's really what we're trying to make it so companies can do. And, that happens in financial services where, you know, we have some customers who take app data coming from their mobile apps, and actually run it through their fraud detection services to make sure that all the activity is not fraudulent. We have customers that will run sports betting models on our platform where they'll take data from a bunch of public APIs around different sporting events that are happening, transform all of that in a way their data scientist can build models with it, and then actually bet on sports based on that output. You know, one of my favorite use cases I like to talk about that we saw in the open source is we had there was one company whose their business was to deliver blood transfusions via drone into remote parts of the world. And, it was really cool because they took all this data from all sorts of places, right? Kind of orchestrated all the aggregation and cleaning and analysis that happened had to happen via airflow and the end product would be a drone being shot out into a real remote part of the world to actually give somebody blood who needed it there. Because it turns out for certain parts of the world, the easiest way to deliver blood to them is via drone and not via some other, some other thing. So, these kind of, all the things people do with the modern data stack is absolutely incredible, right? Like you were saying, every company's trying to be a data-driven company. What really energizes me is knowing that like, for all those best, super great tools out there that power a business, we get to be the connective tissue, or the, almost like the electricity that kind of ropes them all together and makes so people can actually do what they need to do. >> Right. Phenomenal use cases that you just described, Raj. I mean, just the variety alone of what you guys are able to do and impact is so cool. So Paola, when you're with those data engineers, those data scientists, and customer conversations, what's your pitch? Why use Astro? >> Mm-hmm. Yeah, yeah, it's a good question. And honestly, to piggyback off of Viraj, there's so many. I think what keeps me so energized is how mission critical both our product and data orchestration is, and those use cases really are incredible and we work with customers of all shapes and sizes. But, to answer your question, right, so why use Astra? Why use our commercial products? There's so many people using open source, why pay for something more than that? So, you know, the baseline for our business really is that Airflow has grown exponentially over the last five years, and like we said has become an industry standard that we're confident there's a huge opportunity for us as a company and as a team. But, we also strongly believe that being great at running Airflow, you know, doesn't make you a successful company at what you do. What makes you a successful company at what you do is building great products and solving problems and solving pin points of your own customers, right? And, that differentiating value isn't being amazing at running Airflow. That should be our job. And so, we want to abstract those customers from meaning to do things like manage Kubernetes infrastructure that you need to run Airflow, and then hiring someone full-time to go do that. Which can be hard, but again doesn't add differentiating value to your team, or to your product, or to your customers. So, folks to get away from managing that infrastructure sort of a base, a base layer. Folks who are looking for differentiating features that make their team more productive and allows them to spend less time tweaking Airflow configurations and more time working with the data that they're getting from their business. For help, getting, staying up with Airflow releases. There's a ton of, we've actually been pretty quick to come out with new Airflow features and releases, and actually just keeping up with that feature set and working strategically with a partner to help you make the most out of those feature sets is a key part of it. And, really it's, especially if you're an organization who currently is committed to using Airflow, you likely have a lot of Airflow environments across your organization. And, being able to see those Airflow environments in a single place and being able to enable your data practitioners to create Airflow environments with a click of a button, and then use, for example, our command line to develop your Airflow Dags locally and push them up to our product, and use all of the sort of testing and monitoring and observability that we have on top of our product is such a key. It sounds so simple, especially if you use Airflow, but really those things are, you know, baseline value props that we have for the customers that continue to be excited to work with us. And of course, I think we can go beyond that and there's, we have ambitions to add whole, a whole bunch of features and expand into different types of personas. >> Right? >> But really our main value prop is for companies who are committed to Airflow and want to abstract themselves and make use of some of the differentiating features that we now have at Astronomer. >> Got it. Awesome. >> Thank you. One thing, one thing I'll add to that, Paola, and I think you did a good job of saying is because every company's trying to be a data company, companies are at different parts of their journey along that, right? And we want to meet customers where they are, and take them through it to where they want to go. So, on one end you have folks who are like, "Hey, we're just building a data team here. We have a new initiative. We heard about Airflow. How do you help us out?" On the farther end, you know, we have some customers that have been using Airflow for five plus years and they're like, "Hey, this is awesome. We have 10 more teams we want to bring on. How can you help with this? How can we do more stuff in the open source with you? How can we tell our story together?" And, it's all about kind of taking this vast community of data users everywhere, seeing where they're at, and saying like, "Hey, Astro and Airflow can take you to the next place that you want to go." >> Which is incredibly- >> Mm-hmm. >> and you bring up a great point, Viraj, that every company is somewhere in a different place on that journey. And it's, and it's complex. But it sounds to me like a lot of what you're doing is really stripping away a lot of the complexity, really enabling folks to use their data as quickly as possible, so that it's relevant and they can serve up, you know, the right products and services to whoever wants what. Really incredibly important. We're almost out of time, but I'd love to get both of your perspectives on what's next for Astronomer. You give us a a great overview of what the company's doing, the value in it for customers. Paola, from your lens as one of the co-founders, what's next? >> Yeah, I mean, I think we'll continue to, I think cultivate in that open source community. I think we'll continue to build products that are open sourced as part of our ecosystem. I also think that we'll continue to build products that actually make Airflow, and getting started with Airflow, more accessible. So, sort of lowering that barrier to entry to our products, whether that's price wise or infrastructure requirement wise. I think making it easier for folks to get started and get their hands on our product is super important for us this year. And really it's about, I think, you know, for us, it's really about focused execution this year and all of the sort of core principles that we've been talking about. And continuing to invest in all of the things around our product that again, enable teams to use Airflow more effectively and efficiently. >> And that efficiency piece is, everybody needs that. Last question, Viraj, for you. What do you see in terms of the next year for Astronomer and for your role? >> Yeah, you know, I think Paola did a really good job of laying it out. So it's, it's really hard to disagree with her on anything, right? I think executing is definitely the most important thing. My own personal bias on that is I think more than ever it's important to really galvanize the community around airflow. So, we're going to be focusing on that a lot. We want to make it easier for our users to get get our product into their hands, be that open source users or commercial users. And last, but certainly not least, is we're also really excited about Data Lineage and this other open source project in our umbrella called Open Lineage to make it so that there's a standard way for users to get lineage out of different systems that they use. When we think about what's in store for data lineage and needing to audit the way automated decisions are being made. You know, I think that's just such an important thing that companies are really just starting with, and I don't think there's a solution that's emerged that kind of ties it all together. So, we think that as we kind of grow the role of Airflow, right, we can also make it so that we're helping solve, we're helping customers solve their lineage problems all in Astro, which is our kind of the best of both worlds for us. >> Awesome. I can definitely feel and hear the enthusiasm and the passion that you both bring to Astronomer, to your customers, to your team. I love it. We could keep talking more and more, so you're going to have to come back. (laughing) Viraj, Paola, thank you so much for joining me today on this showcase conversation. We really appreciate your insights and all the context that you provided about Astronomer. >> Thank you so much for having us. >> My pleasure. For my guests, I'm Lisa Martin. You're watching this Cube conversation. (soft electronic music)

Published Date : Feb 21 2023

SUMMARY :

to this CUBE conversation Thank you so much and what it is that you guys do. and hopefully that gives you an idea and the problems that it solves for us. to be a data company, right? So, how are companies actually kind of all the abstractions you need, and just give the And that comes in, you of the organization. and analysis that happened that you just described, Raj. that you need to run Airflow, that we now have at Astronomer. Awesome. and I think you did a good job of saying and you bring up a great point, Viraj, and all of the sort of core principles and for your role? and needing to audit the and all the context that you (soft electronic music)

SENTIMENT ANALYSIS :

ENTITIES

EntityCategoryConfidence
Viraj ParekhPERSON

0.99+

Lisa MartinPERSON

0.99+

PaolaPERSON

0.99+

VirajPERSON

0.99+

2014DATE

0.99+

AstronomerORGANIZATION

0.99+

Paola Peraza-CalderonPERSON

0.99+

Paola Peraza CalderonPERSON

0.99+

AirflowORGANIZATION

0.99+

AirbnbORGANIZATION

0.99+

five plus yearsQUANTITY

0.99+

AstroORGANIZATION

0.99+

RajPERSON

0.99+

UziORGANIZATION

0.99+

GoogleORGANIZATION

0.99+

firstQUANTITY

0.99+

bothQUANTITY

0.99+

todayDATE

0.99+

KronORGANIZATION

0.99+

10 more teamsQUANTITY

0.98+

AstronomersORGANIZATION

0.98+

AstraORGANIZATION

0.98+

oneQUANTITY

0.98+

AirflowTITLE

0.98+

InformaticsORGANIZATION

0.98+

Monte CarloTITLE

0.98+

this yearDATE

0.98+

HubSpotORGANIZATION

0.98+

one companyQUANTITY

0.97+

AstronomerTITLE

0.97+

next yearDATE

0.97+

ApacheORGANIZATION

0.97+

Airflow SummitEVENT

0.97+

AWSORGANIZATION

0.95+

both worldsQUANTITY

0.93+

KRONORGANIZATION

0.93+

CUBEORGANIZATION

0.92+

MORGANIZATION

0.92+

RedshiftTITLE

0.91+

SnowflakeTITLE

0.91+

five data teamsQUANTITY

0.91+

GitHubORGANIZATION

0.91+

OozieORGANIZATION

0.9+

Data LineageORGANIZATION

0.9+

AWS Startup Showcase S3E1


 

(upbeat electronic music) >> Hello everyone, welcome to this CUBE conversation here from the studios in the CUBE in Palo Alto, California. I'm John Furrier, your host. We're featuring a startup, Astronomer. Astronomer.io is the URL, check it out. And we're going to have a great conversation around one of the most important topics hitting the industry, and that is the future of machine learning and AI, and the data that powers it underneath it. There's a lot of things that need to get done, and we're excited to have some of the co-founders of Astronomer here. Viraj Parekh, who is co-founder of Astronomer, and Paola Peraza Calderon, another co-founder, both with Astronomer. Thanks for coming on. First of all, how many co-founders do you guys have? >> You know, I think the answer's around six or seven. I forget the exact, but there's really been a lot of people around the table who've worked very hard to get this company to the point that it's at. We have long ways to go, right? But there's been a lot of people involved that have been absolutely necessary for the path we've been on so far. >> Thanks for that, Viraj, appreciate that. The first question I want to get out on the table, and then we'll get into some of the details, is take a minute to explain what you guys are doing. How did you guys get here? Obviously, multiple co-founders, sounds like a great project. The timing couldn't have been better. ChatGPT has essentially done so much public relations for the AI industry to kind of highlight this shift that's happening. It's real, we've been chronicalizing, take a minute to explain what you guys do. >> Yeah, sure, we can get started. So, yeah, when Viraj and I joined Astronomer in 2017, we really wanted to build a business around data, and we were using an open source project called Apache Airflow that we were just using sort of as customers ourselves. And over time, we realized that there was actually a market for companies who use Apache Airflow, which is a data pipeline management tool, which we'll get into, and that running Airflow is actually quite challenging, and that there's a big opportunity for us to create a set of commercial products and an opportunity to grow that open source community and actually build a company around that. So the crux of what we do is help companies run data pipelines with Apache Airflow. And certainly we've grown in our ambitions beyond that, but that's sort of the crux of what we do for folks. >> You know, data orchestration, data management has always been a big item in the old classic data infrastructure. But with AI, you're seeing a lot more emphasis on scale, tuning, training. Data orchestration is the center of the value proposition, when you're looking at coordinating resources, it's one of the most important things. Can you guys explain what data orchestration entails? What does it mean? Take us through the definition of what data orchestration entails. >> Yeah, for sure. I can take this one, and Viraj, feel free to jump in. So if you google data orchestration, here's what you're going to get. You're going to get something that says, "Data orchestration is the automated process" "for organizing silo data from numerous" "data storage points, standardizing it," "and making it accessible and prepared for data analysis." And you say, "Okay, but what does that actually mean," right, and so let's give sort of an an example. So let's say you're a business and you have sort of the following basic asks of your data team, right? Okay, give me a dashboard in Sigma, for example, for the number of customers or monthly active users, and then make sure that that gets updated on an hourly basis. And then number two, a consistent list of active customers that I have in HubSpot so that I can send them a monthly product newsletter, right? Two very basic asks for all sorts of companies and organizations. And when that data team, which has data engineers, data scientists, ML engineers, data analysts get that request, they're looking at an ecosystem of data sources that can help them get there, right? And that includes application databases, for example, that actually have in product user behavior and third party APIs from tools that the company uses that also has different attributes and qualities of those customers or users. And that data team needs to use tools like Fivetran to ingest data, a data warehouse, like Snowflake or Databricks to actually store that data and do analysis on top of it, a tool like DBT to do transformations and make sure that data is standardized in the way that it needs to be, a tool like Hightouch for reverse ETL. I mean, we could go on and on. There's so many partners of ours in this industry that are doing really, really exciting and critical things for those data movements. And the whole point here is that data teams have this plethora of tooling that they use to both ingest the right data and come up with the right interfaces to transform and interact with that data. And data orchestration, in our view, is really the heartbeat of all of those processes, right? And tangibly the unit of data orchestration is a data pipeline, a set of tasks or jobs that each do something with data over time and eventually run that on a schedule to make sure that those things are happening continuously as time moves on and the company advances. And so, for us, we're building a business around Apache Airflow, which is a workflow management tool that allows you to author, run, and monitor data pipelines. And so when we talk about data orchestration, we talk about sort of two things. One is that crux of data pipelines that, like I said, connect that large ecosystem of data tooling in your company. But number two, it's not just that data pipeline that needs to run every day, right? And Viraj will probably touch on this as we talk more about Astronomer and our value prop on top of Airflow. But then it's all the things that you need to actually run data and production and make sure that it's trustworthy, right? So it's actually not just that you're running things on a schedule, but it's also things like CICD tooling, secure secrets management, user permissions, monitoring, data lineage, documentation, things that enable other personas in your data team to actually use those tools. So long-winded way of saying that it's the heartbeat, we think, of of the data ecosystem, and certainly goes beyond scheduling, but again, data pipelines are really at the center of it. >> One of the things that jumped out, Viraj, if you can get into this, I'd like to hear more about how you guys look at all those little tools that are out. You mentioned a variety of things. You look at the data infrastructure, it's not just one stack. You've got an analytic stack, you've got a realtime stack, you've got a data lake stack, you got an AI stack potentially. I mean you have these stacks now emerging in the data world that are fundamental, that were once served by either a full package, old school software, and then a bunch of point solution. You mentioned Fivetran there, I would say in the analytics stack. Then you got S3, they're on the data lake stack. So all these things are kind of munged together. >> Yeah. >> How do you guys fit into that world? You make it easier, or like, what's the deal? >> Great question, right? And you know, I think that one of the biggest things we've found in working with customers over the last however many years is that if a data team is using a bunch of tools to get what they need done, and the number of tools they're using is growing exponentially and they're kind of roping things together here and there, that's actually a sign of a productive team, not a bad thing, right? It's because that team is moving fast. They have needs that are very specific to them, and they're trying to make something that's exactly tailored to their business. So a lot of times what we find is that customers have some sort of base layer, right? That's kind of like, it might be they're running most of the things in AWS, right? And then on top of that, they'll be using some of the things AWS offers, things like SageMaker, Redshift, whatever, but they also might need things that their cloud can't provide. Something like Fivetran, or Hightouch, those are other tools. And where data orchestration really shines, and something that we've had the pleasure of helping our customers build, is how do you take all those requirements, all those different tools and whip them together into something that fulfills a business need? So that somebody can read a dashboard and trust the number that it says, or somebody can make sure that the right emails go out to their customers. And Airflow serves as this amazing kind of glue between that data stack, right? It's to make it so that for any use case, be it ELT pipelines, or machine learning, or whatever, you need different things to do them, and Airflow helps tie them together in a way that's really specific for a individual business' needs. >> Take a step back and share the journey of what you guys went through as a company startup. So you mentioned Apache, open source. I was just having an interview with a VC, we were talking about foundational models. You got a lot of proprietary and open source development going on. It's almost the iPhone/Android moment in this whole generative space and foundational side. This is kind of important, the open source piece of it. Can you share how you guys started? And I can imagine your customers probably have their hair on fire and are probably building stuff on their own. Are you guys helping them? Take us through, 'cause you guys are on the front end of a big, big wave, and that is to make sense of the chaos, rain it in. Take us through your journey and why this is important. >> Yeah, Paola, I can take a crack at this, then I'll kind of hand it over to you to fill in whatever I miss in details. But you know, like Paola is saying, the heart of our company is open source, because we started using Airflow as an end user and started to say like, "Hey wait a second," "more and more people need this." Airflow, for background, started at Airbnb, and they were actually using that as a foundation for their whole data stack. Kind of how they made it so that they could give you recommendations, and predictions, and all of the processes that needed orchestrated. Airbnb created Airflow, gave it away to the public, and then fast forward a couple years and we're building a company around it, and we're really excited about that. >> That's a beautiful thing. That's exactly why open source is so great. >> Yeah, yeah. And for us, it's really been about watching the community and our customers take these problems, find a solution to those problems, standardize those solutions, and then building on top of that, right? So we're reaching to a point where a lot of our earlier customers who started to just using Airflow to get the base of their BI stack down and their reporting in their ELP infrastructure, they've solved that problem and now they're moving on to things like doing machine learning with their data, because now that they've built that foundation, all the connective tissue for their data arriving on time and being orchestrated correctly is happening, they can build a layer on top of that. And it's just been really, really exciting kind of watching what customers do once they're empowered to pick all the tools that they need, tie them together in the way they need to, and really deliver real value to their business. >> Can you share some of the use cases of these customers? Because I think that's where you're starting to see the innovation. What are some of the companies that you're working with, what are they doing? >> Viraj, I'll let you take that one too. (group laughs) >> So you know, a lot of it is... It goes across the gamut, right? Because it doesn't matter what you are, what you're doing with data, it needs to be orchestrated. So there's a lot of customers using us for their ETL and ELT reporting, right? Just getting data from other disparate sources into one place and then building on top of that. Be it building dashboards, answering questions for the business, building other data products and so on and so forth. From there, these use cases evolve a lot. You do see folks doing things like fraud detection, because Airflow's orchestrating how transactions go, transactions get analyzed. They do things like analyzing marketing spend to see where your highest ROI is. And then you kind of can't not talk about all of the machine learning that goes on, right? Where customers are taking data about their own customers, kind of analyze and aggregating that at scale, and trying to automate decision making processes. So it goes from your most basic, what we call data plumbing, right? Just to make sure data's moving as needed, all the ways to your more exciting expansive use cases around automated decision making and machine learning. >> And I'd say, I mean, I'd say that's one of the things that I think gets me most excited about our future, is how critical Airflow is to all of those processes, and I think when you know a tool is valuable is when something goes wrong and one of those critical processes doesn't work. And we know that our system is so mission critical to answering basic questions about your business and the growth of your company for so many organizations that we work with. So it's, I think, one of the things that gets Viraj and I and the rest of our company up every single morning is knowing how important the work that we do for all of those use cases across industries, across company sizes, and it's really quite energizing. >> It was such a big focus this year at AWS re:Invent, the role of data. And I think one of the things that's exciting about the open AI and all the movement towards large language models is that you can integrate data into these models from outside. So you're starting to see the integration easier to deal with. Still a lot of plumbing issues. So a lot of things happening. So I have to ask you guys, what is the state of the data orchestration area? Is it ready for disruption? Has it already been disrupted? Would you categorize it as a new first inning kind of opportunity, or what's the state of the data orchestration area right now? Both technically and from a business model standpoint. How would you guys describe that state of the market? >> Yeah, I mean, I think in a lot of ways, in some ways I think we're category creating. Schedulers have been around for a long time. I released a data presentation sort of on the evolution of going from something like Kron, which I think was built in like the 1970s out of Carnegie Mellon. And that's a long time ago, that's 50 years ago. So sort of like the basic need to schedule and do something with your data on a schedule is not a new concept. But to our point earlier, I think everything that you need around your ecosystem, first of all, the number of data tools and developer tooling that has come out industry has 5X'd over the last 10 years. And so obviously as that ecosystem grows, and grows, and grows, and grows, the need for orchestration only increases. And I think, as Astronomer, I think we... And we work with so many different types of companies, companies that have been around for 50 years, and companies that got started not even 12 months ago. And so I think for us it's trying to, in a ways, category create and adjust sort of what we sell and the value that we can provide for companies all across that journey. There are folks who are just getting started with orchestration, and then there's folks who have such advanced use case, 'cause they're hitting sort of a ceiling and only want to go up from there. And so I think we, as a company, care about both ends of that spectrum, and certainly want to build and continue building products for companies of all sorts, regardless of where they are on the maturity curve of data orchestration. >> That's a really good point, Paola. And I think the other thing to really take into account is it's the companies themselves, but also individuals who have to do their jobs. If you rewind the clock like 5 or 10 years ago, data engineers would be the ones responsible for orchestrating data through their org. But when we look at our customers today, it's not just data engineers anymore. There's data analysts who sit a lot closer to the business, and the data scientists who want to automate things around their models. So this idea that orchestration is this new category is right on the money. And what we're finding is the need for it is spreading to all parts of the data team, naturally where Airflow's emerged as an open source standard and we're hoping to take things to the next level. >> That's awesome. We've been up saying that the data market's kind of like the SRE with servers, right? You're going to need one person to deal with a lot of data, and that's data engineering, and then you're got to have the practitioners, the democratization. Clearly that's coming in what you're seeing. So I have to ask, how do you guys fit in from a value proposition standpoint? What's the pitch that you have to customers, or is it more inbound coming into you guys? Are you guys doing a lot of outreach, customer engagements? I'm sure they're getting a lot of great requirements from customers. What's the current value proposition? How do you guys engage? >> Yeah, I mean, there's so many... Sorry, Viraj, you can jump in. So there's so many companies using Airflow, right? So the baseline is that the open source project that is Airflow that came out of Airbnb, over five years ago at this point, has grown exponentially in users and continues to grow. And so the folks that we sell to primarily are folks who are already committed to using Apache Airflow, need data orchestration in their organization, and just want to do it better, want to do it more efficiently, want to do it without managing that infrastructure. And so our baseline proposition is for those organizations. Now to Viraj's point, obviously I think our ambitions go beyond that, both in terms of the personas that we addressed and going beyond that data engineer, but really it's to start at the baseline, as we continue to grow our our company, it's really making sure that we're adding value to folks using Airflow and help them do so in a better way, in a larger way, in a more efficient way, and that's really the crux of who we sell to. And so to answer your question on, we get a lot of inbound because they're... >> You have a built in audience. (laughs) >> The world that use it. Those are the folks who we talk to and come to our website and chat with us and get value from our content. I mean, the power of the opensource community is really just so, so big, and I think that's also one of the things that makes this job fun. >> And you guys are in a great position. Viraj, you can comment a little, get your reaction. There's been a big successful business model to starting a company around these big projects for a lot of reasons. One is open source is continuing to be great, but there's also supply chain challenges in there. There's also we want to continue more innovation and more code and keeping it free and and flowing. And then there's the commercialization of productizing it, operationalizing it. This is a huge new dynamic, I mean, in the past 5 or so years, 10 years, it's been happening all on CNCF from other areas like Apache, Linux Foundation, they're all implementing this. This is a huge opportunity for entrepreneurs to do this. >> Yeah, yeah. Open source is always going to be core to what we do, because we wouldn't exist without the open source community around us. They are huge in numbers. Oftentimes they're nameless people who are working on making something better in a way that everybody benefits from it. But open source is really hard, especially if you're a company whose core competency is running a business, right? Maybe you're running an e-commerce business, or maybe you're running, I don't know, some sort of like, any sort of business, especially if you're a company running a business, you don't really want to spend your time figuring out how to run open source software. You just want to use it, you want to use the best of it, you want to use the community around it, you want to be able to google something and get answers for it, you want the benefits of open source. You don't have the time or the resources to invest in becoming an expert in open source, right? And I think that dynamic is really what's given companies like us an ability to kind of form businesses around that in the sense that we'll make it so people get the best of both worlds. You'll get this vast open ecosystem that you can build on top of, that you can benefit from, that you can learn from. But you won't have to spend your time doing undifferentiated heavy lifting. You can do things that are just specific to your business. >> It's always been great to see that business model evolve. We used a debate 10 years ago, can there be another Red Hat? And we said, not really the same, but there'll be a lot of little ones that'll grow up to be big soon. Great stuff. Final question, can you guys share the history of the company? The milestones of Astromer's journey in data orchestration? >> Yeah, we could. So yeah, I mean, I think, so Viraj and I have obviously been at Astronomer along with our other founding team and leadership folks for over five years now. And it's been such an incredible journey of learning, of hiring really amazing people, solving, again, mission critical problems for so many types of organizations. We've had some funding that has allowed us to invest in the team that we have and in the software that we have, and that's been really phenomenal. And so that investment, I think, keeps us confident, even despite these sort of macroeconomic conditions that we're finding ourselves in. And so honestly, the milestones for us are focusing on our product, focusing on our customers over the next year, focusing on that market for us that we know can get valuable out of what we do, and making developers' lives better, and growing the open source community and making sure that everything that we're doing makes it easier for folks to get started, to contribute to the project and to feel a part of the community that we're cultivating here. >> You guys raised a little bit of money. How much have you guys raised? >> Don't know what the total is, but it's in the ballpark over $200 million. It feels good to... >> A little bit of capital. Got a little bit of cap to work with there. Great success. I know as a Series C Financing, you guys have been down. So you're up and running, what's next? What are you guys looking to do? What's the big horizon look like for you from a vision standpoint, more hiring, more product, what is some of the key things you're looking at doing? >> Yeah, it's really a little of all of the above, right? Kind of one of the best and worst things about working at earlier stage startups is there's always so much to do and you often have to just kind of figure out a way to get everything done. But really investing our product over the next, at least over the course of our company lifetime. And there's a lot of ways we want to make it more accessible to users, easier to get started with, easier to use, kind of on all areas there. And really, we really want to do more for the community, right, like I was saying, we wouldn't be anything without the large open source community around us. And we want to figure out ways to give back more in more creative ways, in more code driven ways, in more kind of events and everything else that we can keep those folks galvanized and just keep them happy using Airflow. >> Paola, any final words as we close out? >> No, I mean, I'm super excited. I think we'll keep growing the team this year. We've got a couple of offices in the the US, which we're excited about, and a fully global team that will only continue to grow. So Viraj and I are both here in New York, and we're excited to be engaging with our coworkers in person finally, after years of not doing so. We've got a bustling office in San Francisco as well. So growing those teams and continuing to hire all over the world, and really focusing on our product and the open source community is where our heads are at this year. So, excited. >> Congratulations. 200 million in funding, plus. Good runway, put that money in the bank, squirrel it away. It's a good time to kind of get some good interest on it, but still grow. Congratulations on all the work you guys do. We appreciate you and the open source community does, and good luck with the venture, continue to be successful, and we'll see you at the Startup Showcase. >> Thank you. >> Yeah, thanks so much, John. Appreciate it. >> Okay, that's the CUBE Conversation featuring astronomer.io, that's the website. Astronomer is doing well. Multiple rounds of funding, over 200 million in funding. Open source continues to lead the way in innovation. Great business model, good solution for the next gen cloud scale data operations, data stacks that are emerging. I'm John Furrier, your host, thanks for watching. (soft upbeat music)

Published Date : Feb 14 2023

SUMMARY :

and that is the future of for the path we've been on so far. for the AI industry to kind of highlight So the crux of what we center of the value proposition, that it's the heartbeat, One of the things and the number of tools they're using of what you guys went and all of the processes That's a beautiful thing. all the tools that they need, What are some of the companies Viraj, I'll let you take that one too. all of the machine learning and the growth of your company that state of the market? and the value that we can provide and the data scientists that the data market's And so the folks that we sell to You have a built in audience. one of the things that makes this job fun. in the past 5 or so years, 10 years, that you can build on top of, the history of the company? and in the software that we have, How much have you guys raised? but it's in the ballpark What's the big horizon look like for you Kind of one of the best and worst things and continuing to hire the work you guys do. Yeah, thanks so much, John. for the next gen cloud

SENTIMENT ANALYSIS :

ENTITIES

EntityCategoryConfidence
Viraj ParekhPERSON

0.99+

PaolaPERSON

0.99+

VirajPERSON

0.99+

JohnPERSON

0.99+

John FurrierPERSON

0.99+

AirbnbORGANIZATION

0.99+

2017DATE

0.99+

San FranciscoLOCATION

0.99+

New YorkLOCATION

0.99+

ApacheORGANIZATION

0.99+

USLOCATION

0.99+

TwoQUANTITY

0.99+

AWSORGANIZATION

0.99+

Paola Peraza CalderonPERSON

0.99+

1970sDATE

0.99+

first questionQUANTITY

0.99+

Palo Alto, CaliforniaLOCATION

0.99+

iPhoneCOMMERCIAL_ITEM

0.99+

AirflowTITLE

0.99+

bothQUANTITY

0.99+

Linux FoundationORGANIZATION

0.99+

200 millionQUANTITY

0.99+

AstronomerORGANIZATION

0.99+

OneQUANTITY

0.99+

over 200 millionQUANTITY

0.99+

over $200 millionQUANTITY

0.99+

this yearDATE

0.99+

10 years agoDATE

0.99+

HubSpotORGANIZATION

0.98+

FivetranORGANIZATION

0.98+

50 years agoDATE

0.98+

over five yearsQUANTITY

0.98+

one stackQUANTITY

0.98+

12 months agoDATE

0.98+

10 yearsQUANTITY

0.97+

BothQUANTITY

0.97+

Apache AirflowTITLE

0.97+

both worldsQUANTITY

0.97+

CNCFORGANIZATION

0.97+

oneQUANTITY

0.97+

ChatGPTORGANIZATION

0.97+

5DATE

0.97+

next yearDATE

0.96+

AstromerORGANIZATION

0.96+

todayDATE

0.95+

5XQUANTITY

0.95+

over five years agoDATE

0.95+

CUBEORGANIZATION

0.94+

two thingsQUANTITY

0.94+

eachQUANTITY

0.93+

one personQUANTITY

0.93+

FirstQUANTITY

0.92+

S3TITLE

0.91+

Carnegie MellonORGANIZATION

0.91+

Startup ShowcaseEVENT

0.91+

AWS Startup Showcase S3E1


 

(soft music) >> Hello everyone, welcome to this Cube conversation here from the studios of theCube in Palo Alto, California. John Furrier, your host. We're featuring a startup, Astronomer, astronomer.io is the url. Check it out. And we're going to have a great conversation around one of the most important topics hitting the industry, and that is the future of machine learning and AI and the data that powers it underneath it. There's a lot of things that need to get done, and we're excited to have some of the co-founders of Astronomer here. Viraj Parekh, who is co-founder and Paola Peraza Calderon, another co-founder, both with Astronomer. Thanks for coming on. First of all, how many co-founders do you guys have? >> You know, I think the answer's around six or seven. I forget the exact, but there's really been a lot of people around the table, who've worked very hard to get this company to the point that it's at. And we have long ways to go, right? But there's been a lot of people involved that are, have been absolutely necessary for the path we've been on so far. >> Thanks for that, Viraj, appreciate that. The first question I want to get out on the table, and then we'll get into some of the details, is take a minute to explain what you guys are doing. How did you guys get here? Obviously, multiple co-founders sounds like a great project. The timing couldn't have been better. ChatGPT has essentially done so much public relations for the AI industry. Kind of highlight this shift that's happening. It's real. We've been chronologicalizing, take a minute to explain what you guys do. >> Yeah, sure. We can get started. So yeah, when Astronomer, when Viraj and I joined Astronomer in 2017, we really wanted to build a business around data and we were using an open source project called Apache Airflow, that we were just using sort of as customers ourselves. And over time, we realized that there was actually a market for companies who use Apache Airflow, which is a data pipeline management tool, which we'll get into. And that running Airflow is actually quite challenging and that there's a lot of, a big opportunity for us to create a set of commercial products and opportunity to grow that open source community and actually build a company around that. So the crux of what we do is help companies run data pipelines with Apache Airflow. And certainly we've grown in our ambitions beyond that, but that's sort of the crux of what we do for folks. >> You know, data orchestration, data management has always been a big item, you know, in the old classic data infrastructure. But with AI you're seeing a lot more emphasis on scale, tuning, training. You know, data orchestration is the center of the value proposition when you're looking at coordinating resources, it's one of the most important things. Could you guys explain what data orchestration entails? What does it mean? Take us through the definition of what data orchestration entails. >> Yeah, for sure. I can take this one and Viraj feel free to jump in. So if you google data orchestration, you know, here's what you're going to get. You're going to get something that says, data orchestration is the automated process for organizing silo data from numerous data storage points to organizing it and making it accessible and prepared for data analysis. And you say, okay, but what does that actually mean, right? And so let's give sort of an example. So let's say you're a business and you have sort of the following basic asks of your data team, right? Hey, give me a dashboard in Sigma, for example, for the number of customers or monthly active users and then make sure that that gets updated on an hourly basis. And then number two, a consistent list of active customers that I have in HubSpot so that I can send them a monthly product newsletter, right? Two very basic asks for all sorts of companies and organizations. And when that data team, which has data engineers, data scientists, ML engineers, data analysts get that request, they're looking at an ecosystem of data sources that can help them get there, right? And that includes application databases, for example, that actually have end product user behavior and third party APIs from tools that the company uses that also has different attributes and qualities of those customers or users. And that data team needs to use tools like Fivetran, to ingest data, a data warehouse like Snowflake or Databricks to actually store that data and do analysis on top of it, a tool like DBT to do transformations and make sure that that data is standardized in the way that it needs to be, a tool like Hightouch for reverse ETL. I mean, we could go on and on. There's so many partners of ours in this industry that are doing really, really exciting and critical things for those data movements. And the whole point here is that, you know, data teams have this plethora of tooling that they use to both ingest the right data and come up with the right interfaces to transform and interact with that data. And data orchestration in our view is really the heartbeat of all of those processes, right? And tangibly the unit of data orchestration, you know, is a data pipeline, a set of tasks or jobs that each do something with data over time and eventually run that on a schedule to make sure that those things are happening continuously as time moves on. And, you know, the company advances. And so, you know, for us, we're building a business around Apache Airflow, which is a workflow management tool that allows you to author, run and monitor data pipelines. And so when we talk about data orchestration, we talk about sort of two things. One is that crux of data pipelines that, like I said, connect that large ecosystem of data tooling in your company. But number two, it's not just that data pipeline that needs to run every day, right? And Viraj will probably touch on this as we talk more about Astronomer and our value prop on top of Airflow. But then it's all the things that you need to actually run data and production and make sure that it's trustworthy, right? So it's actually not just that you're running things on a schedule, but it's also things like CI/CD tooling, right? Secure secrets management, user permissions, monitoring, data lineage, documentation, things that enable other personas in your data team to actually use those tools. So long-winded way of saying that, it's the heartbeat that we think of the data ecosystem and certainly goes beyond scheduling, but again, data pipelines are really at the center of it. >> You know, one of the things that jumped out Viraj, if you can get into this, I'd like to hear more about how you guys look at all those little tools that are out there. You mentioned a variety of things. You know, if you look at the data infrastructure, it's not just one stack. You've got an analytic stack, you've got a realtime stack, you've got a data lake stack, you got an AI stack potentially. I mean you have these stacks now emerging in the data world that are >> Yeah. - >> fundamental, but we're once served by either a full package, old school software, and then a bunch of point solution. You mentioned Fivetran there, I would say in the analytics stack. Then you got, you know, S3, they're on the data lake stack. So all these things are kind of munged together. >> Yeah. >> How do you guys fit into that world? You make it easier or like, what's the deal? >> Great question, right? And you know, I think that one of the biggest things we've found in working with customers over, you know, the last however many years, is that like if a data team is using a bunch of tools to get what they need done and the number of tools they're using is growing exponentially and they're kind of roping things together here and there, that's actually a sign of a productive team, not a bad thing, right? It's because that team is moving fast. They have needs that are very specific to them and they're trying to make something that's exactly tailored to their business. So a lot of times what we find is that customers have like some sort of base layer, right? That's kind of like, you know, it might be they're running most of the things in AWS, right? And then on top of that, they'll be using some of the things AWS offers, you know, things like SageMaker, Redshift, whatever. But they also might need things that their Cloud can't provide, you know, something like Fivetran or Hightouch or anything of those other tools and where data orchestration really shines, right? And something that we've had the pleasure of helping our customers build, is how do you take all those requirements, all those different tools and whip them together into something that fulfills a business need, right? Something that makes it so that somebody can read a dashboard and trust the number that it says or somebody can make sure that the right emails go out to their customers. And Airflow serves as this amazing kind of glue between that data stack, right? It's to make it so that for any use case, be it ELT pipelines or machine learning or whatever, you need different things to do them and Airflow helps tie them together in a way that's really specific for a individual business's needs. >> Take a step back and share the journey of what your guys went through as a company startup. So you mentioned Apache open source, you know, we were just, I was just having an interview with the VC, we were talking about foundational models. You got a lot of proprietary and open source development going on. It's almost the iPhone, Android moment in this whole generative space and foundational side. This is kind of important, the open source piece of it. Can you share how you guys started? And I can imagine your customers probably have their hair on fire and are probably building stuff on their own. How do you guys, are you guys helping them? Take us through, 'cuz you guys are on the front end of a big, big wave and that is to make sense of the chaos, reigning it in. Take us through your journey and why this is important. >> Yeah Paola, I can take a crack at this and then I'll kind of hand it over to you to fill in whatever I miss in details. But you know, like Paola is saying, the heart of our company is open source because we started using Airflow as an end user and started to say like, "Hey wait a second". Like more and more people need this. Airflow, for background, started at Airbnb and they were actually using that as the foundation for their whole data stack. Kind of how they made it so that they could give you recommendations and predictions and all of the processes that need to be or needed to be orchestrated. Airbnb created Airflow, gave it away to the public and then, you know, fast forward a couple years and you know, we're building a company around it and we're really excited about that. >> That's a beautiful thing. That's exactly why open source is so great. >> Yeah, yeah. And for us it's really been about like watching the community and our customers take these problems, find solution to those problems, build standardized solutions, and then building on top of that, right? So we're reaching to a point where a lot of our earlier customers who started to just using Airflow to get the base of their BI stack down and their reporting and their ELP infrastructure, you know, they've solved that problem and now they're moving onto things like doing machine learning with their data, right? Because now that they've built that foundation, all the connective tissue for their data arriving on time and being orchestrated correctly is happening, they can build the layer on top of that. And it's just been really, really exciting kind of watching what customers do once they're empowered to pick all the tools that they need, tie them together in the way they need to, and really deliver real value to their business. >> Can you share some of the use cases of these customers? Because I think that's where you're starting to see the innovation. What are some of the companies that you're working with, what are they doing? >> Raj, I'll let you take that one too. (all laughing) >> Yeah. (all laughing) So you know, a lot of it is, it goes across the gamut, right? Because all doesn't matter what you are, what you're doing with data, it needs to be orchestrated. So there's a lot of customers using us for their ETL and ELT reporting, right? Just getting data from all the disparate sources into one place and then building on top of that, be it building dashboards, answering questions for the business, building other data products and so on and so forth. From there, these use cases evolve a lot. You do see folks doing things like fraud detection because Airflow's orchestrating how transactions go. Transactions get analyzed, they do things like analyzing marketing spend to see where your highest ROI is. And then, you know, you kind of can't not talk about all of the machine learning that goes on, right? Where customers are taking data about their own customers kind of analyze and aggregating that at scale and trying to automate decision making processes. So it goes from your most basic, what we call like data plumbing, right? Just to make sure data's moving as needed. All the ways to your more exciting and sexy use cases around like automated decision making and machine learning. >> And I'd say, I mean, I'd say that's one of the things that I think gets me most excited about our future is how critical Airflow is to all of those processes, you know? And I think when, you know, you know a tool is valuable is when something goes wrong and one of those critical processes doesn't work. And we know that our system is so mission critical to answering basic, you know, questions about your business and the growth of your company for so many organizations that we work with. So it's, I think one of the things that gets Viraj and I, and the rest of our company up every single morning, is knowing how important the work that we do for all of those use cases across industries, across company sizes. And it's really quite energizing. >> It was such a big focus this year at AWS re:Invent, the role of data. And I think one of the things that's exciting about the open AI and all the movement towards large language models, is that you can integrate data into these models, right? From outside, right? So you're starting to see the integration easier to deal with, still a lot of plumbing issues. So a lot of things happening. So I have to ask you guys, what is the state of the data orchestration area? Is it ready for disruption? Is it already been disrupted? Would you categorize it as a new first inning kind of opportunity or what's the state of the data orchestration area right now? Both, you know, technically and from a business model standpoint, how would you guys describe that state of the market? >> Yeah, I mean I think, I think in a lot of ways we're, in some ways I think we're categoric rating, you know, schedulers have been around for a long time. I recently did a presentation sort of on the evolution of going from, you know, something like KRON, which I think was built in like the 1970s out of Carnegie Mellon. And you know, that's a long time ago. That's 50 years ago. So it's sort of like the basic need to schedule and do something with your data on a schedule is not a new concept. But to our point earlier, I think everything that you need around your ecosystem, first of all, the number of data tools and developer tooling that has come out the industry has, you know, has some 5X over the last 10 years. And so obviously as that ecosystem grows and grows and grows and grows, the need for orchestration only increases. And I think, you know, as Astronomer, I think we, and there's, we work with so many different types of companies, companies that have been around for 50 years and companies that got started, you know, not even 12 months ago. And so I think for us, it's trying to always category create and adjust sort of what we sell and the value that we can provide for companies all across that journey. There are folks who are just getting started with orchestration and then there's folks who have such advanced use case 'cuz they're hitting sort of a ceiling and only want to go up from there. And so I think we as a company, care about both ends of that spectrum and certainly have want to build and continue building products for companies of all sorts, regardless of where they are on the maturity curve of data orchestration. >> That's a really good point Paola. And I think the other thing to really take into account is it's the companies themselves, but also individuals who have to do their jobs. You know, if you rewind the clock like five or 10 years ago, data engineers would be the ones responsible for orchestrating data through their org. But when we look at our customers today, it's not just data engineers anymore. There's data analysts who sit a lot closer to the business and the data scientists who want to automate things around their models. So this idea that orchestration is this new category is spot on, is right on the money. And what we're finding is it's spreading, the need for it, is spreading to all parts of the data team naturally where Airflows have emerged as an open source standard and we're hoping to take things to the next level. >> That's awesome. You know, we've been up saying that the data market's kind of like the SRE with servers, right? You're going to need one person to deal with a lot of data and that's data engineering and then you're going to have the practitioners, the democratization. Clearly that's coming in what you're seeing. So I got to ask, how do you guys fit in from a value proposition standpoint? What's the pitch that you have to customers or is it more inbound coming into you guys? Are you guys doing a lot of outreach, customer engagements? I'm sure they're getting a lot of great requirements from customers. What's the current value proposition? How do you guys engage? >> Yeah, I mean we've, there's so many, there's so many. Sorry Raj, you can jump in. - >> It's okay. So there's so many companies using Airflow, right? So our, the baseline is that the open source project that is Airflow that was, that came out of Airbnb, you know, over five years ago at this point, has grown exponentially in users and continues to grow. And so the folks that we sell to primarily are folks who are already committed to using Apache Airflow, need data orchestration in the organization and just want to do it better, want to do it more efficiently, want to do it without managing that infrastructure. And so our baseline proposition is for those organizations. Now to Raj's point, obviously I think our ambitions go beyond that, both in terms of the personas that we addressed and going beyond that data engineer, but really it's for, to start at the baseline. You know, as we continue to grow our company, it's really making sure that we're adding value to folks using Airflow and help them do so in a better way, in a larger way and a more efficient way. And that's really the crux of who we sell to. And so to answer your question on, we actually, we get a lot of inbound because they're are so many - >> A built-in audience. >> In the world that use it, that those are the folks who we talk to and come to our website and chat with us and get value from our content. I mean the power of the open source community is really just so, so big. And I think that's also one of the things that makes this job fun, so. >> And you guys are in a great position, Viraj, you can comment, to get your reaction. There's been a big successful business model to starting a company around these big projects for a lot of reasons. One is open source is continuing to be great, but there's also supply chain challenges in there. There's also, you know, we want to continue more innovation and more code and keeping it free and and flowing. And then there's the commercialization of product-izing it, operationalizing it. This is a huge new dynamic. I mean, in the past, you know, five or so years, 10 years, it's been happening all on CNCF from other areas like Apache, Linux Foundation, they're all implementing this. This is a huge opportunity for entrepreneurs to do this. >> Yeah, yeah. Open source is always going to be core to what we do because, you know, we wouldn't exist without the open source community around us. They are huge in numbers. Oftentimes they're nameless people who are working on making something better in a way that everybody benefits from it. But open source is really hard, especially if you're a company whose core competency is running a business, right? Maybe you're running e-commerce business or maybe you're running, I don't know, some sort of like any sort of business, especially if you're a company running a business, you don't really want to spend your time figuring out how to run open source software. You just want to use it, you want to use the best of it, you want to use the community around it. You want to take, you want to be able to google something and get answers for it. You want the benefits of open source. You don't want to have, you don't have the time or the resources to invest in becoming an expert in open source, right? And I think that dynamic is really what's given companies like us an ability to kind of form businesses around that, in the sense that we'll make it so people get the best of both worlds. You'll get this vast open ecosystem that you can build on top of, you can benefit from, that you can learn from, but you won't have to spend your time doing undifferentiated heavy lifting. You can do things that are just specific to your business. >> It's always been great to see that business model evolved. We used to debate 10 years ago, can there be another red hat? And we said, not really the same, but there'll be a lot of little ones that'll grow up to be big soon. Great stuff. Final question, can you guys share the history of the company, the milestones of the Astronomer's journey in data orchestration? >> Yeah, we could. So yeah, I mean, I think, so Raj and I have obviously been at astronomer along with our other founding team and leadership folks, for over five years now. And it's been such an incredible journey of learning, of hiring really amazing people. Solving again, mission critical problems for so many types of organizations. You know, we've had some funding that has allowed us to invest in the team that we have and in the software that we have. And that's been really phenomenal. And so that investment, I think, keeps us confident even despite these sort of macroeconomic conditions that we're finding ourselves in. And so honestly, the milestones for us are focusing on our product, focusing on our customers over the next year, focusing on that market for us, that we know can get value out of what we do. And making developers' lives better and growing the open source community, you know, and making sure that everything that we're doing makes it easier for folks to get started to contribute to the project and to feel a part of the community that we're cultivating here. >> You guys raised a little bit of money. How much have you guys raised? >> I forget what the total is, but it's in the ballpark of 200, over $200 million. So it feels good - >> A little bit of capital. Got a little bit of cash to work with there. Great success. I know it's a Series C financing, you guys been down, so you're up and running. What's next? What are you guys looking to do? What's the big horizon look like for you? And from a vision standpoint, more hiring, more product, what is some of the key things you're looking at doing? >> Yeah, it's really a little of all of the above, right? Like, kind of one of the best and worst things about working at earlier stage startups is there's always so much to do and you often have to just kind of figure out a way to get everything done, but really invest in our product over the next, at least the next, over the course of our company lifetime. And there's a lot of ways we wanting to just make it more accessible to users, easier to get started with, easier to use all kind of on all areas there. And really, we really want to do more for the community, right? Like I was saying, we wouldn't be anything without the large open source community around us. And we want to figure out ways to give back more in more creative ways, in more code driven ways and more kind of events and everything else that we can do to keep those folks galvanized and just keeping them happy using Airflow. >> Paola, any final words as we close out? >> No, I mean, I'm super excited. You know, I think we'll keep growing the team this year. We've got a couple of offices in the US which we're excited about, and a fully global team that will only continue to grow. So Viraj and I are both here in New York and we're excited to be engaging with our coworkers in person. Finally, after years of not doing so, we've got a bustling office in San Francisco as well. So growing those teams and continuing to hire all over the world and really focusing on our product and the open source community is where our heads are at this year, so. >> Congratulations. - >> Excited. 200 million in funding plus good runway. Put that money in the bank, squirrel it away. You know, it's good to kind of get some good interest on it, but still grow. Congratulations on all the work you guys do. We appreciate you and the open sourced community does and good luck with the venture. Continue to be successful and we'll see you at the Startup Showcase. >> Thank you. - >> Yeah, thanks so much, John. Appreciate it. - >> It's theCube conversation, featuring astronomer.io, that's the website. Astronomer is doing well. Multiple rounds of funding, over 200 million in funding. Open source continues to lead the way in innovation. Great business model. Good solution for the next gen, Cloud, scale, data operations, data stacks that are emerging. I'm John Furrier, your host. Thanks for watching. (soft music)

Published Date : Feb 8 2023

SUMMARY :

and that is the future of for the path we've been on so far. take a minute to explain what you guys do. and that there's a lot of, of the value proposition And that data team needs to use tools You know, one of the and then a bunch of point solution. and the number of tools they're using and that is to make sense of the chaos, and all of the processes that need to be That's a beautiful thing. you know, they've solved that problem What are some of the companies Raj, I'll let you take that one too. And then, you know, and the growth of your company So I have to ask you guys, and companies that got started, you know, and the data scientists that the data market's kind of you can jump in. And so the folks that we and come to our website and chat with us I mean, in the past, you to what we do because, you history of the company, and in the software that we have. How much have you guys raised? but it's in the ballpark What are you guys looking to do? and you often have to just kind of and the open source community the work you guys do. Yeah, thanks so much, John. that's the website.

SENTIMENT ANALYSIS :

ENTITIES

EntityCategoryConfidence
Viraj ParekhPERSON

0.99+

PaolaPERSON

0.99+

VirajPERSON

0.99+

John FurrierPERSON

0.99+

JohnPERSON

0.99+

RajPERSON

0.99+

AirbnbORGANIZATION

0.99+

USLOCATION

0.99+

2017DATE

0.99+

New YorkLOCATION

0.99+

Paola Peraza CalderonPERSON

0.99+

AWSORGANIZATION

0.99+

ApacheORGANIZATION

0.99+

San FranciscoLOCATION

0.99+

Palo Alto, CaliforniaLOCATION

0.99+

1970sDATE

0.99+

10 yearsQUANTITY

0.99+

fiveQUANTITY

0.99+

TwoQUANTITY

0.99+

first questionQUANTITY

0.99+

over 200 millionQUANTITY

0.99+

bothQUANTITY

0.99+

BothQUANTITY

0.99+

over $200 millionQUANTITY

0.99+

Linux FoundationORGANIZATION

0.99+

50 years agoDATE

0.99+

oneQUANTITY

0.99+

fiveDATE

0.99+

iPhoneCOMMERCIAL_ITEM

0.99+

this yearDATE

0.98+

OneQUANTITY

0.98+

AirflowTITLE

0.98+

10 years agoDATE

0.98+

Carnegie MellonORGANIZATION

0.98+

over five yearsQUANTITY

0.98+

200QUANTITY

0.98+

12 months agoDATE

0.98+

both worldsQUANTITY

0.98+

5XQUANTITY

0.98+

ChatGPTORGANIZATION

0.98+

firstQUANTITY

0.98+

one stackQUANTITY

0.97+

one personQUANTITY

0.97+

two thingsQUANTITY

0.97+

FivetranORGANIZATION

0.96+

sevenQUANTITY

0.96+

next yearDATE

0.96+

todayDATE

0.95+

50 yearsQUANTITY

0.95+

eachQUANTITY

0.95+

theCubeORGANIZATION

0.94+

HubSpotORGANIZATION

0.93+

SigmaORGANIZATION

0.92+

Series COTHER

0.92+

AstronomerORGANIZATION

0.91+

astronomer.ioOTHER

0.91+

HightouchTITLE

0.9+

one placeQUANTITY

0.9+

AndroidTITLE

0.88+

Startup ShowcaseEVENT

0.88+

Apache AirflowTITLE

0.86+

CNCFORGANIZATION

0.86+

Darren Chinen, Malwarebytes - Big Data SV 17 - #BigDataSV - #theCUBE


 

>> Announcer: Live from San Jose, California, it's The Cube, covering Big Data Silicon Valley 2017. >> Hey, welcome back everybody. Jeff Frick here with The Cube. We are at Big Data SV in San Jose at the Historic Pagoda Lounge, part of Big Data week which is associated with Strata + Hadoop. We've been coming here for eight years and we're excited to be back. The innovation and dynamicism of big data and evolutions now with machine learning and artificial intelligence, just continues to roll, and we're really excited to be here talking about one of the nasty aspects of this world, unfortunately, malware. So we're excited to have Darren Chinen. He's the senior director of data science and engineering from Malwarebytes. Darren, welcome. >> Darren: Thank you. >> So for folks that aren't familiar with the company, give us just a little bit of background on Malwarebytes. >> So Malwarebytes is basically a next-generation anti-virus software. We started off as humble roots with our founder at 14 years old getting infected with a piece of malware, and he reached out into the community and, at 14 years old, wrote his first, with the help of some people, wrote his first lines of code to remediate a couple of pieces of malware. It grew from there and I think by the ripe old age of 18, founded the company. And he's now I want to say 26 or 27 and we're doing quite well. >> It was interesting, before we went live you were talking about his philosophy and how important that is to the company and now has turned into really a strategic asset, that no one should have to suffer from malware, and he decided to really offer a solution for free to help people rid themselves of this bad software. >> Darren: That's right. Yeah, so Malwarebytes was founded under the principle that Marcin believes that everyone has the right to a malware-free existence and so we've always offered a free version Malwarebytes that will help you to remediate if your machine does get infected with a piece of malware. And that's actually still going to this day. >> And that's now given you the ability to have a significant amount of inpoint data, transactional data, trend data, that now you can bake back into the solution. >> Darren: That's right. It's turned into a strategic advantage for the company, it's not something I don't think that we could have planned at 18 years old when he was doing this. But we've instrumented it so that we can get some anonymous-level telemetry and we can understand how malware proliferates. For many, many years we've been positioned as a second-opinion scanner and so we're able to see a lot of things, some trends happening in there and we can actually now see that in real time. >> So, starting out as a second-position scanner, you're basically looking at, you're finding what others have missed. And how can you, what do you have to do to become the first line of defense? >> Well, with our new product Malwarebytes 3.0, I think some of that landscape is changing. We have a very complete and layered offering. I'm not the product manager, so I don't think, as the data science guy, I don't know that I'm qualified to give you the ins and outs, but I think some of that is changing as we have, we've combined a lot of products and we have a much more complete sweep of layered protection built into the product. >> And so, maybe tell us, without giving away all the secret sauce, what sort of platform technologies did you use that enabled you to scale to these hundreds of millions of in points, and then to be fast enough at identifying things that were trending that are bad that you had to prioritize? >> Right, so traditionally, I think AV companies, they have these honeypots, right, where they go and the collect a piece of virus or a piece of malware, and they'll take the MD5 hash of that and then they'll basically insert that into a definition's database. And that's a very exact way to do it. The problem is is that there's so much malware or viruses out there in the wild, it's impossible to get all of them. I think one of the things that we did was we set up telemetry and we have a phenomenal research team where we're able to actually have our team catch entire families of malware, and that's really the secret sauce to Malwarebytes. There's several other levels but that's where we're helping out in the immediate term. What we do is we have, internally, we sort of jokingly call it a Lambda Two architecture. We had considered Lambda long ago, long ago and I say about a year ago when we first started this journey. But there's, Lambda is riddled with, as you know, a number of issues. If you've ever talked to Jay Kreps from Confluent, he has a lot of opinions on that, right? And one of the key problems with that is, that if you do a traditional Lambda, you have to implement your code in two places, it's very difficult, things get out of sync, you have to have replay frameworks. And these are some of the challenges with Lambda. So we do processing in a number of areas. The first thing that we did was we implemented Kafka to handle all of the streaming data. We use Kafka streams to do inline stateless transformations and then we also use Kafka Connect. And we write all of our data both into HBase, we use that, we may swap that out later for something like Redis, and that would be a thin speed layer. And then we also move the data into S3 and we use some ephemeral clusters to do very large-scale batch processing, and that really provides our data lab. >> When you call that Lambda Two, is that because you're still working essentially on two different infrastructures, so your code isn't quite the same? You still have to check the results on either on either fork. >> That's right, yeah, we didn't feel like it was, we did evaluate doing everything in the stream. But there are certain operations that are difficult to do with purely streamed processing, and so we did need a little bit, we did need to have a thin, what we call real time indicators, a speed layer, to supplement what we were doing in the stream. And so that's the differentiating factor between a traditional Lambda architecture where you'd want to have everything in the stream and everything in batch, and the batch is really more of a truing mechanism as opposed to, our real time is really directional, so in the traditional sense, if you look at traditional business intelligence, you'd have KPIs that would allow you to gauge the health of your business. We have RTIs, Real Time Indicators, that allow us to gauge directionally, what is important to look at this day, this hour, this minute? >> This thing is burning up the charts, >> Exactly. >> Therefore it's priority one. >> That's right, you got it. >> Okay. And maybe tell us a little more, because everyone I'm sure is familiar with Kafka but the streams product from them is a little newer as is Kafka Connect, so it sounds like you've got, it's not just the transport, but you've got some basic analytics and you've got the ability to do the ETL because you've got Connect that comes from sources and destinations, sources and syncs. Tell us how you've used that. >> Well, the streams product is, it's quite different than something like Spark Streaming. It's not working off micro-batching, it's actually working off the stream. And the second thing is, it's not a separate cluster. It's just a library, effectively a .jar file, right? And so because it works natively with Kafka, it handles certain things there quite well. It handles back pressure and when you expand the cluster, it's pretty good with things like that. We've found it to be a fairly stable technology. It's just a library and we've worked very closely with Confluent to develop that. Whereas Kafka Connect is really something that we use to write out to S3. In fact, Confluent just released a new, an S3 connector direct. We were using Stream X, which was a wrapper on top of an HDFS connector and they rigged that up to write to S3 for us. >> So tell us, as you look out, what sorts of technologies do you see as enabling you to build a platform that's richer, and then how would that show up in the functionality consumers like we would see? >> Darren: With respect to the architecture? >> Yeah. >> Well one of the things that we had to do is we had to evaluate where we wanted to spend our time. We're a very small team, the entire data science and engineering team is less than I think 10 months old. So all of us got hired, we've started this platform, we've gone very, very fast. And we had to decide, how are we going to, a, get, we've made this big investment, how are we going to get value to our end customer quickly, so that they're not waiting around and you get the traditional big-data story where, we've spent all this money and now we're not getting anything out of it. And so we had to make some of those strategic decisions and because of the fact that the data was really truly big data in nature, there's just a huge amount of work that has to be done in these open-source technologies. They're not baked, it's not like going out to Oracle and giving them a purchase order and you install it and away you go. There's a tremendous amount of work, and so we've made some strategic decisions on what we're going to do in open-source and what we're going to do with a third-party vendor solution. And one of those solutions that we decided was workload automation. So I just did a talk on this about how Control-M from BMC was really the tool that we chose to handle a lot of the coordination, the sophisticated coordination, and the workload automation on the batch side, and we're about to implement that in a data-quality monitoring framework. And that's turned out to be an incredibly stable solution for us. It's allowed us to not spend time with open-source solutions that do the same things like Airflow, which may or may not work well, but there's really no support around that, and focus our efforts on what we believe to be the really, really hard problems to tackle in Kafka, Kafka Streams, Connect, et cetera. >> Is it fair to say that Kafka plus Kafka Connect solves many of the old ETL problems or do you still need some sort of orchestration tool on top of it to completely commoditize, essentially moving and transforming data from OLTP or operational system to a decision support system? >> I guess the answer to that is, it depends on your use case. I think there's a lot of things that Kafka and the stream's job can solve for you, but I don't think that we're at the point where everything can be streaming. I think that's a ways off. There's legacy systems that really don't natively stream to you anyway, and there's just certain operations that are just more efficient to do in batch. And so that's why we've, I don't think batch for us is going away any time soon and that's one of the reasons why workload automation in the batch layer initially was so important and we've decided to extend that, actually, into building out a data-quality monitoring framework to put a collar around how accurate our data is on the real-time side. >> Cuz it's really horses for courses, it's not one or the other, it's application-specific, what's the best solution for that particular is. >> Yeah, I don't think that there's, if there was a one-size-fits-all it'd be a company, and there would be no need for architects, so I think that you have to look at your use case, your company, what kind of data, what style of data, what type of analysis do you need. Do you really actually need the data in real time and if you do put in all the work to get it in real time, are you going to be able to take action on it? And I think Malwarebytes was a great candidate. When it came in, I said, "Well, it does look like we can justify "the need for real time data, and the effort "that goes into building out a real-time framework." >> Jeff: Right, right. And we always say, what is real time? In time to do something about it, (all chuckle) and if there's not time to do something about it, depending on how you define real time, really what difference does it make if you can't do anything about it that fast. So as you look out in the future with IoT, all these connected devices, this is a hugely increased attack surface as we just read our essay a few weeks back. How does that work into your planning? What do you guys think about the future where there's so many more connected devices out on the edge and various degrees of intelligence and opportunities to hi-jack, if you will? >> Yeah, I think, I don't think I'm qualified to speak about the Malwarebytes product roadmap as far as IoT goes. >> But more philosophically, from a professional point of view, cuz every coin has two sides, there's a lot of good stuff coming from IoT and connected devices, but as we keep hearing over and over, just this massive attack surface expansion. >> Well I think, for us, the key is we're small and we're not operating, like I came from Apple where we operated on a budget of infinity, so we're not-- >> Have to build the infinity or the address infinity (Darren laughs) with an actual budget. >> We're small and we have to make sure that whatever we do creates value. And so what I'm seeing in the future is, as we get more into the IoT space and logs begin to proliferate and data just exponentiates in size, it's really how do we do the same thing and how are we going to manage that in terms of cost? Generally, big data is very low in information density. It's not like transactional systems where you get the data, it's effectively an Excel spreadsheet and you can go run some pivot tables and filters and away you go. I think big data in general requires a tremendous amount of massaging to get to the point where a data scientist or an analyst can actually extract some insight and some value. And the question is, how do you massage that data in a way that's going to be cost-effective as IoT expands and proliferates? So that's the question that we're dealing with. We're, at this point, all in with cloud technologies, we're leveraging quite a few of Amazon services, server-less technologies as well. We just are in the process of moving to the Athena, to Athena, as just an on-demand query service. And we use a lot of ephemeral clusters as well, and that allows us to actually run all of our ETL in about two hours. And so these are some of the things that we're doing to prepare for this explosion of data and making sure that we're in a position where we're not spending a dollar to gain a penny if that makes sense. >> That's his business. Well, he makes fun of that business model. >> I think you could do it, you want to drive revenue to sell dollars for 90 cents. >> That's the dot com model, I was there. >> Exactly, and make it up in volume. All right, Darren Chenin, thanks for taking a few minutes out of your day and giving us the story on Malwarebytes, sounds pretty exciting and a great opportunity. >> Thanks, I enjoyed it. >> Absolutely, he's Darren, he's George, I'm Jeff, you're watching The Cube. We're at Big Data SV at the Historic Pagoda Lounge. Thanks for watching, we'll be right back after this short break. (upbeat techno music)

Published Date : Mar 15 2017

SUMMARY :

it's The Cube, and evolutions now with machine learning So for folks that aren't and he reached out into the community and, and how important that is to the company and so we've always offered a free version And that's now given you the ability it so that we can get what do you have to do to become and we have a much more complete sweep and that's really the secret the results on either and so we did need a little bit, and you've got the ability to do the ETL that we use to write out to S3. and because of the fact that the data and that's one of the reasons it's not one or the other, and if you do put in all the and opportunities to hi-jack, if you will? I don't think I'm qualified to speak and connected devices, or the address infinity and how are we going to Well, he makes fun of that business model. I think you could do it, and giving us the story on Malwarebytes, the Historic Pagoda Lounge.

SENTIMENT ANALYSIS :

ENTITIES

EntityCategoryConfidence
JeffPERSON

0.99+

Darren ChinenPERSON

0.99+

DarrenPERSON

0.99+

Jeff FrickPERSON

0.99+

Darren CheninPERSON

0.99+

GeorgePERSON

0.99+

Jay KrepsPERSON

0.99+

90 centsQUANTITY

0.99+

two sidesQUANTITY

0.99+

AppleORGANIZATION

0.99+

AthenaLOCATION

0.99+

MarcinPERSON

0.99+

AmazonORGANIZATION

0.99+

two placesQUANTITY

0.99+

San JoseLOCATION

0.99+

BMCORGANIZATION

0.99+

eight yearsQUANTITY

0.99+

San Jose, CaliforniaLOCATION

0.99+

first linesQUANTITY

0.99+

MalwarebytesORGANIZATION

0.99+

KafkaTITLE

0.99+

oneQUANTITY

0.99+

10 monthsQUANTITY

0.99+

Kafka ConnectTITLE

0.99+

OracleORGANIZATION

0.99+

LambdaTITLE

0.99+

firstQUANTITY

0.99+

second thingQUANTITY

0.99+

GenePERSON

0.99+

ExcelTITLE

0.99+

ConfluentORGANIZATION

0.99+

The CubeTITLE

0.98+

first lineQUANTITY

0.98+

27QUANTITY

0.97+

26QUANTITY

0.97+

RedisTITLE

0.97+

Kafka StreamsTITLE

0.97+

S3TITLE

0.97+

18QUANTITY

0.96+

14 years oldQUANTITY

0.96+

18 years oldQUANTITY

0.96+

about two hoursQUANTITY

0.96+

g agoDATE

0.96+

ConnectTITLE

0.96+

second-positionQUANTITY

0.95+

HBaseTITLE

0.95+

first thingQUANTITY

0.95+

Historic Pagoda LoungeLOCATION

0.94+

bothQUANTITY

0.93+

two different infrastructuresQUANTITY

0.92+

S3COMMERCIAL_ITEM

0.91+

Big DataEVENT

0.9+

The CubeORGANIZATION

0.88+

Lambda TwoTITLE

0.87+

Malwarebytes 3.0TITLE

0.84+

AirflowTITLE

0.83+

a year agoDATE

0.83+

second-opinionQUANTITY

0.82+

hundreds of millions ofQUANTITY

0.78+