Rob Skillington & Martin Mao, Chronosphere | KubeCon + CloudNativeCon NA 2019
>> Narrator: Live from San Diego, California, it's theCUBE! Covering KubeCon and CloudNativeCon, brought to you by Red Hat and the Cloud Native Computing Foundation. >> Welcome back. 12,000 here in attendance for KubeCon CloudNativeCon 2019 in San Diego. I'm Stu Miniman, and my cohost for this afternoon is John Troyer. Happy to welcome to the program, recently out of stealth, two gentlemen from Chronosphere. To my right is Martin Mao, who is the co-founder and CEO, and his co-founder Rob Skillington, who is also the CTO. As we've stated on theCUBE, you understand where this conference is, where co-founder and CTO is, you know, about the most prominent title that we've seen to get on here, because that's the type of geeks we love on the program and in this community. So first of all, congratulations on the launch. >> Thank you so much. >> And thank you so much for joining us. >> No worries. >> All right, when I've got the founders on, I'm going to start with the why. What was the problem statement, where were you coming from, and what led to the creation of Chronosphere? >> For sure, for sure. So with Chronosphere we found an actual gap in the monitoring market, and it's a very crowded monitoring market. The gap exists when companies with very large, complex technology stacks, or large enterprises, move on to cloud native technology and Kubernetes. With this migration, what we found was there's actually a lot more monitoring data being produced, because there are a lot more pieces now. We're moving from monoliths to microservices, we're moving from physical machines to VMs, to containers and pods. And that generates a lot more things that you need to monitor and track. And not only a lot more things, but you're generally monitoring the relationships between these things. So as the number of things increases, the number of relationships exponentially increases. So that's the sort of problem we're solving: monitoring all of these things at large scale, when we couldn't find anything that could even store all of these things. >> All right, so what is the background of the team that put you in a position to work on this problem? >> Yeah, great question. I mean, Martin and I go back quite a few years. I officiated his wedding, only very recently actually. And we basically worked together at several different companies. You know, I think both of us are entrepreneurial at heart. I'll let Martin talk a little bit more about the last few years. >> Yeah, so a few years ago we started working at Uber. And at Uber, we went through this migration to cloud native and Kubernetes, and through that migration, that's when we had to solve the problem ourselves. And we solved the problem at Uber with an open-source project called M3. That's really where this whole thing started. And Chronosphere is building on top of M3, now providing a product on top of the open-source platform that we created. >> Can we talk a little bit about the business? I noticed that there are many ways of approaching open source in 2019, you know, open core, but also as a service. So can you talk a little bit about how you've approached your business model? >> Yeah, for sure.
So we're very much in the camp of as a service, right? A lot of companies do do open core, and they're going into the enterprise support model; we didn't want to go down that route. And also, our open-source product is not really an end-to-end solution in itself. You can use open-source M3, but you still need to plug it together with other things yourself. So what we really wanted to do was give customers an end-to-end solution, built on top of the great technology we created with M3, but really solving the problem end to end, and we do that best as a service. >> Rob, maybe you can help explain M3 a little bit for us, as to how that fits in the landscape, what it works with, and the like. >> Yeah, of course. At its heart it's a metrics platform, built on, at the lowest layer, M3DB, which is a distributed time series database. And then on top of that, we have basically an aggregation platform that is aggregating a lot of the samples and metrics that we're collecting. So we can do some transformations on the data as it comes in, before it's stored in the database itself. And this lets us do a lot of smart processing of which signals actually matter and which signals don't, storing them in a way that can be accessed much faster than other typical systems that don't do any aggregation before the data gets stored. And then, of course, we have a query engine that works with this distributed set of data. So it's really a database that was designed from day one to be a metric store. It's not built on Cassandra, it doesn't use RocksDB at the lower layers; literally every part of it was built for this purpose. >> Can you talk a little bit about dimensionality and cardinality? Because as I look at this observability and monitoring space, I see a lot of current discussion about that, and frankly a little bit of fighting. I can kind of see why it's important, but what are some of the reasons, what do people get by having it, and what is it actually? Let's start with that. >> Yeah, for sure. So this hot topic of high cardinality and high dimensionality is what I was talking about earlier: as you move into the cloud native world, you're now monitoring things at a pod level. So instead of tracking things on a per-host level, you're now tracking things on a per-pod level. >> (interjects) You're tracking more things per pod. >> More things per pod, and every pod, these are ephemeral pods now, so they don't live for very long. So you end up having more pieces of data, and they're kept around for a shorter period of time. And now you need a system that can store all of these pieces of data, because you want to see them uniquely. You want to monitor each individual pod to see exactly what is running at the finest level. So you actually need technology that can store a lot more data than you could before. >> And, you know, adding to that, a lot more people are running things like mobile applications in markets all around the world, using different cell providers and different backend services. You may deploy your backend services multiple times a week, or even a day, and if you want to tag that metadata, and slice and dice your business, your applications, and your systems by that metadata, that requires adding yet another dimension to your data, which adds to that cardinality. Every time you add a dimension, that just multiplies the cardinality of your existing set of monitoring data. >> And it quickly adds up a lot, right?
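Rob's dimension math is worth making concrete, since it is the heart of the cardinality problem. Below is a minimal back-of-the-envelope sketch; the dimension names and counts are entirely made up for illustration, but the multiplication is the point:

```python
from math import prod

# Hypothetical label dimensions on a single metric. Every distinct
# combination of label values is its own time series in the store.
dimensions = {
    "service": 50,
    "endpoint": 30,
    "status_code": 10,
    "region": 6,
    "host": 500,  # per-host monitoring
}
host_level = prod(dimensions.values())
print(f"series for one metric, host level: {host_level:,}")  # 45,000,000

# Move to Kubernetes: hosts become far more numerous, ephemeral pods,
# and tagging by release (multiple deploys a week) adds a dimension.
dimensions["host"] = 5000   # pods instead of hosts
dimensions["release"] = 20  # recent releases still retained
pod_level = prod(dimensions.values())
print(f"series for one metric, pod level:  {pod_level:,}")  # 9,000,000,000
print(f"growth: {pod_level // host_level}x")                # 200x
```

One added dimension and an order of magnitude more instances take a single metric from tens of millions of series to billions, which is the regime the M3 work at Uber was built for.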
>> All right Martin, since you're just out of stealth, give us some of the speeds and feeds. You know, is the product GA, is it globally available? Series A funding, who's behind that? >> Yeah, so we just came out of stealth two weeks ago, and we closed our Series A a few months ago actually. It was led by Greylock, we raised 11 million dollars, and our partner at Greylock is Jerry, and we like him very much. And the state of the company is that we are currently in private beta right now. So with our hosted platform, we are onboarding customers into a private offering right now. And early next year, we'll open that up for a more public beta. >> And the way folks would use this: you'll be using Prometheus or Graphite or something, and so you'd have tracing, you'd have logs, you'd have other things, and you would be plugging all of them into your services. >> Yeah, it's a great question. So you mentioned two of the technologies. Prometheus and Graphite time series metrics can both be pushed into the M3 system, for sure. We actually just announced a trace integration this week at KubeCon; Rob spoke about that integration earlier this week. We haven't moved into logs yet, because the way we look at the problem is not from the perspective of providing a one-stop shop for all observability; we look at it from a use case perspective. The use case we're looking at is real-time monitoring and remediation. Tracing is a critical part of that story, to add additional context when you get alerted based on your metrics, but we haven't quite moved into logging yet. >> Yeah, and we don't really want to solve any of these problems without knowing it'll work at scale. A fundamental reason we even built the open-source project in the first place was that we were dealing with cardinality in the tens of billions of unique time series. So we don't want to just roll out every single feature under the sun; we really want to solve it once, correctly, and be able to systematically roll that out to enterprises at scale. >> Without talking too much about Uber and any Uber secrets, it seems like the game has changed with that kind of scale. You can't run Uber if you're tracking all those cars, literally, without some sort of high-cardinality system, right? Because you're literally tracking cars all over the world, people all over the world, routes all over the world. >> Exactly. We were uniquely positioned; we had the requirement to solve it at such a scale, and that's why we had to build this technology, because the technologies that came before did not really have this use case to solve.
So that's why we couldn't find anything out in the market, and why we had to build our own, to uniquely solve it for this use case. >> And I would add to that, that typically engineers at larger organizations tend to want to organize everything very nicely, and split it up, and really control how they're monitoring that data. But we've noticed, definitely over the last few years, that more and more people are open to letting teams just start collecting, you know, arbitrary data that is relevant to the systems they're building as they roll them out, even as they're experimenting with them. And systems built from scratch to be as efficient as possible with very unstructured data are becoming wildly popular, because that's how developers want to develop software. They don't want to have to slice and dice it neatly and package it up and pass it on to others to run. They want to slice and dice however they want, dynamically, as they scale up. >> I've always enjoyed every SQL schema I've had to go change, oh yeah. (laughter) >> All right, how have you found the show? How's the reception been? Give us a little bit of the vibe of the show and how it's been going for you. >> Yeah, it's been fantastic for us actually. We just came out of stealth, so the name is still quite new, but we've had a bunch of folks stopping by the whole day, we've been giving demos of the product, and a lot of companies are getting excited about it. I think we're solving at a scale that really resonates with a lot of the people here at the show, and we're solving it in a cost-efficient way as well. So that's really been our sweet spot so far. >> Yeah, Rob, you gave some sessions. What kind of feedback are you getting from people? Is the problem statement that we talked about at the beginning resonating with the people you talk to? >> I was really pleased to hear, after my session today, that a lot of people came up to me and said, you know, I've never really seen metrics linked to traces the way that you're doing it. In fact, that was the first time they'd ever seen a demo that can do what we're trying to upstream; we're actually upstreaming a lot of those changes in the open source as well, at the same time. And we've found, especially in a lot of the companies today that are pushing development and operations forward, that they're using a lot of packages from open source, and those packages are battle tested in open source; generally a package becomes abstracted to the point where it's usable by a very large number of people, but then when they need to scale it up, that's when it becomes difficult. So I think a lot of people have been very positive about us being able to push those features >> Back upstream into the M3 project. >> And also into Prometheus. You know, I'm an OpenMetrics contributor, and that's essentially an exposition format that's built on the Prometheus exposition format. So it's become a standard way of exchanging metrics from one system to another. And that's basically commoditized and democratized the exchange of metrics, to make a lot more systems interoperable with one another. Which we fundamentally believe in as well; of course we're developing in open source, and we believe that these systems need to play nicely together, so we can have building blocks that large companies and organizations can all share and build better things on top of.
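For context, the exposition format Rob mentions is the plain-text format Prometheus scrapes over HTTP, which OpenMetrics grew out of. A minimal sketch using the official Python client library (the metric and label names here are invented for illustration):

```python
import random
import time

from prometheus_client import Counter, start_http_server

# One counter, two label dimensions; each distinct label combination
# becomes its own time series on the /metrics endpoint.
REQUESTS = Counter(
    "app_requests_total",
    "Requests handled, by endpoint and status code.",
    ["endpoint", "status"],
)

if __name__ == "__main__":
    start_http_server(8000)  # serves http://localhost:8000/metrics
    while True:
        status = random.choice(["200", "500"])
        REQUESTS.labels(endpoint="/checkout", status=status).inc()
        time.sleep(1)

# A scrape of /metrics returns plain text like:
#   # HELP app_requests_total Requests handled, by endpoint and status code.
#   # TYPE app_requests_total counter
#   app_requests_total{endpoint="/checkout",status="200"} 42.0
```

Because the format is standardized, anything that speaks it, whether a Prometheus server or a remote store such as M3 sitting behind one, can consume the same metrics, which is the interoperability Rob is describing.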
>> All right, so you're looking to go to public beta in early 2020, as we said. When we come back in 2020, what are some of the key KPIs and metrics you'll be looking at to be successful in your first year out of stealth? >> Yeah, it's a great question. So some of the KPIs we're looking at: getting to the public beta, making it available to a large range of companies, because right now we're onboarding companies one or two at a time. So it's seeing how many companies adopt the product, and also, we're adding more features over time for that particular use case of monitoring your technology stack and your business in real time. So it'll be a lot more features coming down the pipeline, and a lot more customer adoption along with that. >> And I would also say our hosted platform is really about offering deep isolation between our tenants as well. So in the next few months to come, we want to make sure that it works basically like clockwork, and that we can roll out and scale that highly isolated platform for tens and hundreds of organizations, and thousands eventually. And doing that at scale is hard. So I think, yeah, we'll see how we're doing with that. >> Yeah, for sure. >> All right. Rob, Martin, congratulations on coming out of stealth. We look forward to hearing more, and thank you so much for joining us. >> Glad to, thank you so much. >> All right, for John Troyer I'm Stu Miniman. We'll be back; we're getting towards the end of three days of wall-to-wall coverage here at KubeCon, CloudNativeCon. Thanks for watching. (upbeat music)
Bryan Duxbury, StreamSets | Spark Summit East 2017
>> Announcer: Live from Boston, Massachusetts, this is theCUBE, covering Spark Summit East 2017. Brought to you by Databricks. Now here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to snowy Boston everybody. This is theCUBE, the leader in live tech coverage. This is Spark Summit, Spark Summit East, #SparkSummit. Bryan Duxbury's here. He's the vice president of engineering at StreamSets. Cleveland boy! Welcome to theCUBE. >> Thanks for having me. >> You're very welcome. Tell us, let's start with StreamSets. We're going to talk about Spark, some of the use cases it's enabling, and some of the integrations you're doing. But what does StreamSets do? >> Sure. StreamSets is data movement software. I like to think of it as either the first mile or the last mile of a lot of different analytical or data movement workflows. Basically, we build a product that allows you to build a workflow, or a data pipeline, that doesn't require you to code. It's a graphical user interface for dropping an origin, several destinations, and then lightweight transformations onto a canvas. You click play and it runs. This is kind of different from a lot of the market today, which is a programming tool or a command line tool that still requires your systems engineers, or your unfortunate data scientists pretending to be systems engineers, to do systems engineering, to do a science project to figure out how to move data. I think it's often underplayed how challenging data movement is. It's extremely tedious work. You have to connect to dozens or hundreds of different data sources: totally different schemas, different database drivers, or different systems altogether. And it breaks all the time. So the home-built stuff is really challenging to keep online, and when it goes down, you're not moving data, and you can't actually get the insights you built it for in the first place. >> I remember I broke into this industry, you know, in the days of the mainframe. You used to read about them, and they had this high-speed data mover. It was this key component, and it had to be integrated, and it had to be able to move, back then, large amounts of data fast. Today, especially with the advent of Hadoop, people say okay, don't move the data, keep it in place. Now that's not always practical. So talk about the sort of business case for starting a company that basically moves data. >> We handle basically the one step before.
It's their job to make the data valuable, right? Data movement is like a utility; in providing the utility, really the thing to do is be productive and cost effective. So the reason we built StreamSets, the reason this thing is a thing in the first place, is because we think people shouldn't be in the business of building data movement tools. They should be in the business of moving their data and then getting on with it. Does that make sense? >> Yeah, absolutely. >> So talk about how it all fits in with Spark generally, and specifically Spark coming to the enterprise. >> Well, in terms of how StreamSets connects to stuff, we deploy in every way you can imagine, whether you want to run on premise, on your own machines, or in the cloud. It's up to you to deploy however you like; we're not prescriptive about that. We often get deployed on the edge of clusters, whether it's your Hadoop cluster or your Spark cluster, and basically we try not to get in the way of these analysis tools. There are many great analytical tools out there, and Spark is a great example. We focus really on the moving of data. So what you'll see is someone will build a Spark Streaming application, or some big Spark SQL thing that actually produces the reports, and we plug in ahead of that. So if your data is being collected from edge web logs, or some Kafka thing, or a third-party API, or scraping a website, we do the first collection, and then it's usually picked up from there by the next tool, whether it's Spark or other things. I'm trying to think of the right way to put this: I think people who write Spark should focus on the part that's the business value for them. They should be doing the thing that actually applies the machine learning model, or produces the report that the CEO or CTO wants to see, and move away from the ingest part of the business. Does that make sense? >> Yeah. >> Yeah. The Spark guys sort of aspire to that by saying you don't have to worry about exactly-once delivery. You know, you've got guarantees that it will get from point A to point B. >> Bryan: Yeah. >> Things like that. But all those sources of data and all those targets, writing all those adapters is, I mean, that's been a La Brea tar pit for many companies over time. >> In essence, that is our business. I think you touch on a good point. Spark can actually do some of these things, right? There's not complete, but significant, overlap in some cases. But the important difference is that Spark is a cluster tool for working with cluster data. We're not going to beat you running a Spark application for consuming from Kafka to do your analysis. But you want to use Spark for reading local files? Do you want to use Spark for reading from a mainframe? These are things that StreamSets is built for. And that library of connectors you're talking about, it's our bread and butter. It's not your job as a data scientist applying Spark to build a library of connectors. So actually the challenge is not the difficulty of building any one connector, because we have that down to an art now. It's that we can afford to invest, we can build a portfolio of connectors, while you as a user of Spark can only afford to do it on demand, reactively. And that turnaround time, or the cost it might take you to build that connector, is pretty significant. And actually, I often see the flip side.
This is a problem I faced at Square, which was that when people asked me to integrate new data sources, I had to say no, because it was too rare, too unusual for what we had to do; we had other things to support. The problem with that is that I have no idea what kind of opportunity cost I left behind: what kind of data we didn't get, what kind of analysis we couldn't do. With an approach like StreamSets, you can solve that problem up front, even. >> So, sort of two follow-ups. One is, it would seem to be an evergreen effort to maintain the existing connectors. >> Bryan: Certainly. >> And two, is there a way to leverage connectors that others have built, like the Kafka Connect type stuff? >> Truthfully, we are a heavy-duty user of open source software, so our actual product, if you dig into it, is a framework for executing pipelines and for connecting other software into our product. So it's not like when we integrate Kafka we build a brand new, blue-sky Kafka connector. We actually integrate what's out there. Our idea is to bring as much of that stuff in as we can, and really be part of the community. Our product is also open source, so we play well with the community. We have had people contribute connectors, people who say we love the product, we need it to connect to this other database, and then they do it for us. So it's been a pretty exciting situation. >> We were talking earlier off-camera; George and I have been talking all week about batch workloads, interactive workloads, and now you've got these new emerging continuous streaming workloads, which is in the name. What are you seeing there? And what kind of use cases is that enabling? >> So we're focused mostly on the continuous delivery workload; we also handle the batch stuff. What we're finding is that people are moving farther and farther away from batch in general, because batch was not the goal, it was a means to the end. People wanted to get their data into their environment so they could do their analysis. They want to run their daily reports, things like that. But ask any data scientist: they would rather the data show up immediately. So we're definitely seeing a lot of customers who want to do things like moving data live from a log file into Hadoop, where they can read it immediately, on the order of minutes. We're trying to do our best to enable those kinds of use cases. In particular we're seeing a lot of interest in the Spark arena; obviously that's kind of why we're here today. People want to add their event processing, or their aggregation and analysis, like Spark, especially Spark SQL, and they want that to be almost happening at the time of ingest. Not once it's landed, but as it's happening. So we're starting to build integration. We have our foot in the door there with our Spark processor, which allows you to put a Spark workflow right in the middle of your data pipeline, or as many of them as you want, in fact. And we manage the lifecycle of that, and do all the connections required to make your pipeline pretend to have a Spark processor in the middle. We really think that with that kind of workload, you can do your ingest, but you can also capture your real-time analytics along the way. That doesn't replace batch reporting, say, that happens after the fact, or your daily reports, or what have you. But it makes it that much easier for your data scientists to have, you know, a piece of intelligence that they had in flight.
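To make the "analytics at ingest time" idea concrete, here is a rough Spark Structured Streaming sketch in PySpark. To be clear, this illustrates the general pattern Bryan describes, not StreamSets' actual Spark processor API; the broker address, topic, and paths are invented, and the Kafka source requires Spark's Kafka connector package at runtime:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ingest-time-analytics").getOrCreate()

# Ingest: a continuous stream of log lines from a hypothetical Kafka topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "web-logs")
    .load()
    .selectExpr("CAST(value AS STRING) AS line", "timestamp")
)

# In-flight analytics: errors per minute, computed while the data moves,
# rather than by a batch job after the data has landed.
errors_per_minute = (
    events.filter(F.col("line").contains(" 500 "))
    .groupBy(F.window("timestamp", "1 minute"))
    .count()
)

# The raw feed still lands for after-the-fact batch reporting...
raw_sink = (
    events.writeStream.format("parquet")
    .option("path", "/data/web-logs")
    .option("checkpointLocation", "/checkpoints/web-logs")
    .start()
)

# ...while the aggregate is available the moment it is computed.
agg_sink = (
    errors_per_minute.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)

spark.streams.awaitAnyTermination()
```

The two sinks mirror the point in the interview: ingest and real-time analytics happen along the way, and batch reporting over the landed data still happens after the fact.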
>> I love talking to someone who's a practitioner now working for a company that's selling technology. What do you see, from both perspectives, as Spark being good at? You know, what's the best fit? And what's it not good at? >> Well, I think that Spark is following the arc of Hadoop, basically. It started out as infrastructure for engineers, for building really big, scary things. But it's becoming more and more a productivity tool for analysts, data scientists, and machine learning experts, and we see that popping up all the time. It's really exciting, frankly, to think about the streaming analytics that can happen, the scoring of machine learning models, really bringing a lot more power into the hands of people who are not engineers, people who are much more focused on the semantic value of the data, and not the garbage-in-garbage-out side of the data. >> You were talking before about how it's really hard, data movement, and the data's not always right. Data quality continues to be a challenge. >> Bryan: Yeah. >> Maybe comment on that: the state of data quality and how the industry is dealing with that problem. >> It is hard, it is hard. I think the traditional approach to data quality is to try to specify the quality up front. We take the opposite approach. We basically say that it's impossible to know that your data will be correct at all times. So we have what we call schema drift tools. We take an intent-driven approach when interacting with your data, rather than a schema-driven approach. Of course your data has an implicit schema as it's passing through the pipeline, but rather than saying, let's transform column three, we want you to use the name. We want you to be aware of what it is you're trying to actually change and affect, and the rest just flows along with it. There's no magic bullet for every kind of data quality issue or schema change that could possibly come into your pipeline. We try our best to make it easy for you to do, effectively, the best practice: the thing that will survive the future and build robust data pipelines. This is one of the biggest challenges with home-grown solutions: it's really easy to build something that works, but it's not easy to build something that works all the time. It's very easy to not imagine the edge cases, because it might take you a year until you've actually encountered, you know, the first big problem, the gotcha that you didn't consider when you were building your own thing. Those of us at StreamSets who have been in the industry and on the user side have had some of these experiences, so we're trying to export that knowledge into the product.
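A tiny illustration of the positional-versus-name distinction Bryan draws ("transform column three" versus "use the name"); the record shapes are invented:

```python
# Two batches from the same source. Between deploys, upstream added
# fields and reordered columns: classic schema drift.
v1 = {"user": "ana", "amount_usd": 12.50, "city": "Boston"}
v2 = {"ts": "2017-02-08", "user": "ana", "city": "Boston",
      "amount_usd": 12.50, "coupon": "WINTER17"}

def cents_by_position(record):
    # Schema-driven: "transform column two". Right for v1, but it
    # silently grabs the wrong field once the layout drifts.
    return list(record.values())[1] * 100

def cents_by_name(record):
    # Intent-driven: "use the name". Added or reordered fields just
    # flow along; only the field we actually care about is touched.
    return record["amount_usd"] * 100

print(cents_by_position(v1))                 # 1250.0
print(cents_by_position(v2))                 # 'anaana...': garbage, no error
print(cents_by_name(v1), cents_by_name(v2))  # 1250.0 1250.0
```

The failure mode on the middle line is the dangerous one: the positional pipeline keeps "working" while producing wrong output, which is exactly the kind of gotcha that may not surface for a year.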
>> Dave: Who do you guys sell to? >> Everybody. (laughing) We see a lot of success today with what we call Hadoop replatforming, which is people who are moving from a huge variety of data sources into a Hadoop data lake kind of environment. Also cloud: people are moving into the cloud, and they need a way for their data to get from wherever it is to where they want it to be. And certainly people could script these things manually, they could build their own tools for this, but it's just so much more productive to do it quickly in a UI. >> Is it an architect who's buying your product? Is it a developer? >> It's a variety. I think our product resonates greatly with a developer, but also with people who are higher up in the chain, people who are trying to design their whole topology. The thing I love to talk about is that everyone, when they start on a data project, sits down and draws this beautiful diagram with boxes and arrows that says here's where the data's going to go. But a month later, it works, kind of, but it's never that thing. >> Dave: Yeah, because the data is just everywhere. >> Exactly. And the reality is that what you have to do to make it work correctly, within SLA guidelines and things like that, is so not what you imagined. But then you can almost never go backwards. You can never say, based on what I have, give me the boxes and arrows, because that's a systems analysis effort no one has the time to engage in. But since StreamSets actually instruments every step of the pipeline, and we have a view into how all your pipelines actually fit together, we can give you that. We can just generate it. So we actually have a product for this: we've been talking about the StreamSets Data Collector, which is the core data movement product, and we have our enterprise edition, which is called Dataflow Performance Manager, or DPM. It gives you a lot of collaboration, enterprise-grade authentication, access control, and command-and-control features. It aggregates your metrics across all your data collectors and helps you visualize your topology. So people like your director of analytics, or your CIO, who want to know, is everything okay? We have a dashboard for them now. And that's really powerful. It's a beautiful UI, and it's really a platform for us to build visualizations with more intelligence that look across your whole infrastructure. >> Dave: That's good. >> Yeah. And then the thing is, this is strangely kind of unprecedented. Because, you know, again, the engineer who wants to build this himself would say, I could just deploy Graphite, and all of a sudden I've got graphs, it's fine, right? But they're missing the details. What about the systems that aren't under your control? What about the failure cases? All these things, these are the things we tackle. 'Cause it's our business; we can afford to invest massively and make this a really first-class data engineering environment. >> Would it be fair to say that Kafka, as it exists today, is just data movement built on a log, but it doesn't do the analytics, and it doesn't really yet, maybe it's just beginning to, do some of the monitoring, you know, with a dashboard, or that's a statement of direction? Would it be fair to say that you can layer on top of that? Or you can substitute on top of it with all the analytics? And then when you want the really fancy analytics stuff, you know, call out to Spark. >> Sure. I would say that for one thing, we definitely want to stay out of the analytics space. We think there are many great analytics tools out there, like Spark. We're also not a storage tool. In fact, we're queue-like, but we view ourselves more like this: if there's a pipe and a pump, we're the pump, and Kafka is the pipe. I think from a monitoring perspective, we monitor Kafka indirectly, because if we know what's going in, and we know what's coming out later, we can give you the stats. And that's actually what's important. This is actually one of the challenges of having a home-grown or disconnected solution: stitching it together so you understand the end to end is extremely difficult. Because if you have a relational database, and a Kafka, and a Hadoop, and a Spark job, sure, you can monitor all those things, they all have their own UIs, but if you can't understand what the state is of the whole system, you're left with four windows open trying to figure out where things connect. And it's just too difficult.
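Bryan's "pump, not pipe" monitoring point reduces to deriving pipeline health from counts observed at the boundaries. A toy sketch, with invented field names, of the kind of stats that fall out of knowing only what went in and what came out:

```python
from dataclasses import dataclass

@dataclass
class StageCounters:
    name: str
    records_in: int
    records_out: int
    errors: int
    window_secs: float

def stage_stats(c: StageCounters) -> dict:
    # Throughput, error rate, and loss follow from boundary counts
    # alone; no need to peek inside Kafka, Hadoop, or Spark itself.
    return {
        "stage": c.name,
        "throughput_rps": c.records_out / c.window_secs,
        "error_rate": c.errors / max(c.records_in, 1),
        "unaccounted": c.records_in - c.records_out - c.errors,
    }

print(stage_stats(StageCounters("kafka->hdfs", 10_000, 9_950, 25, 60.0)))
# {'stage': 'kafka->hdfs', 'throughput_rps': 165.83...,
#  'error_rate': 0.0025, 'unaccounted': 25}
```

Aggregating these per-stage numbers across dozens of data collectors is essentially the unified view DPM is described as providing.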
>> So just from a positioning point of view, for someone who's trying to make sense out of all the choices they have, to what extent would you call yourself a management framework for someone who's building these pipelines, whether from scratch or buying components? And to what extent is it, I guess, when you talk about a pump, that would be almost like the runtime part of it. >> Bryan: Yeah, yeah. >> So you know, there's a control plane and then there's a data plane. >> Bryan: Sure. >> What's the mix? >> Yeah, well, we do both, for sure. I would say that the data plane for us is StreamSets Data Collector. We move data, we physically move the data. We have our own internal pipeline execution engine, so it doesn't presuppose any other existing technologies; it's not dependent on Hadoop or Spark or Kafka or anything. To some degree, Data Collector is also the control plane for small deployments, because it does give you start-to-stop command and control, some metrics monitoring, things like that. Now, people need to expand beyond the realm of a single data collector when they have enterprises with more than one business unit, or data center, or security zone, things like that. You don't just deploy one data collector, you deploy a bunch, dozens or hundreds. And in that case, that's where Dataflow Performance Manager again comes in, as that control plane. Now, Dataflow Performance Manager has no data in it; it does not pass your actual business data. But it does aggregate all of your metrics from all your data collectors and gives you a unified view across your whole enterprise. >> And one more follow-up along those lines. When you have a multi-vendor stack, or a multi-vendor pipeline. >> Bryan: Yeah. >> What gives you the meta view? >> Well, we're at the ins and outs. We see the interfaces. So in theory, someone could consume data out of Kafka and do something with it, right? And then there's another job later, like a Spark job. >> George: Yeah. >> So we don't have automatic visibility for that. But our plan in the future is to expand Dataflow Performance Manager to take third-party metric sources, effectively, to broaden the view of your entire enterprise. >> You've got a bunch of stuff on your website here, which is kind of interesting, talking about some of the things we talked about. Taming data drift is one of your papers, the silent killer of data integrity. And some other good resources. So just in sort of closing, how do we learn more? What would you suggest? >> Sure, yeah, please visit the website. The product is open source and free to download; Data Collector is free to download. I would encourage people to try it out; it's really easy to take for a spin. And if you love it, you should check out our community. We have a very active Slack channel and Google group, which you can find from the website as well. And there's also a blog full of tutorials. >> Yeah, well, you're solving gnarly problems that a lot of companies just don't want to deal with.
That's good, thanks for doing the dirty work. We appreciate it. >> Yeah, my pleasure. >> All right, Bryan, thanks for coming on theCUBE. >> Thanks for having me. >> Good to see you. You're welcome. Keep right there, buddy, we'll be back with our next guest. This is theCUBE, we're live from Boston, Spark Summit, Spark Summit East, #SparkSummit. We'll be right back. >> Narrator: Since the dawn.