

Bryan Duxbury, StreamSets | Spark Summit East 2017


 

>> Announcer: Live from Boston, Massachusetts, this is "The Cube," covering Spark Summit East 2017. Brought to you by Databricks. Now here are your hosts, Dave Volante and George Gilbert.
>> Welcome back to snowy Boston, everybody. This is "The Cube," the leader in live tech coverage. This is Spark Summit, Spark Summit East, #SparkSummit. Bryan Duxbury's here. He's the vice president of engineering at StreamSets. Cleveland boy! Welcome to "The Cube."
>> Thanks for having me.
>> You're very welcome. Tell us, let's start with StreamSets. We're going to talk about Spark and some of the use cases it's enabling and some of the integrations you're doing. But what does StreamSets do?
>> Sure. StreamSets is data movement software. I like to think of it as either the first mile or the last mile of a lot of different analytical or data movement workflows. Basically, we build a product that allows you to build a workflow, or build a data pipeline, that doesn't require you to code. It's a graphical user interface for dropping an origin, several destinations, and then lightweight transformations onto a canvas. You click play and it runs. This is kind of different from a lot of the market today, which is a programming tool or a command-line tool that still requires your systems engineers, or your unfortunate data scientists pretending to be systems engineers, to do systems engineering, to do a science project to figure out how to move data. It's often underplayed how challenging data movement is, but it's extremely tedious work. You have to connect to dozens or hundreds of different data sources, totally different schemas, different database drivers, or different systems altogether. And it breaks all the time. So the home-built stuff is really challenging to keep online, and when it goes down, you're not moving data, and you can't actually get the insights you built it for in the first place.
>> I remember when I broke into this industry, you know, in the days of the mainframe. You used to read about them, and they had this high-speed data mover. It was this key component, and it had to be integrated. It had to be able to move what was, back then, large amounts of data fast. Today, especially with the advent of Hadoop, people say okay, don't move the data, keep it in place. Now that's not always practical. So talk about the business case for starting a company that basically moves data.
>> We handle basically the one step before. I agree with you completely: in many data analytics situations today, where you're doing the true business-oriented work, where you're actually analyzing data and producing value, you can do it in place. Which is to say in your cluster, in your Spark cluster, all the different environments you can imagine. The problem is that if the data's not there already, it's a pretty monumental effort to get it there. A lot of people think, oh, I can just write a SQL script, right? And that works for the first two to 20 tables you want to deploy. But for instance, in my background, I used to work at Square. I ran a data platform there. We had 500 tables we had to move on a regular basis, coupled with a whole variety of other data sources. So at some point it becomes really impractical to hand-code these solutions. And even when you build your own framework, and you start to build tools internally, it's not really the job of these companies to build a world-class data movement tool.
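To make that scale problem concrete, here is a minimal sketch of the kind of hand-coded, table-by-table export Duxbury describes. The database file, table names, and output directory are all hypothetical; the point is that it is fine for two tables and painful for five hundred, because every new source means more of this code to write and keep alive.

```python
# A minimal sketch of the hand-coded, table-by-table approach described above.
# Workable for a couple of tables; unmaintainable for hundreds of sources.
import csv
import sqlite3
from pathlib import Path

SOURCE_DB = "source.db"          # hypothetical source database
TABLES = ["orders", "payments"]  # fine for 2 tables; imagine 500
OUT_DIR = Path("landing_zone")   # hypothetical landing directory

def export_table(conn, table):
    """Dump one table to CSV; schema changes or driver quirks break this quietly."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    OUT_DIR.mkdir(exist_ok=True)
    with open(OUT_DIR / f"{table}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(cols)
        writer.writerows(cur.fetchall())

if __name__ == "__main__":
    with sqlite3.connect(SOURCE_DB) as conn:
        for table in TABLES:
            export_table(conn, table)
```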
Their job is to make the data valuable, right? Data movement is like a utility, and with a utility, really the thing to do is be productive and cost-effective. So the reason we built StreamSets, the reason this thing exists in the first place, is because we think people shouldn't be in the business of building data movement tools. They should be in the business of moving their data and then getting on with it. Does that make sense?
>> Yeah, absolutely. So talk about how it all fits in with Spark generally, and specifically Spark coming to the enterprise.
>> Well, in terms of how StreamSets connects to stuff, we deploy in every way you can imagine, whether you want to run on premises, on your own machines, or in the cloud. It's up to you to deploy however you like; we're not prescriptive about that. We often get deployed on the edge of clusters, whether it's your Hadoop cluster or your Spark cluster. And basically we try not to get in the way of these analysis tools. There are many great analytical tools out there; Spark is a great example. We focus really on the moving of data. So what you'll see is someone will build a Spark Streaming application, or some big Spark SQL thing that actually produces the reports, and we plug in ahead of that. So if your data is being collected from, you know, edge web logs, or some Kafka topic, or a third-party API, or scraping a website, we do the first collection. And then it's usually picked up from there with the next tool, whether it's Spark or other things. I'm trying to think about the right way to put this. I think that people who write Spark should focus on the part that's the business value for them. They should be doing the thing that actually applies the machine learning model, or produces the report that the CEO or CTO wants to see, and move away from the ingest part of the business. Does that make sense?
>> Yeah.
>> Yeah. The Spark guys sort of aspire to that by saying you don't have to worry about exactly-once delivery, and you know, you've got guarantees that the data will get from point A to point B.
>> Bryan: Yeah.
>> Things like that. But all those sources of data and all those targets, writing all those adapters is, I mean, that's been a La Brea tar pit for many companies over time.
>> In essence, that is our business. I think you touch on a good point. Spark can actually do some of these things, right? There's not complete, but significant, overlap in some cases. But the important difference is that Spark is a cluster tool for working with cluster data. We're not going to beat you running a Spark application for consuming from Kafka to do your analysis. But do you want to use Spark for reading local files? Do you want to use Spark for reading from a mainframe? These are things that StreamSets is built for. And that library of connectors you're talking about, it's our bread and butter. It's not your job as a data scientist, you know, applying Spark, to build a library of connectors. So actually the challenge is not the difficulty of building any one connector, because we have that down to an art now. We can afford to invest, we can build a portfolio of connectors. But you as a user of Spark can only afford to do it on demand, reactively. And so that turnaround time, the cost it might take you to build that connector, is pretty significant. And actually I often see the flip side.
This is a problem I faced at Square: when people asked me to integrate new data sources, I had to say no, because it was too rare, too unusual for what we had to do. We had other things to support. The problem with that is I have no idea what kind of opportunity cost I left behind, what kind of data we didn't get, what kind of analysis we couldn't do. With an approach like StreamSets, you can solve that problem sort of up front, even.
>> So, sort of two follow-ups. One is, it would seem to be an evergreen effort to maintain the existing connectors.
>> Bryan: Certainly.
>> And two, is there a way to leverage connectors that others have built, like the Kafka Connect type stuff?
>> Truthfully, we are a heavy-duty user of open source software, so our actual product, if you dig into it, is a framework for executing pipelines and for connecting other software into our product. So it's not like when we integrate Kafka we build a brand new, blue-sky Kafka connector. We actually integrate the stuff that's out there. Our idea is to bring as much of that stuff in as we can, and really be part of the community. You know, our product is also open source, so we play well with the community. We have had people contribute connectors, people who say, we love the product, we need it to connect to this other database, and then they do it for us. So it's been a pretty exciting situation.
>> We were talking earlier off-camera, George and I have been talking all week about the batch workloads, the interactive workloads, and now you've got these new emerging workloads, continuous streaming workloads, which is in the name. What are you seeing there? And what kind of use cases is that enabling?
>> So we're focused mostly on the continuous delivery workload. We also handle the batch stuff. What we're finding is people are moving farther and farther away from batch in general, because batch was not the goal, it was a means to an end. People wanted to get their data into their environment so they could do their analysis. They want to run their daily reports, things like that. But ask any data scientist, they would rather the data show up immediately. So we're definitely seeing a lot of customers who want to do things like moving data live from a log file into Hadoop so they can read it immediately, on the order of minutes. We're trying to do our best to enable those kinds of use cases. In particular we're seeing a lot of interest in the Spark arena, which is obviously kind of why we're here today. People want to add their event processing, or their aggregation and analysis, with Spark, especially Spark SQL, and they want that to be happening almost at the time of ingest. Not once it's landed, but as it's happening. So we're starting to build integration. We have kind of our foot in the door there with our Spark processor, which allows you to put a Spark workflow right in the middle of your data pipeline, or as many of them as you want, in fact. And we sort of manage the lifecycle of that, and do all the connections required so your pipeline can have a Spark processor in the middle. We really think that with that kind of workload, you can do your ingest, but you can also capture your real-time analytics along the way. And that doesn't replace batch reporting per se; that'll happen after the fact, or your daily reports or what have you.
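For readers who want to picture that ingest-time analytics idea, here is a rough, standalone Spark Structured Streaming sketch, not StreamSets' actual Spark processor, that aggregates log events as they arrive from Kafka rather than after they land. The broker address, topic, and column names are hypothetical, and it assumes the Spark Kafka connector package is available in the environment.

```python
# A rough sketch of Spark SQL applied "in flight": count events per minute
# per status code as they stream in. All names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ingest-time-aggregation").getOrCreate()

# Read events as they arrive (broker and topic names are hypothetical).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "web-logs")
          .load())

# Lightweight analytics in flight, rather than once the data has landed.
parsed = events.selectExpr("CAST(value AS STRING) AS line", "timestamp")
counts = (parsed
          .withColumn("status", F.regexp_extract("line", r"\s(\d{3})\s", 1))
          .groupBy(F.window("timestamp", "1 minute"), "status")
          .count())

# Emit the running aggregates; batch reporting can still happen downstream.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```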
But it makes it that much easier for your data scientists to have, you know, a piece of intelligence that they got in flight. You know?
>> I love talking to someone who's a practitioner now working for a company that's selling technology. What do you see, from both perspectives, as Spark being good at? You know, what's the best fit? And what's it not good at?
>> Well, I think Spark is following the arc of Hadoop, basically. It started out as infrastructure for engineers, for building really big, scary things. But it's becoming more and more a productivity tool for analysts, data scientists, machine-learning experts, and we see that popping up all the time. And it's really exciting, frankly, to think about these streaming analytics that can happen, scoring machine-learning models, really bringing a lot more power into the hands of people who are not engineers, people who are much more focused on the semantic value of the data and not the garbage-in, garbage-out value of the data.
>> You were talking before about how data movement is really hard and the data's not always right. Data quality continues to be a challenge.
>> Bryan: Yeah.
>> Maybe comment on that, the state of data quality and how the industry is dealing with that problem.
>> It is hard, it is hard. I think the traditional approach to data quality is to try and specify quality up front. We take the opposite approach. We basically say that it's impossible to know that your data will be correct at all times. So we have what we call schema drift tools. We take an intent-driven approach to interacting with your data, rather than a schema-driven approach. Of course your data has an implicit schema as it's passing through the pipeline. But rather than saying, let's transform column three, we want you to use the name. We want you to be aware of what it is you're actually trying to change and affect, and the rest just kind of flows along with it. There's no magic bullet for every kind of data-quality issue or schema change that could possibly come into your pipeline. We try our best to make it easy for you to do, effectively, the best practice: the thing that will survive the future and build robust data pipelines. This is one of the biggest challenges, I think, with home-grown solutions: it's really easy to build something that works. It's not easy to build something that works all the time. It's very easy to not imagine the edge cases, 'cause it might take you a year until you've actually encountered, you know, the first big problem, the real gotcha that you didn't consider when you were building your own thing. And those of us at StreamSets who have been in the industry and on the user side, we've had some of these experiences. So we're trying to export that knowledge into the product.
>> Dave: Who do you guys sell to?
>> Everybody. (laughing) We see a lot of success today with what we call Hadoop replatforming, which is people who are moving from an environment with a huge variety of data sources into a Hadoop data lake kind of environment. Also cloud; people are moving into the cloud. They need a way for their data to get from wherever it is to where they want it to be. And certainly people could script these things manually, they could build their own tools for this, but it's just so much more productive to do it quickly in a UI.
>> Is it an architect who's buying your product? Is it a developer?
>> It's a variety.
So I think our product resonates greatly with a developer, but also with people who are higher up in the chain, people who are trying to design their whole topology. The thing I love to talk about is that everyone, when they start on a data project, sits down and draws this beautiful diagram with boxes and arrows that says here's where the data's going to go. But a month later, it works, kind of, but it's never that thing.
>> Dave: Yeah, because the data is just everywhere.
>> Exactly. And the reality is that what you have to do to make it work correctly, within SLA guidelines and things like that, is so not what you imagined. But then you can almost never go backwards. You can never say, based on what I have, give me that boxes-and-arrows picture, because it's a systems analysis effort that no one has the time to engage in. But since StreamSets actually instruments every step of the pipeline, and we have a view into how all your pipelines actually fit together, we can give you that. We can just generate it. So we actually have a product for that. We've been talking about StreamSets Data Collector, which is the core data movement product. We also have our enterprise edition, which is called Dataflow Performance Manager, or DPM. It basically gives you a lot of collaboration, enterprise-grade authentication, access control, and command and control features. It aggregates your metrics across all your data collectors, and it helps you visualize your topology. So people like your director of analytics, or your CIO, who want to know, is everything okay? We have a dashboard for them now. And that's really powerful. It's a beautiful UI, and it's really a platform for us to build visualizations with more intelligence that look across your whole infrastructure.
>> Dave: That's good.
>> Yeah. And the thing is, this is strangely kind of unprecedented. Because, you know, again, the engineer who wants to build this himself would say, I could just deploy Graphite, and all of a sudden I've got graphs, it's fine, right? But they're missing the details. What about the systems that aren't under your control? What about the failure cases? All these things, these are the things we tackle. 'Cause it's our business, we can afford to invest massively and make this a really first-class data engineering environment.
>> Would it be fair to say that Kafka, sort of as it exists today, is just data movement built on a log, but that it doesn't do the analytics? And it doesn't really yet, maybe it's just beginning to, do some of the monitoring, you know, with a dashboard, or that's a statement of direction. Would it be fair to say that you can layer on top of that, or substitute on top of it with all the analytics? And then when you want the really fancy analytic soup, you know, call out to Spark.
>> Sure. I would say that, for one thing, we definitely want to stay out of the analytics space. We think there are many great analytics tools out there, like Spark. We also are not a storage tool. In fact, we're queue-like, but we view ourselves more like this: if there's a pipe and a pump, we're the pump, and Kafka is the pipe. I think that from a monitoring perspective, we monitor Kafka indirectly. 'Cause if we know what's going in, and we know what's coming out later, we can give you the stats. And that's actually what's important.
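That indirect-monitoring idea, knowing what went in and what came out and inferring the health of the pipe in between, can be illustrated with a toy snapshot check. The stage names, record counts, and tolerance below are made up for illustration; they are not StreamSets metrics.

```python
# A toy illustration of indirect monitoring: compare throughput at adjacent
# stages and flag suspicious drops, without instrumenting the pipe itself.
from dataclasses import dataclass

@dataclass
class StageMetrics:
    name: str
    records_out: int  # records this stage emitted in the window

def check_pipeline(stages, tolerance=0.01):
    """Report any stage-to-stage loss larger than the tolerance."""
    alerts = []
    for upstream, downstream in zip(stages, stages[1:]):
        if upstream.records_out == 0:
            continue
        loss = 1 - downstream.records_out / upstream.records_out
        if loss > tolerance:
            alerts.append(f"{upstream.name} -> {downstream.name}: "
                          f"{loss:.1%} of records unaccounted for")
    return alerts

if __name__ == "__main__":
    snapshot = [
        StageMetrics("jdbc-origin", 120_000),
        StageMetrics("kafka-topic", 119_500),
        StageMetrics("hadoop-destination", 98_000),
    ]
    for alert in check_pipeline(snapshot):
        print(alert)
```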
This is actually one of the challenges of having sort of a home-grown or disconnected solution: stitching it all together so you understand the end-to-end is extremely difficult. 'Cause if you have a relational database, and a Kafka, and a Hadoop, and a Spark job, sure, you can monitor all those things. They all have their own UIs. But if you can't understand what the state is across the whole system, you're left with four windows open trying to figure out where things connect. And it's just too difficult.
>> So just from a positioning point of view, for someone who's trying to make sense out of all the choices they have, to what extent would you call yourself a management framework for someone who's building these pipelines, whether from scratch or buying components? And to what extent is it, I guess, when you talk about a pump, that would be almost like the runtime part of it.
>> Bryan: Yeah, yeah.
>> So you know, there's a control plane and then there's a data plane.
>> Bryan: Sure.
>> What's the mix?
>> Yeah, well, we do both for sure. I would say that the data plane for us is StreamSets Data Collector. We move data, we physically move the data. We have our own internal pipeline execution engine, so it doesn't presuppose any other existing technologies; it's not dependent on Hadoop or Spark or Kafka or anything. To some degree, Data Collector is also the control plane for small deployments, because it does give you start-to-stop command and control, some metrics monitoring, things like that. Now, people need to expand beyond the realm of a single data collector when they have enterprises with more than one business unit, or data center, or security zone, things like that. You don't just deploy one data collector, you deploy a bunch, dozens or hundreds. And in that case, that's where Dataflow Performance Manager again comes in, as that control plane. Now, Dataflow Performance Manager has no data in it. It does not pass your actual business data. But it does aggregate all of your metrics from all your data collectors and gives you a unified view across your whole enterprise.
>> And one more follow-up along those lines. When you have a multi-vendor stack, or a multi-vendor pipeline.
>> Bryan: Yeah.
>> What gives you the meta view?
>> Well, we're at the ins and outs. We see the interfaces. So in theory, someone could consume data out of Kafka and do something, right, and then there's another job later, like a Spark job.
>> George: Yeah.
>> So we don't have automatic visibility into that. But our plan in the future is to expand Dataflow Performance Manager to take third-party metric sources, effectively to broaden the view of your entire enterprise.
>> You've got a bunch of stuff on your website here which is kind of interesting, talking about some of the things we talked about. You know, taming data drift is one of your papers, the silent killer of data integrity. And some other good resources. So just in sort of closing, how do we learn more? What would you suggest?
>> Sure, yeah, please visit the website. The product is open source and free to download; Data Collector is free to download. I would encourage people to try it out. It's really easy to take for a spin. And if you love it, you should check out our community. We have a very active Slack channel and Google Group, which you can find from the website as well. And there's also a blog full of tutorials.
>> Yeah, well, you're solving gnarly problems that a lot of companies just don't want to deal with.
That's good. Thanks for doing the dirty work, we appreciate it.
>> Yeah, my pleasure.
>> Alright, Bryan, thanks for coming on "The Cube."
>> Thanks for having me.
>> Good to see you. You're welcome. Keep right there, buddy, we'll be back with our next guest. This is "The Cube," live from Spark Summit East in Boston, #SparkSummit. Right back.
>> Narrator: Since the dawn.

Published Date : Feb 9 2017

SUMMARY :

Dave Volante and George Gilbert interview Bryan Duxbury, vice president of engineering at StreamSets, at Spark Summit East 2017 in Boston. Duxbury explains that StreamSets builds data movement software: a graphical tool for building data pipelines without code, covering the first and last mile of analytical workflows so teams don't have to hand-code and maintain fragile connectors. He describes how StreamSets plugs in ahead of analysis tools like Spark, its Spark processor for ingest-time analytics, its intent-driven approach to schema drift and data quality, and Dataflow Performance Manager, the control plane that aggregates metrics across many data collectors to give an end-to-end view of pipelines.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Bryan | PERSON | 0.99+
Dave | PERSON | 0.99+
Dave Volante | PERSON | 0.99+
George Gilbert | PERSON | 0.99+
George | PERSON | 0.99+
Bryan Duxbury | PERSON | 0.99+
StreamSets | ORGANIZATION | 0.99+
first mile | QUANTITY | 0.99+
Boston, Massachusetts | LOCATION | 0.99+
dozens | QUANTITY | 0.99+
Spark | TITLE | 0.99+
500 tables | QUANTITY | 0.99+
first | QUANTITY | 0.99+
Google | ORGANIZATION | 0.99+
20 tables | QUANTITY | 0.99+
Kafka | TITLE | 0.99+
hundreds | QUANTITY | 0.99+
One | QUANTITY | 0.99+
more than one business unit | QUANTITY | 0.98+
Boston | LOCATION | 0.98+
a year | QUANTITY | 0.98+
Spark SQL | TITLE | 0.98+
today | DATE | 0.98+
first collection | QUANTITY | 0.98+
one | QUANTITY | 0.98+
a month later | DATE | 0.98+
both | QUANTITY | 0.98+
two | QUANTITY | 0.98+
SQL | TITLE | 0.98+
StreamSets | TITLE | 0.98+
Today | DATE | 0.97+
Databricks | ORGANIZATION | 0.97+
Spark Summit East | LOCATION | 0.97+
one data collector | QUANTITY | 0.97+
Boston Spark Summit | LOCATION | 0.97+
Spark Summit East 2017 | EVENT | 0.97+
Spark Summit East | EVENT | 0.96+
one step | QUANTITY | 0.96+
Cleveland | LOCATION | 0.95+
both perspectives | QUANTITY | 0.95+
StreamSet | ORGANIZATION | 0.95+
Slack | ORGANIZATION | 0.95+
Square | ORGANIZATION | 0.95+
Hadoop | TITLE | 0.94+
four windows | QUANTITY | 0.93+
first two | QUANTITY | 0.93+
Spark Summit | EVENT | 0.93+
single data collector | QUANTITY | 0.92+