Mark Grover & Jennifer Wu | Spark Summit 2017

>> Announcer: Live from San Francisco, it's the Cube covering Spark Summit 2017, brought to you by databricks. >> Hi, we're back here where the Cube is live, and I didn't even know it Welcome, we're at Spark Summit 2017. Having so much fun talking to our guests I didn't know the camera was on. We are doing a talk with Cloudera, a couple of experts that we have here. First is Mark Grover, who's a software engineer and an author. He wrote the book, "Dupe Application Architectures." Mark, welcome to the show. >> Mark: Thank you very much. Glad to be here. And just to his left we also have Jennifer Wu, and Jennifer's director of product management at Cloudera. Did I get that right? >> That's right. I'm happy to be here, too. >> Alright, great to have you. Why don't we get started talking a little bit more about what Cloudera is maybe introducing new at the show? I saw a booth over here. Mark, do you want to get started? >> Mark: Yeah, there are two exciting things that we've launched at least recently. There Cloudera Altus, which is for transient work loads and being able to do ETL-Like workloads, and Jennifer will be happy to talk more about that. And then there's Cloudera data science workbench, which is this tool that allows folks to use data science at scale. So, get away from doing data science in silos on your personal laptops, and do it in a secure environment on cloud. >> Alright, well, let's jump into Data Science Workbench first. Tell me a little bit more about that, and you mentioned it's for exploratory data science. So give us a little more detail on what it does. >> Yeah, absolutely. So, there was private beta for Cloudera Data Science Workbench earlier in the year and then it was GA a few months ago. And it's like you said, an exploratory data science tool that brings data science to the masses within an enterprise. Previously people used to have, it was this dichotomy, right? As a data scientist, I want to have the latest and greatest tools. I want to use the latest version of Python, the latest notebook kernel, and I want to be able to use R and Python to be able to crunch this data and run my models in machine learning. However, on the other side of this dichotomy are the IT organization of the organization, where if they want to make sure that all tools are compliant and that your clusters are secure, and your data is not going into places that are not secured by state of the art security solutions, like Kerberos for example, right? And of course if the data scientists are putting the data on their laptops and taking the laptop around to wherever they go, that's not really a solution. So, that was one problem. And the other one was if you were to bring them all together in the same solution, data scientists have different requirements. One may want to use Python 2.6. Another one maybe want to use 3.2, right? And so Cloudera Data Science Workbench is a new product that allows data scientists to visualize and do machine learning through this very nice notebook-like interface, share their work with the rest of their colleagues in the organization, but also allows you to keep your clusters secure. So it allows you to run against a Kerberized cluster, allows single sign on to your web interface to Data Science Workbench, and provides a really nice developer experience in the sense that My workflow and my tools and my version of Python does not conflict with Jennifer's version of Python. We all have our own docker and Kubernetes-based infrastructure that makes sure that we use the packages that we need, and they don't interfere with each other. We're going to go to Jennifer on Altus in just a few minutes, but George first give you a chance to maybe dig in on Data Science workshop. >> Two questions on the data science side: some of the really toughest nuts to crack have been Sort of a common environment for the collaborators, but also the ability to operationalize the models once you've sort of agreed on them, and manage the lifecycle across teams, you know? Like, challenger champion, promote something, or even before that doing the ab testing, and then sort of what's in production is typically in a different language from what, you know, it was designed in and sort of integrating it with the apps. Where is that on the road map? Cause no one really has a good answer for that. >> Yeah, that's an excellent question. In general I think it's the problem to crack these days. How do you productionalize something that was written by a data scientist in a notebook-like system onto the production cluster, right? And I think the part where the data scientist works in a different language than the language that's in production, I think that problem, the best I can say right now is to actually have someone rewrite that. Have someone rewrite that in the language you're going to make in production, right? I don't see that to be the more common part. I think the more widespread problem is even when the language is production, how do you go making the part that the data scientist wrote, the model or whatever that would be, into a prodution cluster? And so, Data Science Workbench in particular runs on the same cluster that is being managed by Cloudera manager, right? So this is a tool that you install, but that is available to you as a web server, as a web interface, and so that allows you to move your development machine learning algorithms from your data science workbench to production much more easier, because it's all running on the same hardware and same systems. There's no separate Cloudera managers that you have to use to manage the workbench compared to your actual cluster. >> Okay. A tangential question, but one of the, the difficulties of doing machine learning is finding all the training data and, and sort of data science expertise to sit with the domain expert to, you know, figure out proper model of features, things like that. One of the things we've seen so far from the cloud vendors is they take their huge datasets in terms of voice, you know, images. They do the natural language understanding, speech or rather text to speech, you know, facial recognition. Cause they have such huge datasets they can train on. We're hearing noises that they'd going to take that down to the more mundane statistical kind of machine learning algorithms, so that you wouldn't be, like, here's a algorithm to do churn, you know, go to town, but that they might have something that's already kind of pre-populated that you would just customize. Is that something that you guys would tackle, too? >> I can't speak for the road map in that sense, but I think some of that problem needs to be tackled by projects like Spark for example. So I think as the stack matures, it's going to raise the level of abstraction as time goes on. And I think whatever benefits Spark ecosystem will have will come directly to distributions like Cloudera. >> George: That's interesting. >> Yeah >> Okay >> Alright, well let's go to Jennifer now and talk about Altus a little bit. Now you've been on the Cube show before, right? >> I have not. >> Okay, well, familiar with your work. Tell us again, you're the product manager for Altus. What does it do, and what was the motivation to build it? >> Yeah, we're really excited about Cloudera Altus. So, we released Cloudera Altus in its first GA form in April, and we launched Cloudera Altus in a public environment in Strata London about two weeks ago, so we're really excited about this and we are very excited to now open this up to all of the customer base. And what it is is a platform as a service offering designed to leverage, basically, the agility and the scale of cloud, and make a very easy to use type of experience to expose Cloudera capacity for, in particular for data engineering type of workloads. So the end user will be able to very easily, in a very agile manner, get data engineering capacity on Cloudera in the cloud, and they'll be able to do things like ETL and large scale data processing, productionized machine learning workflows in the cloud with this new data engineering as a service experience. And we wanted to abstract away the cloud, and cluster operations, and make the end user a really, the end user experience very easy. So, jobs and workloads as first class objects. You can do things like submit jobs, clone jobs, terminate jobs, troubleshoot jobs. We wanted to make this very, very easy for the data engineering end user. >> It does sound like you've sort of abstracted away a lot of the infrastructure that you would associate with on-prem, and sort of almost make it, like, programmable and invisible. But, um, I guess my, one of my questions is when you put it in a cloud environment, when you're on-prem you have a certain set of competitors which is kind of restrictive, because you are the standalone platform. But when you go on the cloud, someone might say, "I want to use red shift on Amazon," or Snowflake, you know, as the MPP sequel database at the end of a pipeline. And it's not just, I'm using those as examples. There's, you know, dozens, hundreds, thousands of other services to choose from. >> Yes. >> What happens to the integrity of that platform if someone carves off one piece? >> Right. So, interoperability and a unified data pipeline is very important to us, so we want to make sure that we can still service the entire data pipeline all the way from ingest and data processing to analytics. So our team has 24 different open source components that we deliver in the CDH distribution, and we have committers across the entire stack. We know the application, and we want to make sure that everything's interoperable, no matter how you deploy the cluster. So if you deploy data engineering clusters through Cloudera Altus, but you deployed Impala clusters for data marks in the cloud through Cloudera Director or through any other format, we want all these clusters to be interoperable, and we've taken great pains in order to make everything work together well. >> George: Okay. So how do Altus and Sata Science Workbench interoperate with Spark? Maybe start with >> You want to go first with Altus? >> Sure, so, we, in terms of interoperability we focus on things like making sure there are no data silos so that the data that you use for your entire data lake can be consumed by the different components in our system, the different compute engines and different tools, and so if you're processing data you can also look at this data and visualize this data through Data Science Workbench. So after you do data ingestion and data processing, you can use any of the other analytic tools and then, and this includes Data Science Workbench. >> Right, and for Data Science Workbench runs, for example, with the latest version of Spark you could pick, the currently latest released version of Spark, Spark 2.1, Spark 2.2 is being boarded of course, and that will soon be integrated after its release. For example you could use Data Science Workbench with your flavor of Spark two's version and you can run PySpark or Scala jobs on this notebook-like interface, be able to share your work, and because you're using Spark Underneath the hood it uses yarn for resource management, the Data Science Workbench itself uses Docker for configuration management, and Kubernetes for resource managing these Docker containers. >> What would be, if you had to describe sort of the edge conditions and the sweet spot of the application, I mean you talked about data engineering. One thing, we were talking to Matei Zaharia and Ronald Chin about was, and Ali Ghodsi as well was if you put Spark on a database, or at least a, you know, sophisticated storage manager, like Kudu, all of a sudden there're a whole new class of jobs or applications that open up. Have you guys thought about what that might look like in the future, and what new applications you would tackle? >> I think a lot of that benefit, for example, could be coming from the underlying storage engine. So let's take Spark on Kudu, for example. The inherent characteristics of Kudu today allow you to do updates without having to either deal with the complexity of something like Hbase, or the crappy performance of dealing HDFS compactions, right? So the sweet spot comes from Kudu's capabilities. Of course it doesn't support transactions or anything like that today, but imagine putting something like Spark and being able to use the machine learning libraries and, we have been limited so far in the machine learning algorithms that we have implemented in Spark by the storage system sometimes, and, for example new machine learning algorithms or the existing ones could rewritten to make use of the update features for example, in Kudu. >> And so, it sounds like it makes it, the machine learning pipeline might get richer, but I'm not hearing that, and maybe this isn't sort of in the near term sort of roadmap, the idea that you would build sort of operational apps that have these sophisticated analytics built in, you know, where the analytics, um, you've done the training but at run time, you know, the inferencing influences a transaction, influences a decision. Is that something that you would foresee? >> I think that's totally possible. Again, at the core of it is the part that now you have one storage system that can do scans really well, and it can also do random reads and writes any place, right? So as your, and so that allows applications which were previously siloed because one appication that ran off of HDFS, another application that ran out of Hbase, and then so you had to correlate them to just being one single application that can use to train and then also use their trained data to then make decisions on the new transactions that come in. >> So that's very much within the sort of scope of imagination, or scope. That's part of sort of the ultimate plan? >> Mark: I think it's definitely conceivable now, yeah. >> Okay. >> We're up against a hard break coming up in just a minute, so you each get a 30-second answer here, so it's the same question. You've been here for a day and a half now. What's the most surprising thing you've learned that you thing should be shared more broadly with the Spark community? Let's start with you. >> I think one of the great things that's happening in Spark today is people have been complaining about latency for a long time. So if you saw the keynote yesterday, you would see that Spark is making forays into reducing that latency. And if you are interested in Spark, using Spark, it's very exciting news. You should keep tabs on it. We hope to deliver lower latency as a community sooner. >> How long is one millisecond? (Mark laughs) >> Yeah, I'm largely focused on cloud infrastructure and I found here at the conference that, like, many many people are very much prepared to actually start taking more, you know, more POCs and more interest in cloud and the response in terms of all of this in Altus has been very encouraging. >> Great. Well, Jennifer, Mark, thank you so much for spending some time here on the Cube with us today. We're going to come by your booth and chat a little bit more later. It's some interesting stuff. And thank you all for watching the Cube today here at Spark Summit 2017, and thanks to Cloudera for bringing us these two experts. And thank you for watching. We'll see you again in just a few minutes with our next interview.

Published Date : Jun 7 2017

SUMMARY :

covering Spark Summit 2017, brought to you by databricks. I didn't know the camera was on. And just to his left we also have Jennifer Wu, I'm happy to be here, too. Mark, do you want to get started? and being able to do ETL-Like workloads, and you mentioned it's for exploratory data science. And the other one was if you were to bring them all together and manage the lifecycle across teams, you know? and so that allows you to move your development machine the domain expert to, you know, I can't speak for the road map in that sense, and talk about Altus a little bit. to build it? on Cloudera in the cloud, and they'll be able to do things a lot of the infrastructure that you would associate with We know the application, and we want to make sure Maybe start with so that the data that you use for your entire data lake and you can run PySpark in the future, and what new applications you would tackle? or the existing ones could rewritten to make use the idea that you would build sort of operational apps Again, at the core of it is the part that now you have That's part of sort of the ultimate plan? that you thing should be shared more broadly So if you saw the keynote yesterday, you would see that and the response in terms of all of this on the Cube with us today.

ENTITIES

Entity	Category	Confidence
Jennifer	PERSON	0.99+
Mark Grover	PERSON	0.99+
Jennifer Wu	PERSON	0.99+
Ali Ghodsi	PERSON	0.99+
George	PERSON	0.99+
Mark	PERSON	0.99+
April	DATE	0.99+
Ronald Chin	PERSON	0.99+
San Francisco	LOCATION	0.99+
Matei Zaharia	PERSON	0.99+
30-second	QUANTITY	0.99+
Cloudera	ORGANIZATION	0.99+
Dupe Application Architectures	TITLE	0.99+
dozens	QUANTITY	0.99+
Python	TITLE	0.99+
yesterday	DATE	0.99+
Two questions	QUANTITY	0.99+
today	DATE	0.99+
Spark	TITLE	0.99+
Amazon	ORGANIZATION	0.99+
two experts	QUANTITY	0.99+
a day and a half	QUANTITY	0.99+
First	QUANTITY	0.99+
one problem	QUANTITY	0.99+
Python 2.6	TITLE	0.99+
Strata London	LOCATION	0.99+
one piece	QUANTITY	0.99+
first	QUANTITY	0.98+
Spark Summit 2017	EVENT	0.98+
Cloudera Altus	TITLE	0.98+
Scala	TITLE	0.98+
Docker	TITLE	0.98+
One	QUANTITY	0.97+
Kudu	ORGANIZATION	0.97+
one millisecond	QUANTITY	0.97+
PySpark	TITLE	0.96+
R	TITLE	0.95+
one	QUANTITY	0.95+
two weeks ago	DATE	0.93+
Data Science Workbench	TITLE	0.92+
Cloudera	TITLE	0.91+
hundreds	QUANTITY	0.89+
Hbase	TITLE	0.89+
each	QUANTITY	0.89+
24 different open source components	QUANTITY	0.89+
few months ago	DATE	0.89+
single	QUANTITY	0.88+
kernel	TITLE	0.88+
Altus	TITLE	0.88+

Recommend Videos

Sentiment Analysis

AWS Comprehend

Search Results for hyperflex: