Day One Wrap - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's the CUBE covering Spark Summit 2017, brought to you by Databricks. (energetic music plays) >> And what an exciting day we've had here at the CUBE. We've been at Spark Summit 2017, talking to partners, to customers, to founders, technologists, data scientists. It's been a load of information, right? >> Yeah, an overload of information. >> Well, George, you've been here in the studio with me talking with a lot of the guests. I'm going to ask you to maybe recap some of the top things you've heard today for our guests. >> Okay so, well, Databricks laid down, sort of, three themes that they wanted folks to take away: deep learning, Structured Streaming, and serverless. Now, deep learning is not entirely new to Spark, but they've dramatically improved their support for it. I think, going beyond the frameworks that were written specifically for Spark, like Deeplearning4j and BigDL by Intel, now TensorFlow, which is the open source framework from Google, has gotten much better support. Structured Streaming, it was not clear how much more news we were going to get, because it's been talked about for 18 months. And they really, really surprised a lot of people, including me, where they took, essentially, the processing time for an event or a small batch of events down to 1 millisecond, whereas before it was in the hundreds of milliseconds, if not higher. And that changes the type of apps you can build. And also, the Databricks guys have coined the term continuous apps, which means they operate on a never-ending stream of data, which is different from what we've had in the past, where it's batch or, with a user interface, request-response. So they definitely turned up the volume on what they can do with continuous apps. And serverless, they'll talk about more tomorrow, and Jim, I think, is going to weigh in. But it, basically, greatly simplifies the ability to run this infrastructure, because you don't think of it as a cluster of resources. You just know that it's sort of out there, and you make requests of it, and it figures out how to fulfill them. I will say, the other big surprise for me was when we had Matei, who's the creator of Spark and the chief technologist at Databricks, come on the show. We asked him about how Spark was going to deal with, essentially, more advanced storage of data, so that you could update things, so that you could get queries back, so that you could do analytics, and not just on stuff that's stored in Spark but on stuff that Spark stores essentially below it. And he said, "You know, you can expect to see Databricks come out with, or partner with, a database to do these advanced scenarios." And I got the distinct impression, after listening to the tape again, that he was talking about Apache Spark, which is separate from Databricks, doing some sort of key-value store. So in other words, when you look at competitors or quasi-competitors like Confluent with Kafka or data Artisans with Flink, they're not perfect competitors. They overlap some. Now Spark is pushing its way more into overlapping with some of those solutions. >> Alright. Well, Jim Kobielus. And thank you for that, George. You've been mingling with the masses today. (laughs) And you've been here all day as well. >> Educated masses, yeah, (David laughs) who are really engaged in this stuff, yes. >> Well, great, maybe give us some of your top takeaways after all the conversations you've had today. >> They're not all that dissimilar from George's.
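The 1-millisecond figure George cites corresponds to the continuous-processing mode Databricks previewed for Structured Streaming, which surfaces in the API as a trigger setting. A minimal sketch of a "continuous app" in PySpark follows; the Kafka broker address and topic names are invented for illustration, and continuous mode supports only map- and filter-style operations, not aggregations:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("continuous-app").getOrCreate()

# A never-ending stream of events, not a fixed batch.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Per-record work only: select, cast, filter.
alerts = (events
          .select(col("key").cast("string"), col("value").cast("string"))
          .filter(col("value").contains("ERROR")))

# The continuous trigger is what moves per-event latency toward the
# millisecond range; the interval below is the checkpoint cadence,
# not the latency.
query = (alerts.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "alerts")
         .option("checkpointLocation", "/tmp/alerts-ckpt")
         .trigger(continuous="1 second")
         .start())

query.awaitTermination()
```

Swapping that `continuous=` trigger for the default micro-batch trigger is, at the API level, the whole difference between the old hundreds-of-milliseconds behavior and the new path.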
Databricks, of course, is the center, the developer, the primary committer in the Spark open source community. They've done a number of very important things in terms of the announcements today at this event that push Spark, the Spark ecosystem, where it needs to go to expand the range of capabilities and their deployability into production environments. I feel the deep-learning announcement, in terms of the deep-learning pipeline API, is very, very important. Now, as George indicated, Spark has been used in a fair number of deep-learning development environments, but not as a modeling tool so much as a training tool, a tool for in-memory distributed training of deep-learning models that were developed in TensorFlow, in Caffe, and other frameworks. Now this announcement is essentially bringing support for deep learning directly into the Spark modeling pipeline, the machine-learning modeling pipeline, being able to call out to deep learning, you know, TensorFlow and so forth, from within MLlib. That's very important. That means that Spark developers, of which there are many, far more than there are TensorFlow developers, will now have an easy path to bring more deep learning into their projects. That's critically important to democratize deep learning. From what I've seen and what Databricks has indicated, they currently have API support reaching out to both TensorFlow and Keras, and they have plans to bring in API support for access to other leading DL toolkits such as Caffe, Caffe2, which is Facebook-developed, such as MXNet, which is Amazon-backed, and so forth. That's very encouraging. Structured Streaming is very important in terms of what they announced, which is an API to enable access to faster, or higher-throughput, Structured Streaming in their cloud environment. And they also announced that they have gone beyond, in terms of the code that they've built, the micro-batch architecture of Structured Streaming, to enable it to evolve into a more true streaming environment, to be able to contend credibly with the likes of Flink. 'Cause I think that the Spark community has, sort of, had their back against the wall with Structured Streaming, in that they couldn't provide a true sub-millisecond end-to-end latency environment heretofore. But it sounds like with this R&D that Databricks is addressing that, and that's critically important for the Spark community to continue to evolve in terms of continuous computation. And then the serverless-apps announcement is also very important, 'cause I see it as really being a fully managed, multi-tenant Spark development environment, an enabler for continuous build, deploy, and test DevOps within a Spark machine-learning and now deep-learning context. The Spark community, as it evolves and matures, needs robust DevOps tools to productionize these machine-learning and deep-learning models. Because really, in many ways, many customers, many developers are now using, or developing, Spark applications that are real 24-by-7 enterprise application artifacts that need a robust DevOps environment. And I think that Databricks has indicated they know where this market needs to go and they're pushing it with R&D. And I'm encouraged by all those signs. >> So, great. Well thank you, Jim. I hope both you gentlemen are looking forward to tomorrow. I certainly am. >> Oh yeah. >> And to you out there, tune in again around 10:00 a.m. Pacific Time. We're going to be broadcasting live here.
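Before day two, it's worth pinning down the deep-learning pipeline API Jim highlights. It shipped as Databricks' open source spark-deep-learning package (`sparkdl`), and a hedged sketch of the pattern looks like this: a pre-trained TensorFlow/Keras network acts as a featurization stage inside an ordinary MLlib pipeline. The image paths and labels below are invented, and the calls reflect the early `sparkdl` 0.x surface on Spark 2.3:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.image import ImageSchema      # Spark 2.3+
from sparkdl import DeepImageFeaturizer      # Databricks spark-deep-learning

spark = SparkSession.builder.appName("dl-pipeline").getOrCreate()

# Two invented directories of labeled images.
ok = ImageSchema.readImages("/data/images/intact").withColumn("label", lit(0.0))
bad = ImageSchema.readImages("/data/images/damaged").withColumn("label", lit(1.0))
train_df = ok.union(bad)

# The deep-learning call-out: a pre-trained InceptionV3 network turns
# each image into a feature vector, as one stage among MLlib stages.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
classifier = LogisticRegression(labelCol="label", featuresCol="features")

# fit() runs transfer learning end to end; no separate TensorFlow job.
model = Pipeline(stages=[featurizer, classifier]).fit(train_df)
```

The democratization point is visible in the shape of the code: to a Spark developer, this is just another two-stage pipeline.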
From Spark Summit 2017, I'm David Goad with Jim and George, saying goodbye for now. And we'll see you in the morning. (sparse percussion music playing) (wind humming and waves crashing).
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jim Kobielus | PERSON | 0.99+ |
Jim | PERSON | 0.99+ |
George | PERSON | 0.99+ |
David | PERSON | 0.99+ |
David Goad | PERSON | 0.99+ |
San Francisco | LOCATION | 0.99+ |
Matei | PERSON | 0.99+ |
tomorrow | DATE | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Databricks | ORGANIZATION | 0.99+ |
hundreds | QUANTITY | 0.99+ |
Spark | TITLE | 0.99+ |
both | QUANTITY | 0.98+ |
Intel | ORGANIZATION | 0.98+ |
Spark Summit 2017 | EVENT | 0.98+ |
18 months | QUANTITY | 0.98+ |
Flink | ORGANIZATION | 0.97+ |
Confluent Kafka | ORGANIZATION | 0.97+ |
Caffe | ORGANIZATION | 0.96+ |
today | DATE | 0.96+ |
TensorFlow | TITLE | 0.94+ |
three themes | QUANTITY | 0.94+ |
10:00 a.m. Pacific Time | DATE | 0.94+ |
CUBE | ORGANIZATION | 0.94+ |
Deeplearning4j | TITLE | 0.94+ |
Spark | ORGANIZATION | 0.93+ |
1 millisecond | QUANTITY | 0.93+ |
Keras | ORGANIZATION | 0.91+ |
Day One | QUANTITY | 0.81+ |
BigDL | TITLE | 0.79+ |
TensorFlow | ORGANIZATION | 0.79+ |
7 | QUANTITY | 0.77+ |
MLlib | TITLE | 0.73+ |
Caffe 2 | ORGANIZATION | 0.7+ |
Caffe | TITLE | 0.7+ |
24- | QUANTITY | 0.68+ |
MXNet | ORGANIZATION | 0.67+ |
Apache Spark | ORGANIZATION | 0.54+ |
Arik Pelkey, Pentaho - BigData SV 2017 - #BigDataSV - #theCUBE
>> Announcer: Live from San Jose, California, it's the Cube covering Big Data Silicon Valley 2017. >> Welcome back, everyone. We're here live in Silicon Valley in San Jose for Big Data SV, in conjunction with Strata + Hadoop, part two. Three days of coverage here in Silicon Valley on Big Data. It's our eighth year covering Hadoop and the Hadoop ecosystem, now expanding beyond just Hadoop into AI, machine learning, IoT, and cloud computing, where all this compute is really making it happen. I'm John Furrier with my co-host George Gilbert. Our next guest is Arik Pelkey, who is the senior director of product marketing at Pentaho, which we've covered many times, including their event, PentahoWorld. Thanks for joining us. >> Thank you for having me. >> So, in following you guys, I see Pentaho was once an independent company, bought by Hitachi, but still an independent group within Hitachi. >> That's right, very much so. >> Okay, so you guys have some news. Let's just jump into the news. You guys announced some of the machine learning. >> Exactly, yeah. So, Arik Pelkey, Pentaho. We are a data integration and analytics software company. You mentioned you've been doing this for eight years. We have been at Big Data for the past eight years as well. In fact, we're one of the first vendors to support Hadoop back in the day, so we've been along for the journey ever since then. What we're announcing today is really exciting. It's a set of machine learning orchestration capabilities, which allows data scientists, data engineers, and data analysts to really streamline their data science processes. Everything from ingesting new data sources through data preparation, feature engineering, which is where a lot of data scientists spend their time, through tuning their models, which can still be programmed in R, in Weka, in Python, and any other kind of data science tool of choice. What we do is we help them deploy those models as a step inside of Pentaho, and then we help them update those models as time goes on. So, really what this is doing is it's streamlining. It's making them more productive so that they can focus their time on things like model building rather than data preparation and feature engineering. >> You know, it's interesting. The market is really active right now around machine learning, and even just last week at Google Next, which is their cloud event, they made the acquisition of Kaggle, which is kind of an open data science community. You mentioned the three categories: data engineer, data science, data analyst. Almost on a progression, super geek to business facing, and there's different approaches. One of the comments from the CEO of Kaggle on the acquisition, when we wrote it up at SiliconANGLE, was, and I found this fascinating, I want to get your commentary and reaction to it: he says the data science tools are as early-stage as software development tools were generations ago, meaning that all the advances in open source and tooling for software development are far along, but data science is still at that early stage and is going to get better. So, what's your reaction to that? Because this is really the demand we're seeing: a lot of heavy lifting going on in the data science world, yet there's a lot of runway of more stuff to do. What is that more stuff? >> Right. Yeah, we're seeing the same thing. Last week I was at the Gartner Data and Analytics conference, and that was kind of the take there from one of their lead machine learning analysts: this is still really early days for data science software.
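Pentaho's own orchestration is a visual, drag-and-drop product, so there is no public code to quote, but the workflow Arik lists maps cleanly onto steps any Spark shop would recognize. A hedged sketch of that shape in plain PySpark, where every path and column name is invented:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.appName("ml-orchestration").getOrCreate()

# 1. Ingest a new data source.
raw = spark.read.parquet("/data/customers")

# 2. Data preparation and feature engineering.
indexer = StringIndexer(inputCol="region", outputCol="region_ix")
assembler = VectorAssembler(inputCols=["region_ix", "tenure", "spend"],
                            outputCol="features")

# 3. Model tuning happens here (R, Weka, Python, or MLlib); a
#    gradient-boosted tree stands in for whatever the scientist chose.
gbt = GBTClassifier(labelCol="churned", featuresCol="features")

# 4. "Deploy the model as a step": persist the whole fitted pipeline.
model = Pipeline(stages=[indexer, assembler, gbt]).fit(raw)
model.write().overwrite().save("/models/churn")

# 5. "Update those models as time goes on": a later job reloads the
#    step and scores, or refits on fresher data.
scoring_step = PipelineModel.load("/models/churn")
scored = scoring_step.transform(spark.read.parquet("/data/new_customers"))
```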
So, there's a lot of Apache projects out there. There's a lot of other open source activity going on, but there are very few vendors that bring to the table an integrated kind of full platform approach to the data science workflow, and that's what we're bringing to market today. Let me be clear, we're not trying to replace R, or Python, or MLlib, because those are the tools of the data scientists. They're not going anywhere. They spent eight years in their PhD programs working with these tools. We're not trying to change that. >> They're fluent with those tools. >> Very much so. They're also spending a lot of time doing feature engineering. Some research reports say between 70 and 80% of their time. What we bring to the table is a visual drag and drop environment to do feature engineering in a much faster, more efficient way than before. So, there's a lot of different kinds of disparate, siloed applications out there that all do interesting things on their own, but what we're doing is we're trying to bring all of those together. >> And the trends are: reduce the time it takes to do stuff and take away some of those tasks that you can use machine learning for. What unique capabilities do you guys have? Talk about that for a minute, just what Pentaho is doing that's unique and added value to those guys. >> So, the big thing is I keep going back to the data preparation part. I mean, that's 80% of the time, and that's still a really big challenge. There's other vendors out there that focus on just the data science kind of workflow, but where we're really unique is around being able to accommodate very complex data environments, and being able to onboard data. >> Give me an example of those environments. >> Geospatial data combined with data from your ERP or your CRM system, in all kinds of different formats. So, there might be 15 different data formats that need to be blended together and standardized before any of that can really happen. That's the complexity in the data. So, Pentaho, very consistent with everything else that we do outside of machine learning, is all about helping our customers solve those very complex data challenges before doing any kind of machine learning. One example is one customer called Caterpillar Marine Asset Intelligence. So, they're doing predictive maintenance onboard container ships and ferries. So, they're taking data from hundreds and hundreds of sensors onboard these ships, combining that kind of operational sensor data together with geospatial data, and then they're serving up predictive maintenance alerts, if you will, or giving signals when it's time to replace an engine or replace a compressor or something like that. >> Versus waiting for it to break. >> Versus waiting for it to break, exactly. That's one of the real differentiators: that very complex data environment. And then I was starting to move toward the other differentiator, which is our end to end platform, which allows customers to deliver these analytics in an embedded fashion. So, kind of full circle, being able to send that signal, but not to an operational system, which is sometimes a challenge because you might have to rewrite the code. Deploying models is a really big challenge. Within Pentaho, because it is this fully integrated application, you can deploy the models and not have to jump out into a mainframe environment or something like that.
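The blending step Arik keeps returning to is concrete enough to sketch. Here is a toy version of the predictive-maintenance preparation he describes, joining raw sensor readings to reference data and deriving a rolling feature; the schemas, paths, and alert threshold are all invented for illustration:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("predictive-maintenance").getOrCreate()

# Two of the "15 different formats": JSON sensor feeds and a CSV
# equipment register.
sensors = spark.read.json("/data/ship_sensors")
equipment = spark.read.csv("/data/equipment.csv", header=True)

# Blend: standardize the join key, then combine the sources.
joined = (sensors
          .withColumn("engine_id", F.upper(F.col("engine_id")))
          .join(equipment, "engine_id"))

# Feature engineering: mean vibration over the last 10 readings
# per engine.
w = Window.partitionBy("engine_id").orderBy("ts").rowsBetween(-9, 0)
features = joined.withColumn("vib_avg10", F.avg("vibration").over(w))

# A crude stand-in for the trained model: flag engines trending hot.
alerts = features.where(F.col("vib_avg10") > 90.0)
alerts.select("engine_id", "ts", "vib_avg10").show()
```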
So, I'd say the differentiators are very complex data environments, and then this end to end approach where deploying models is much easier than ever before. >> Perhaps, let's talk about alternatives that customers might see. You have a tool suite, and others might have to put together a suite of tools. Maybe tell us about some of those; the geeky version would be the impedance mismatch. You know, like the chasms you'd find between each tool, where you have to glue them together. So what are some of those pitfalls? >> One of the challenges is, you have these data scientists working in silos oftentimes. You have data analysts working in silos, you might have data engineers working in silos. One of the big pitfalls is not really collaborating enough to the point where they can do all of this together. So, that's a really big area where we see pitfalls. >> Is it binary, not collaborating, or is it that the round trip takes so long that the quality or number of collaborations is so drastically reduced that the output is of lower quality? >> I think it's probably a little bit of both. I think they want to collaborate, but one person might sit in Dearborn, Michigan and the other person might sit in Silicon Valley, so there's just a location challenge as well. The other challenge is, some of the data analysts might sit in IT and some of the data scientists might sit in an analytics department somewhere, so it kind of cuts across both location and functional area too. >> So let me ask from the point of view of, you know, we've been doing these shows for a number of years, and most people have their first data lakes up and running and their first maybe one or two use cases in production. Very sophisticated customers have done more, but what seems to be clear is the highest value coming from those projects isn't to put a BI tool in front of them so much as to do advanced analytics on that data, and apply those analytics to inform a decision, whether by a person or a machine. >> That's exactly right. >> So, how do you help customers over that hump, and what are some other examples that you can share? >> Yeah, so speaking of transformative. I mean, that's what machine learning is all about. It helps companies transform their businesses. We like to talk about that at Pentaho. One customer, kind of industry example, that I'll share is a company called IMS. IMS is in the business of providing data and analytics to insurance companies so that the insurance companies can price insurance policies based on usage. So, it's a usage model. So, IMS has a technology platform where they put sensors in a car, and then, using your mobile phone, can track your driving behavior. Then, your insurance premium that month reflects the driving behavior that you had during that month. In terms of transformative, this is completely upending the insurance industry, which has always had a very fixed approach to pricing risk. Now, they understand everything about your behavior. You know, are you turning too fast? Are you braking too fast? And they're taking it further than that too. They're able to now do kind of a retroactive look at an accident. So, after an accident, they can go back and kind of decompose what happened in the accident and determine whether it was your fault or whether it was in fact the ice on the street. So, transformative? I mean, this is just changing things in a really big way. >> I want to get your thoughts on this. I'm just looking at some of the research. You know, we always have the good data, but there's also other data out there.
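Before the research numbers, the IMS model is easy to make concrete: a month of raw phone and sensor events per driver reduces to a handful of behavior features and a pricing factor. Everything below, the schema, the thresholds, the weights, is invented to show the shape of the computation, not IMS's actual model:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("telematics").getOrCreate()

# Raw events: (driver_id, ts, speed_mph, accel_g) from the in-car
# sensor and phone.
events = spark.read.json("/data/telematics/2017-02")

features = events.groupBy("driver_id").agg(
    F.avg("speed_mph").alias("avg_speed"),
    # Hard-braking events (strong deceleration) per 1,000 readings.
    (F.sum(F.when(F.col("accel_g") < -0.4, 1).otherwise(0)) * 1000.0 /
     F.count("*")).alias("hard_brakes_per_1k"),
    # Share of readings spent above 80 mph.
    F.avg(F.when(F.col("speed_mph") > 80, 1.0).otherwise(0.0))
     .alias("pct_time_speeding"))

# A transparent pricing rule standing in for a trained risk model:
# that month's premium factor tracks that month's behavior.
priced = features.withColumn(
    "premium_factor",
    1.0 + 0.02 * F.col("hard_brakes_per_1k")
        + 0.5 * F.col("pct_time_speeding"))

priced.show()
```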
In your news, 92% of organizations plan to deploy more predictive analytics; however, 50% of organizations have difficulty integrating predictive analytics into their information architecture, which is what the research shows. So my question to you is, there's a huge gap between the technology landscapes of front end BI tools and complex data integration tools. That seems to be the sweet spot where the value's created. So, you have the demand, and then front end BI's kind of sexy and cool. Wow, I could power my business, but the complexity is really hard in the backend. Who's accessing it? What's the data sources? What's the governance? All these things are complicated, so how do you guys reconcile the front end BI tools and the backend complexity of integrations? >> Our story from the beginning has always been this one integrated platform, both for complex data integration challenges together with visualizations, and that's very similar to what this announcement is all about for the data science market. We're very much in line with that. >> So, is it the cart before the horse? Is it like the BI tools are really driven by the data? I mean, it makes sense that the data has to be key. Front end BI could be easy if you have one data set. >> It's funny you say that. I presented at the Gartner conference last week, and my topic was: this just in, it's not about analytics. Kind of in jest, but it drew a really big crowd. So, it's about the data, right? It's about solving the data problem before you solve the analytics problem, whether it's a simple visualization or a complex fraud machine learning problem. It's about solving the data problem first. To that quote, I think one of the things they were referencing was the challenging information architectures into which companies are trying to deploy models, and so part of that is, when you build a machine learning model, you use R and Python and all these other ones we're familiar with. In order to deploy that into a mainframe environment, someone has to then recode it in C++ or COBOL or something else. That can take a really long time. With our integrated approach, once you've done the feature engineering and the data preparation using our drag and drop environment, what's really interesting is that you're like 90% of the way there in terms of making that model production ready. So, you don't have to go back and change all that code; it's already there because you used it in Pentaho. >> So obviously for those two technology groups I just mentioned, I think you had a good story there, but it creates problems. You've got product gaps, you've got organizational gaps, you have process gaps between the two. Are you guys going to solve that, or are you currently solving that today? There's a lot of little questions in there, but that seems to be the disconnect. You know, I can do this, I can do that, do I do them together? >> I mean, sticking to my story of one integrated approach to being able to do the entire data science workflow, from beginning to end, that's where we've really excelled, to the extent that more and more data engineers and data analysts and data scientists can get on this one platform, even if they're using R and Weka and Python. >> You guys want to close those gaps down, that's what you guys are doing, right? >> We want to make the process more collaborative and more efficient. >> So Dave Vellante has a question on CrowdChat for you. Dave Vellante was in the snowstorm in Boston.
Dave, good to see you, hope you're doing well shoveling out the driveway. Thanks for coming in digitally. His question is: HDS has been known for mainframes and storage, but Hitachi is an industrial giant. How is Pentaho leveraging Hitachi's IoT chops? >> Great question, thanks for asking. Hitachi acquired Pentaho about two years ago; this is before my time. I've been with Pentaho for about ten months. One of the reasons that they acquired Pentaho is because of a platform that they've announced called Lumada, which is their IoT platform, and Pentaho is the analytics engine that drives that IoT platform, Lumada. So, Lumada is about solving more of the hardware and sensor side, bringing data from the edge in to be able to do the analytics. So, it's an incredibly great partnership between Lumada and Pentaho. >> Makes an internal customer too. >> It's a 90-billion-dollar conglomerate, so yeah, the acquisition's been great, and we're still very much an independent company going to market on our own, but we now have a much larger channel through Hitachi's reps around the world. >> You've got an IoT use case right there in front of you. >> Exactly. >> But you are leveraging it big time, that's what you're saying? >> Oh yeah, absolutely. We're a very big part of their IoT strategy. It's the analytics. Both of the examples that I shared with you are in fact IoT, not by design, but because there's a lot of demand. >> You guys seeing a lot of IoT right now? >> Oh yeah. We're seeing a lot of companies coming to us who have just hired a director or vice president of IoT to go out and figure out the IoT strategy. A lot of these are manufacturing companies or coming from industries that are inefficient. >> Digitizing the business model. >> The other point about Hitachi that I'll make, as it relates to data science, is that in a 90-billion-dollar manufacturing and otherwise giant, we have a very deep bench of PhD data scientists that we can go to when there are very complex data science problems to solve at customer sites. So, if a customer's struggling with some of the basics of how to get up and running doing machine learning, we can bring our bench of data scientists at Hitachi to bear in those engagements, and that's a really big differentiator for us. >> Just to be clear, and one last point: you've talked about how you handle the entire lifecycle of modeling, from acquiring the data and prepping it all the way through to building a model, deploying it, and updating it, which is a continuous process. I think as we've talked about before, data scientists, or just the DevOps community, have had trouble operationalizing the end of the model lifecycle, where you deploy it and update it. Tell us how Pentaho helps with that. >> Yeah, it's a really big problem, and it's a very simple solution inside of Pentaho. It's basically a step inside of Pentaho. So, in the case of fraud, let's say, for example, a prediction might say fraud, not fraud, fraud, not fraud, whatever it is. We can then bring that kind of full lifecycle back into the data workflow at the beginning. It's a simple drag and drop step inside of Pentaho to say which were right and which were wrong, and feed that back into the next prediction. We could also take it one step further, where there has to be a manual part of this too, where it goes to the customer service center; they investigate and they say yes fraud, no fraud, and then that gets funneled back into the next prediction.
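That feedback step is the part DevOps teams most often hand-roll, so it is worth seeing in miniature. A hedged sketch of the loop Arik describes, in PySpark: score, collect adjudicated verdicts, fold them back into the training data, refit. The paths, columns, and two-feature model are all invented:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("fraud-feedback").getOrCreate()

# 1. Score today's transactions with the current model.
model = PipelineModel.load("/models/fraud")
today = spark.read.parquet("/data/transactions/today")
model.transform(today).select("tx_id", "prediction") \
     .write.mode("append").parquet("/data/predictions")

# 2. Later, verdicts come back: automatic checks plus manual
#    call-center adjudications, as (tx_id, label) rows.
verdicts = spark.read.parquet("/data/verdicts")

# 3. Funnel them back: newly labeled rows join the training set
#    (assumed here to share the same schema).
training = (spark.read.parquet("/data/training")
            .unionByName(today.join(verdicts, "tx_id")))

# 4. Refit the same pipeline on the enlarged set, swap the model in.
assembler = VectorAssembler(inputCols=["amount", "merchant_risk"],
                            outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
Pipeline(stages=[assembler, lr]).fit(training) \
    .write().overwrite().save("/models/fraud")
```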
So yeah, it's a big challenge, and it's something that's relatively easy for us to do just as part of the data science workflow inside of Pentaho. >> Well Arik, thanks for coming on The Cube. We really appreciate it. Good luck with the rest of the week here. >> Yeah, very exciting. Thank you for having me. >> You're watching The Cube here live in Silicon Valley covering Strata Hadoop, and of course our Big Data SV event. We also have a companion event called Big Data NYC. We program with O'Reilly Strata Hadoop, and of course have been covering Hadoop really since it was founded. This is The Cube, I'm John Furrier, with George Gilbert. We'll be back with more live coverage today for the next three days here inside The Cube after this short break.
ENTITIES
Entity | Category | Confidence |
---|---|---|
George Gilbert | PERSON | 0.99+ |
Hitachi | ORGANIZATION | 0.99+ |
Dave Alonte | PERSON | 0.99+ |
Pentaho | ORGANIZATION | 0.99+ |
Dave | PERSON | 0.99+ |
90% | QUANTITY | 0.99+ |
Arik Pelkey | PERSON | 0.99+ |
Boston | LOCATION | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
Hitatchi | ORGANIZATION | 0.99+ |
John Furrier | PERSON | 0.99+ |
one | QUANTITY | 0.99+ |
50% | QUANTITY | 0.99+ |
eight years | QUANTITY | 0.99+ |
Arick | PERSON | 0.99+ |
One | QUANTITY | 0.99+ |
Lumata | ORGANIZATION | 0.99+ |
Last week | DATE | 0.99+ |
two technologies | QUANTITY | 0.99+ |
15 different data formats | QUANTITY | 0.99+ |
first | QUANTITY | 0.99+ |
92% | QUANTITY | 0.99+ |
One example | QUANTITY | 0.99+ |
Both | QUANTITY | 0.99+ |
Three days | QUANTITY | 0.99+ |
Python | TITLE | 0.99+ |
Kaggle | ORGANIZATION | 0.99+ |
one customer | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
eighth year | QUANTITY | 0.99+ |
last week | DATE | 0.99+ |
Santa Fe, California | LOCATION | 0.99+ |
two | QUANTITY | 0.99+ |
each tool | QUANTITY | 0.99+ |
90 billion dollar | QUANTITY | 0.99+ |
80% | QUANTITY | 0.99+ |
Caterpillar | ORGANIZATION | 0.98+ |
both | QUANTITY | 0.98+ |
NYC | LOCATION | 0.98+ |
first data | QUANTITY | 0.98+ |
Pentaho | LOCATION | 0.98+ |
San Jose | LOCATION | 0.98+ |
The Cube | TITLE | 0.98+ |
Big Data SV | EVENT | 0.97+ |
COBOL | TITLE | 0.97+ |
70 | QUANTITY | 0.97+ |
C++ | TITLE | 0.97+ |
IMS | TITLE | 0.96+ |
MLlib | TITLE | 0.96+ |
one person | QUANTITY | 0.95+ |
R | TITLE | 0.95+ |
Big Data | EVENT | 0.95+ |
Gardener Data and Analytics | EVENT | 0.94+ |
Gardner | EVENT | 0.94+ |
Strata Hadoop | TITLE | 0.93+ |
Mike Gualtieri, Forrester Research - Spark Summit East 2017 - #sparksummit - #theCUBE
>> Narrator: Live from Boston, Massachusetts, this is the Cube, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to Boston, everybody, where the town is still euphoric. Mike Gualtieri is here; he's a principal analyst at Forrester Research and attended the parade yesterday. How great was that, Mike? >> Yes. Yes. It was awesome. >> Nothing like we've ever seen before. All right, the first question is: what was the bigger shocking surprise, upset, greatest win? Was it the Red Sox over the Yankees, or was it the Super Bowl this weekend? >> That's the question. I think it's the Super Bowl. >> Yeah, who knows, right? Who knows. It was a lot of fun. So how was the parade yesterday? >> It was magnificent. I mean, it was freezing. No one cared. I mean--but it was, yeah, it was great. Great to see that team in person. >> That's good, wish we could talk-- We can, but we'll get into it. So, we're here at Spark Summit, and, you know, the show's getting bigger, you're seeing more sponsors, still heavily a technical audience, but what's your take these days? We were talking off-camera about the whole big data thing. It used to be the hottest thing in the world, and now nobody wants to have big data in their title. What's Forrester's take on that? >> I mean, I think big data-- I think it's just become mainstream, so we're just back to data. You know, because all data is potentially big. So, I don't think it's-- it's not the thing anymore. I mean, what do you do with big data? You analyze it, right? And part of what this whole Spark Summit is about-- look at all the sessions. Data science, machine learning, streaming analytics, so it's all about sort of using that data now. So big data is still important, but the value of big data comes from all this advanced analytics. >> Yeah, and we talked earlier, I mean, a lot of the value of, you know, Hadoop was cutting costs. You know, you've mentioned commodity components and a reduction in the denominator, and breaking the need for some kind of big storage container. OK, so that-- we got there. Now, shifting to new sources of value, what are you spending your time on these days in terms of research? >> Artificial intelligence, machine learning, so those are really forms of advanced analytics, so that's been-- that's been very hot. We did a survey last year, an AI survey, and we asked a large group of people, we said, oh, you know, what are you doing with AI? 58% said they're researching it. 19% said they're training a model. Right, so that's interesting. 58% are researching it, and far fewer are actually, you know, actually doing something with it. Now, the reality is, if you phrased that a little bit differently, and you said, oh, what are you doing with machine learning? Many more would say yes, we're doing machine learning. So it begs the question: what do enterprises think of AI? And what do they think it is? So, a lot of my inquiries are spent helping enterprises understand what AI is, what they should focus on, and the other part of it is what are the technologies used for AI, and deep learning is the hottest. >> So, you wrote a piece late last year, what's possible today in AI. What's possible today in AI? >> Well, you know, before understanding what's possible, it's important to understand what's not possible, right? And so we sort of characterize it as: there's pure AI, and there's pragmatic AI. So it's real simple.
Pure AI is the sci-fi stuff, we've all seen it, Ex Machina, Star Wars, whatever, right? That's not what we're talking about. That's not what enterprises can do today. We're talking about pragmatic AI, and pragmatic AI is about building predictive models. It's about conversational APIs, to interact in a natural way with humans. It's about image analysis, which is something very hot because of deep learning. So, AI is really about the building blocks that companies have been using, but then using them in combination to create even more intelligent solutions. And they have more options on the market, both from open source and from cloud services, from Google, Microsoft, IBM, and now Amazon, at their re-- Were you guys at their re:Invent conference? >> I wasn't, personally, but we were certainly there. >> Yeah, they announced Amazon AI, which is a set of three services that developers can use without knowing anything about AI or being a data scientist. But, I mean, I think the way to think about AI is that it is data science. It requires the expertise of a data scientist to do AI. >> Following up on that comment, which was really interesting: vendors are trying to democratize access to machine learning and AI, and I say that with two terms because usually the machine learning is the stuff that's sort of widely accessible and AI is a little further out. But there's a spectrum. At one end you can just access an API, which is like a pre-trained model-- >> Pre-trained model, yep. >> It's developer-accessible, you don't need to be a data scientist. And then at the other end, you know, you need to pick your algorithms, you need to pick your features, you need to find the right data. So how do you see that horizon moving over time? >> Yeah, no, I-- So, these machine learning services, as you say, they're pre-trained models, totally accessible by anyone; anyone who can call an API or a restful service can access these. But their scope is limited, right? So, if, for example, you take the image API, you know, the imaging API that you can get from Google or now Amazon, you can drop an image in there and it will say, oh, there's a wine bottle on a picnic table on the beach. Right? It can identify that. So that's pretty cool, there might be a lot of use cases for that, but think of an enterprise use case. No. You can't do it, and let me give you this example. Say you're an insurance company, and you have a picture of a steel roof that's caved in. If you give that to one of these APIs, it might say steel roof, it may say damage, but what it's not going to do is it's not going to be able to estimate the damage, it's not going to be able to create a bill of materials on how to repair it, because Google hasn't trained it at that level. OK, so, enterprises are going to have to do this themselves, or an ISV is going to have to do it, because think about it: you've got 10 years' worth of all these pictures taken of damage. And with all of those pictures, you've got tons of write-ups from an adjuster. Whoa, if you could shove that into a deep learning algorithm, you could potentially have consumers take pictures, or someone untrained, and have this thing say here's what the estimated damage is, this is the situation.
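Mike's steel-roof example marks the line between a pre-trained API and a model trained on your own labels. A hedged sketch of the enterprise side, written in modern Keras idiom rather than anything from 2017: reuse a generic pre-trained vision backbone, then train a new head on claim photos sorted by damage severity. The directory layout, class names, and sizes are all invented:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# The generic part, roughly what a cloud vision API gives you: a
# network pre-trained on everyday objects, kept frozen.
base = MobileNetV2(input_shape=(224, 224, 3), include_top=False,
                   weights="imagenet", pooling="avg")
base.trainable = False

# The part no generic API has: a head trained on your own labels,
# here three damage-severity classes distilled from adjuster write-ups.
model = models.Sequential([
    base,
    layers.Dense(128, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Assumed layout: /data/claims/{minor,moderate,severe}/, filled from
# 10 years of claim photos; labels are inferred from folder names.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "/data/claims", image_size=(224, 224), batch_size=32)

model.fit(train_ds, epochs=5)
```

Swapping the softmax head for a single linear unit and a mean-squared-error loss would turn the same backbone into the dollar-estimate regressor Mike gestures at.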
>> And I've read about like insurance use cases like that, where the customer could, after they sort of have a crack-up, take pictures all around the car, and then the insurance company could provide an estimate, tell them where the nearest repair shops are-- >> Yeah, but right now it's like the early days of e-commerce, where you could send an order in and then it would fax it and they'd type it in. So, I think, yes, insurance companies are taking those pictures, and the question is can we automate it, and-- >> Well, let me actually iterate on that question, which is: who can build a more end-to-end solution, assuming, you know, there's a lot of heavy lifting that's got to go on for each enterprise trying to build a use case like that. Is it internal development, and only at big companies that have a few of these data science gurus? Would it be like an IBM Global Services or an Accenture, or would it be like a vertical ISV where it's semi-custom, semi-packaged? >> I think it's both, but I also think it's two or three people walking around this conference, right, understanding Spark, maybe understanding how to use TensorFlow in conjunction with Spark, that will start to come up with these ideas as well. So I think-- I think we'll see all of those solutions. Certainly, like IBM with their cognitive computing-- oh, and by the way, we think that cognitive computing equals pragmatic AI, right, because it has similar characteristics. So, we're already seeing the big ISVs and the big application developers, SAP, Oracle, creating AI-infused applications or modules, but yeah, we're going to see small ISVs do it. There's one in Austin, Texas, called InteractiveTel. It's like 10 people. What they do is they use the Google-- so they sell to large car dealerships, like Ernie Boch. And they record every conversation, every phone conversation with customers. They use the Google pre-trained model to convert the speech to text, and then they use their own machine learning to analyze that text to find out if there's a customer service problem or if there's a selling opportunity, and then they alert managers or other people in the organization. So, small company, very narrowly focused on something like car buying. >> So, I wonder if we could come back to something you said about pragmatic AI. We love to have someone like you on the Cube, because we like to talk about the horses on the track. So, if Watson is pragmatic AI, and we all-- well, I think you saw the 60 Minutes show, I don't know, whenever it was, three or four months ago, and IBM Watson got all the love. They barely mentioned Amazon and Google and Facebook, and Microsoft didn't get any mention. So, and there seems to be sentiment that, OK, all the real action is in Silicon Valley. But you've got IBM doing pragmatic AI. Do those two worlds come together in your view? How does that whole market shake out? >> I don't think they come together in the way I think you're suggesting. I think what Google, Microsoft, Facebook, what they're doing is they're churning out fundamental technology, like one of the most popular deep learning frameworks, TensorFlow, is a Google thing that they open sourced. And as I pointed out, those image APIs, that Amazon has, that's not going to work for insurance, that's not going to work for radiology.
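The InteractiveTel pattern, a rented pre-trained service in front and a small in-house classifier behind it, fits in a few lines. In this sketch the speech-to-text call is deliberately left as a stub, since the interview doesn't say which provider API the company calls; the training phrases and the scikit-learn model are invented stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def transcribe(audio_path: str) -> str:
    """Stub for the pre-trained speech-to-text service (the "Google
    model" in the interview). Wire in your provider of choice here."""
    raise NotImplementedError


# Tiny invented training set: 1 = alert a manager.
calls = [
    ("i want to speak to your manager about this repair", 1),
    ("this is the third time my car came back still broken", 1),
    ("do you have the new model in blue on the lot", 0),
    ("what financing options do you offer on trucks", 0),
]
texts, labels = zip(*calls)

# The in-house part: a deliberately simple text classifier that is
# cheap to retrain as new call transcripts get labeled.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression())
clf.fit(texts, labels)

flag = clf.predict(["my salesperson never called me back"])[0]
print("alert a manager" if flag else "no action")
```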
So, I don't think they're in-- >> George Gilbert: Facebook's going to apply it differently-- >> Yeah, I think what they're trying to do is apply it to the millions of consumers that use their platforms, and then I think they throw off some of the technology for the rest of the world to use, fundamentally. >> And then the rest of the world has to apply those. >> Yeah, but I don't think they're in the business of building insurance solutions or building logistical solutions. >> Right. >> But you said something that was really, really potentially intriguing, which was: you could take the horizontal Google speech-to-text API, and then-- >> Mike Gualtieri: And recombine it. >> --put your own model on top of that. And techies call that like ensemble modeling, but essentially you're taking almost like an OS-level service, and you're putting a more vertical application on top of it, to relate it to our old ways of looking at software, and that's interesting. >> Yeah, because what we're talking about right now, this conversation, is now about applications. Right, we're talking about applications, which need lots of different services recombined, whereas mostly the data science conversation has been narrowly about building one customer lifetime value model or one churn model. Now the conversation, when we talk about AI, is becoming about combining many different services and many different models. >> Dave Vellante: And the platform for building applications is really-- >> Yeah, yeah. >> And that platform, the richest platform, or the platform that is most attractive, has the most building blocks to work with, or the broadest ones? >> The best ones, I would say, right now. The reason why I say it that way is because this technology is still moving very rapidly. So for image analysis, deep learning; very good for images, nothing's better than deep learning for image analysis. But if you're doing business process models or like churn models, well, deep learning hasn't played out there yet. So, right now I think there's some fragmentation. There's so much innovation. Ultimately it may come together. What we're seeing is, many of these companies are saying, OK, look, we're going to bring in the open source. It's pretty difficult to create a deep learning library. And so, you know, a lot of the vendors in the machine learning space, instead of creating their own, they're just bringing in MXNet or TensorFlow. >> I might be thinking of something from a different angle, which is not what underlying implementation they're using, whether it's deep learning or whether it's just random forests, or whatever the terminology is, you know, the traditional statistical stuff. The idea, though, is you want a platform-- like way, way back, Windows, with the Win32 API, had essentially more widgets for helping you build graphical applications than any other platform. >> Mike Gualtieri: Yeah, I see where you're going. >> And I guess I'm thinking it doesn't matter what the underlying implementation is, but how many widgets can you string together? >> I'm totally with you there, yeah. And so I think what you're saying is, look, a platform that has the most capabilities, but abstracts the implementations, and can, you know, be somewhat pluggable-- right, good, to keep up with the innovation, yeah. And there's a lot of new companies out there, too, that are tackling this.
One of them's called Bonsai AI, you know, small startup, they're trying to abstract deep learning, because deep learning right now, like TensorFlow and MXNet, that's a little bit of a challenge to learn, so they're abstracting it. But so are a lot of the-- so is SAS, IBM, et cetera. >> So, Mike, we're out of time, but I want to talk about your talk tomorrow. So, AI meets Spark, give us a little preview. >> AI meets Spark. Basically, the prerequisite to AI is a very sophisticated and fast data pipeline, because just because we're talking about AI doesn't mean we don't need data to build these models. So, I think Spark gives you the best of both worlds, right? It's designed for these sort of complex data pipelines that you need to prep data, but now, with MLlib for more traditional machine learning, and now with their announcement of TensorFrames, which is going to be an interface for TensorFlow, now you've got deep learning, too. And you've got it in a cluster architecture, so it can scale. So, pretty cool. >> All right, Mike, thanks very much for coming on the Cube. You know, way to go Pats, awesome. Really a pleasure having you back. >> Thanks. >> All right, keep right there, buddy. We'll be back with our next guest right after this short break. This is the Cube. (peppy music)
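Mike's closing point, that a sophisticated data pipeline is the prerequisite, is the part most teams can act on with nothing beyond stock Spark. A hedged sketch: Spark does the cleanup-heavy preparation and MLlib does the traditional learning, while the TensorFrames hand-off he previews would consume the same prepared DataFrame (its API isn't shown here, since at the time it was only just announced). Paths and schema are invented:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("pipeline-first").getOrCreate()

raw = spark.read.json("/data/events")

# The "sophisticated and fast data pipeline": dedupe, drop bad rows,
# derive a time feature.
clean = (raw.dropDuplicates(["event_id"])
            .filter(F.col("amount") >= 0)
            .withColumn("hour", F.hour("ts")))

# Traditional machine learning via MLlib on the prepared data; a
# TensorFlow model fed through TensorFrames would slot in at the same
# point, downstream of the prep.
stages = [
    Imputer(inputCols=["amount"], outputCols=["amount_filled"]),
    VectorAssembler(inputCols=["amount_filled", "hour"],
                    outputCol="features"),
    RandomForestClassifier(labelCol="label", featuresCol="features"),
]
model = Pipeline(stages=stages).fit(clean)
```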
ENTITIES
Entity | Category | Confidence |
---|---|---|
IBM | ORGANIZATION | 0.99+ |
Mike Gualtieri | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
George Gilbert | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
two | QUANTITY | 0.99+ |
Mike | PERSON | 0.99+ |
Red Sox | ORGANIZATION | 0.99+ |
Boston | LOCATION | 0.99+ |
Star Wars | TITLE | 0.99+ |
10 years | QUANTITY | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
two terms | QUANTITY | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
Yankees | ORGANIZATION | 0.99+ |
10 people | QUANTITY | 0.99+ |
Superbowl | EVENT | 0.99+ |
last year | DATE | 0.99+ |
IBM Global Services | ORGANIZATION | 0.99+ |
one | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
Ex Machina | TITLE | 0.99+ |
Boston, Massachusetts | LOCATION | 0.99+ |
Win32 | TITLE | 0.99+ |
first question | QUANTITY | 0.99+ |
Austin, Texas | LOCATION | 0.99+ |
19% | QUANTITY | 0.99+ |
millions | QUANTITY | 0.99+ |
yesterday | DATE | 0.99+ |
three | DATE | 0.99+ |
58% | QUANTITY | 0.99+ |
Forrester Research | ORGANIZATION | 0.99+ |
three people | QUANTITY | 0.99+ |
Spark | TITLE | 0.99+ |
One | QUANTITY | 0.99+ |
SAS | ORGANIZATION | 0.98+ |
tomorrow | DATE | 0.98+ |
three services | QUANTITY | 0.98+ |
Databricks | ORGANIZATION | 0.98+ |
Spark Summit | EVENT | 0.98+ |
both worlds | QUANTITY | 0.98+ |
TensorFrames | TITLE | 0.97+ |
MLlib | TITLE | 0.97+ |
SAP | ORGANIZATION | 0.97+ |
today | DATE | 0.96+ |
each enterprise | QUANTITY | 0.96+ |
TensorFlow | TITLE | 0.96+ |
four months ago | DATE | 0.95+ |
two worlds | QUANTITY | 0.95+ |
Windows | TITLE | 0.95+ |
Cube | COMMERCIAL_ITEM | 0.94+ |
late last year | DATE | 0.93+ |
Ernie Boch | PERSON | 0.91+ |