Adam Wilson & Joe Hellerstein, Trifacta - Big Data SV 17 - #BigDataSV - #theCUBE


 

>> Announcer: Live from San Jose, California. It's theCUBE covering Big Data Silicon Valley 2017.
>> Okay, welcome back everyone. We are here live in Silicon Valley for Big Data SV (mumbles) event in conjunction with Strata + Hadoop. Our companion event is Big Data NYC, and we're here breaking down the Big Data world as it evolves and goes to the next level up on the step function: AI, machine learning, IOT really forcing people to really focus on a clear line of sight on the data. I'm John Furrier with our analyst from Wikibon, George Gilbert, and our next guests, two executives from Trifacta: the founder and Chief Strategy Officer, Joe Hellerstein, and Adam Wilson, the CEO. Guys, welcome to theCUBE. Welcome back.
>> Great to be here.
>> Good to be here.
>> Founder, co-founder?
>> Co-founder.
>> Co-founder. He's one of multiple co-founders. I remember it 'cause you guys were one of the first sites that have the (mumbles) in the about section on all the management team. Just to show you how technical you guys are. Welcome back.
>> And if you're Trifacta, you have to have three founders, right? So that's part of the tri, right?
>> The triple threat, so to speak. Okay, so a big year for you guys. Give us the update. I mean, also we had Alation announce this partnering going on and some product movement.
>> Yup.
>> But it's a turbulent time right now. You have a lot of things happening in multiple theaters, from technical theater to business theater, and also within the customer base. It's a land grab, it seems, on the metadata and who's going to control what. What's happening? What's going on in the marketplace and what's the update from you guys?
>> Yeah, yeah. Last year was an absolutely spectacular year for Trifacta. It was four times growth in bookings, three times growth in customers. You know, it's been really exciting for us to see the technology get in the hands of some of the largest companies on the planet and to see what they're able to do with it. From the very beginning, we really believed in this idea of self service and democratization. We recognize that the wrangling of the data is often where a lot of the time and the effort goes. In fact, up to 80% of the time and effort goes there in a lot of these analytic projects, and to the extent that we can help take the data from (mumbles) in a more productive way and allow more people in an organization to do that, that's going to create information agility that we feel really good about, and our customers are telling us it's having an impact on their use of Big Data and Hadoop. And I think you're seeing that transition where, you know, in the very beginning there was a lot of offloading, a lot of like, hey, we're going to grab some cost savings, but then at some point, people scratched their heads and said, well, wait a minute. What about the strategic asset that we were building? That was going to change the way people work with the data. Where is that piece of it? And I think as people started figuring out that in order to get our (mumbles), we've got to have users and use cases on these clusters, and the data lake itself is not a use case, tools like Trifacta have been absolutely instrumental in really fueling that maturity in the market, and we feel great about what's happening there.
>> I want to drill down some more before we get to some of these questions for Joe, too, because I think you mentioned, you've got some quotes. I just want to double-click on that.
It always comes up in the business model question for people. What's your business model?
>> Sure.
>> And doing democratization is really hard. Sometimes democratization doesn't appear until years later, so it's one of those elusive things. You see it and you believe it, but then making it happen are two different things.
>> Yeah, sure.
>> So. And appreciate that the vision they-- (mumbles) But ultimately, at the end of the day, that business model comes down to how you're organized. Proof points.
>> Yup.
>> Customers, partnerships.
>> Yeah.
>> We had Alation on, Stephanie (mumbles). Can you share, and just connect the dots on the business model?
>> Sure.
>> With respect to the product, customers, partners. How is that specifically evolving?
>> Adam: Sure.
>> Give some examples.
>> Sure, yeah. And I would say kind of-- we felt from the beginning that, you know, we wanted to turn what was traditionally a very complex, messy problem dealing with data into, you know, a user experience problem that was powered by machine learning, and so a lot of it came down to, you know, how we were going to build and architect the technology needed (mumbles) for really getting the power in the hands of the people who know the data best. But it's important, and I think this is often lost in Silicon Valley where the focus on innovation is all around technology, to recognize that the business model also has to support democratization, so one of the first things we did coming in was to release a free version of the product. So Trifacta Wrangler, that is now being used by over 4500 companies, tens of thousands of users, and the power of that in terms of getting people something of value that they could start using right away on spreadsheets and files and small data and allowing them to get value -- but then also, for us, the exchange is that we're actually getting a chance to curate at scale usage data across all of these--
>> Is this a (mumbles) product?
>> It's a hybrid product.
>> Okay.
>> So the data stays local. It never leaves their local laptop. The metadata is hashed and put into the cloud and now we're--
>> (mumbles) to that.
>> Absolutely. And so now we can use that as training data, so that as more people wrangle, the product itself gets smarter based on that.
>> That's good.
>> So that's creating real tangible value for customers, and for us it's a source of very strategic advantage, and so we think that combination of the technology innovation, but also making sure that we can get this in the hands of users, and they can get going, and as their problem grows up to be bigger and more complicated -- not just spreadsheets and files on the desktop but something more complicated -- then we're right there along with them.
>> How about partnerships with Alation? How they (mumbles)? What are all the deals you've got going on there?
>> So Alation has been a great partner for us for a while, and we've really deepened the integration with the announcements today. We think that cataloging and data wrangling are very complementary and they're a natural fit. We've got customers like Munich Re, like eBay, as well as MarketShare, that are using both solutions in concert with one another, and so we really felt that it was natural to tighten that coupling and to help people go from inventorying what's going on in their data lakes and their clusters to then cleansing, standardizing -- essentially making it fit for purpose -- and then ensuring that metadata can roundtrip back into the catalog.
And so that's really been an extension of what we're doing also at the technical level with technologies like Cloudera Navigator, with Atlas, and with the project that Joe's involved with at Berkeley called Ground. So I don't know if you want to talk--
>> Yeah, tell him about Ground.
>> Sure. So part of our outlook on this, and this speaks to the way that the landscape in the industry is shaking out, is that we're not going to see customers buying into sort of lock-in on the key components of the area for (mumbles). So for example, storage: HDFS. This is open, and that's key, I think, for all the players in this space -- HDFS is not a product from a storage vendor. It's an open platform, and you can change vendors along the way, and you can roll your own, and so on. So metadata, to my mind, is going to move in the same direction. The storage of metadata, the basic componentry that keeps the metadata, that's got to be open to give people the confidence that they're going to pour the basic descriptions of what's in their business and what their people are doing into a place that they know they can count on, and it will be vendor neutral. So the catalog vendors are, in my mind, providing functionality above that basic storage that relates to how do you search the catalog, what does the catalog do for you to suggest things, to suggest data sets that you should be looking at. So that's value-add on top, but below that, what we're seeing is Hortonworks and Cloudera coming out with either products or open source in sort of the metadata space, and what would be a shame is if the two vendors ended up kind of pointing guns inward and kind of killing the metadata storage. So one of the things that I got interested in, in my dual role as a professor at Berkeley and also as a founder of a company in this space, was we want to ensure that there's a free, open, vendor-neutral metadata solution. So we began building out a project called Ground, which is both a platform for metadata storage that can be sitting underneath catalog vendors and other metadata value-adds, and it's also a platform for research, much as we did with Spark previously at Berkeley. So Ground is a project in our new lab at Berkeley, the RISELab, which is the successor to the AMPLab that gave us Spark. And Ground has now got, you know, collaborators from Cloudera, from LinkedIn. Capital One has significantly invested in Ground and is putting engineers behind it, and contributors are coming also from some startups to build out an open-source platform for metadata.
>> How long has Ground been around?
>> Joe: Ground's been around for about 12 months. It's very--
>> So it's brand new. How do people get involved?
>> Brand new.
>> Just standard, similar to the way the AMPLab was? Just jump in and--
>> Yeah, you know--
>> Go away and--
>> It's up on GitHub. There's (mumbles) to go download and play with. It's in alpha. And you know, we hope we (mumbles) and the usual open source stuff.
>> This is interesting. I like this idea, because one thing you've been riffing on, on theCUBE, all the time is how do you make data addressable? Because ultimately, you know, real time, you need to have access to data really, really low (mumbles) to see the inside to make it work. Hence the data swamp problem, right? So, how do you guys see that? 'Cause now I can just pop in. I can hear the objections. Oh, security! You know. How do you guys see the protections?
I'd love to help get my data in there and get something back in return in a community model. Security? Is it the hashing? What's the-- How do you get any security (mumbles)? Or what are the issues?
>> Yeah, so I mean, the straightforward issues are the traditional issues of authorization and encryption, and those are issues that are reasonably well plumbed out in the industry, and you can go out and you can take the solutions from people like Cloudera or from Hortonworks, and those solutions plug in quite nicely, actually, to a variety of platforms. And I feel like that level of enterprise security is understood. It's work for vendors to work with that technology, so when we went out, we made sure we were Kerberized in all the right ways at Trifacta to work with these vendors, and that we integrated well with Navigator, we integrated with Atlas. That was, you know, there was some labor there, but it's understood. There's also--
>> It's solvable, basically.
>> It's solvable basically, and pluggable. There are research questions there which, you know, on another day we could talk about, but for instance, if you don't trust your cloud hosting service, what do you do? And that's like an open area that we're working on at Berkeley. Intel SGX is a really interesting technology, and that's probably a topic for another day.
>> But you know, I think it's important--
>> The sooner we get you into the studio in Palo Alto, we'd love to drill down on that.
>> I think it's important, though, that, you know, when we talk about self service, the first question that comes up is, I'm only going to let you self service as far as I can govern what's going on, right? And so I think those things--
>> Restrictions, guard rails--
>> Really go hand in hand here.
>> How about handcuffs.
>> Yeah so, right. Because that's always the first thing that kind of comes out, where people say, okay, wait a minute now-- if I've now got, you know-- you've got an increasing number of knowledge workers who think that it is their-- and believe that it is their unalienable right to have access to data.
>> Well, that's the (mumbles) democratization. That's the top down, you know, governance control point.
>> So how do you balance that? And I think you can't solve for one side of that equation without the other, right? And that's really, really critical.
>> Democratization is anarchization, right?
>> Right, exactly.
>> Yes, exactly. But it's hard, though. I mean, you look at all the big trends where there was, you know, web 1.0 data, web (mumbles), all had those democratization trends, but they took six years to play out, and I think there might be more acceleration with cloud, to your point about this new stuff. Okay, George, go ahead. You might get in there.
>> I wanted to ask you about, you know, what we were talking about earlier and what customers are faced with, which is, you know, a lot of choice and specialization, because building something end to end and having it fully functional is really difficult. So... What are the functional points where you start driving the guard rails in that IT cares about, and then what are the user experience points where you have critical mass so that the end users then draw other compliant tools in? You with me? On sort of the IT side and the user side, and then which tools start pulling those standards?
>> Well, I would say at the highest level, to me, what's been very interesting, especially with what's happened in open source, is that people have now gotten accustomed to the idea that, like, I don't have to go buy a big monolithic stack where the innovation moves only as fast as the slowest product in the stack or the portfolio. I can grab onto things, and I can download them today and be using them tomorrow. And that has, I think, changed the entire approach that companies like Trifacta are taking to how we build and release product to market, how we interoperate with partners like Alation and Waterline, and how we integrate with the platform vendors like Cloudera, MapR, and Hortonworks, because we recognize that we are going to have to be maniacally focused on one piece of this puzzle and go very, very deep, but then play incredibly well, both, you know, with all the rest of the ecosystem, and so I think that has really colored our entire product strategy and how we go to market, and I think customers, you know, they want the flexibility to change their minds, and the subscription model is all about that, right? You've got to earn it every single year.
>> So what's the future of (mumbles)? 'Cause that brings up a good point. We were kind of critical of Google, and you mentioned you guys had-- I saw in some news that you guys were involved with Google.
>> Yup.
>> Being enterprise ready is not just, hey, we have the great tech and you buy from us, damn it, we're Google.
>> Right.
>> I mean, you have to have sales people. You have to have automation mechanisms to create great product. Will the future of wrangling and data prep go into-- where does it end up? Because enterprises want, they want certain things. They're finicky about things.
>> Right, right.
>> As you guys know. So how does the future of data prep deal with the, I won't say the slowness of the enterprise, but they're more conservative, more SLA driven than they are price performance.
>> But they're also more fragmented than ever before, and you know, while that may not be a great thing for the customers, for a company that's all about harmonizing data, that's actually a phenomenal opportunity, right? Because we want to be the decision that customers make that guarantees that all their other decisions are changeable, right? And I go and--
>> Well, they have legacy systems of record. This is the challenge, right? So I've got the old Oracle monolithic--
>> That's fine. And that's good--
>> So how do you--
>> The more the merrier, right?
>> Does that impact you guys at all? How did you guys handle that situation?
>> To me, to us, that is more fragmentation, which creates more need for wrangling, because that introduces more complexity, right?
>> You guys do well in that environment.
>> Absolutely. And that, you know, is only getting bigger, worse, and more complicated, and especially as people go from (mumbles) to cloud, as people start thinking about moving from just looking at transactions to interactions to now looking at behavior data and the IOT--
>> You're welcome in that environment.
>> So we welcome that. In fact, that's where-- we went to solve this problem for Hadoop and Big Data first, because we wanted to solve the problems at scale that were the most complicated, and over time we can always move downstream to sort of more structured and smaller data, and that's kind of what's happened with our business.
>> I guess I want to circle back to this issue of which part of this value chain of refining data is-- if I'm understanding you right, the data wrangling is the anchor, and once a company has made that choice, then all the other tool choices have to revolve around it? Is that a--
>> Well, think about it this way. I mean, the bulk of the time, when you talk to the analysts, and also the bulk of the labor cost in these things, is in getting the data from its raw form into usage. That whole process of wrangling -- which is not really just data prep, it's all the things you do all day long to kind of massage these data sets and get 'em from here to there and make 'em work -- that space is where the labor cost is. That also means that space is where the value-add is, because that's where your people power, or your business context, is really getting poured in to understand what do I have, what am I doing with it, and what do I want to get out of it. As we move from bottom line IT to top line value generation with data, it becomes all the more so, right? Because now it's not just a matter of getting the reports out every month. It's also, what did that brilliant analyst in sales do to that dataset to get that much lift? I need to learn from her and do a similar thing. Alright? So, that whole space is where the value is. What that means is that, you know, you don't want that space to be tied to a particular BI tool or a particular execution engine. So when we say that we want to be the decision in the middle that enables all the other decisions, what you really want to make sure is that that work process in there is not tightly bound to the rest of the stack. Okay? And so you want to particularly pick technologies in that space that will play nicely with different storage, that play nicely with different execution environments. Today it's Hadoop, tomorrow it's Amazon, the next day it's Google, and they have different engines back there, potentially. And you certainly want it to play nicely with all the analytics and visualizations--
>> So decouple from all that?
>> You want to decouple that, and you want to not lock yourself in, 'cause that's where the creativity's happening on the consumption side, and that's where the mess that you talked about is just growing on the production side. So data production is just getting more complicated. Data consumption's getting more interesting.
>> That's actually a really, really cool good point.
>> Elaborating on that, does that mean that you have to open up interfaces at either the UI layer or at the sort of data definition layer? Or does that just mean other companies have to do the work to tie in to the styles? The styles and structures that you have already written?
>> In fact, it's sort of the opposite. We do the work to tie in to a lot of this, these other decisions in this infrastructure, you know. We don't pretend for a minute that people are going to sort of pick a solution like Trifacta and then build their organization around it. To your point, there's tons of legacy technology out there. There are all kinds of things moving. Absolutely. So a big part of being the decoder ring for data, for Trifacta, is saying, listen, we are going to interoperate with your existing investments, and we're going to make sure that you can always get at your data, you can always take it from whatever state it's in to whatever state you need it to be in, and you can change your mind along the way.
And that puts a lot of onus on us, and that's the reason why we have to be so focused on this space, and not jump into visualization and analytics, and not jump into storage and processing, and not try to do the other things to the right or left. Right?
>> So final question. I'd like you guys both to take a stab at it. You know, just going to pivot off of what Joe was saying. Some of the most interesting things are happening in the data exploration kind of discovery area, from creativity to insights to game changing stuff.
>> Yup.
>> Ventures potentially.
>> Joe: Yup.
>> The problem of the complexity, that's in conflict.
>> Yeah.
>> So how do we resolve this? I mean, besides the Trifacta solution, which you guys are taming, creating a platform for that, how do people in industry work together to solve that problem? What's the approach?
>> So I think actually there's a couple sort of heartening trends on this front that make me pretty optimistic. One of these is that the incentive structures in the enterprises we work with are becoming quite aligned between IT and the line of business. It's no longer the case that the lines of business are these annoying people that are distracting IT from their bottom line function. IT's bottom line function is being translated into a what's-your-value-for-the-business question, and the answer for a savvy IT management person is, I will try to empower the people around me to be rabid fans, and I will also try to make sure that they do their own work so I don't have to learn how to do it for them. Right? And so, that I think is happening--
>> And to the business guys, IT is (mumbles) a bunch of annoying guys who don't get what I need, right? So it works both ways, right?
>> It does, it does. And I see that that's improving sort of in the industry as the corporate missions around data change, right? So it's no longer that the IT guys really only need to take care of executives and everyone else doesn't matter. Their function really is to serve the business, and I see that alignment. The other thing that I think is a huge opportunity, and part of why we're excited to be so tightly coupled with Google and also have our stuff running in Amazon and at Microsoft, is as people re-platform to the cloud, a lot of legacy gets shed, or at least becomes deprecated. And so there is a real--
>> Or containerized, or some sort of microservice.
>> Yeah.
>> Right, right.
>> And so, people are peeling off business function, and as part of that cost savings to migrate it to the cloud, they're also simplifying. And you know, things will get complicated again.
>> What about the (mumbles) solution architects out there that kind of reboot their careers? Because the old way was, hey, I've got networks, I've got apps and stacks, and so that gives the guys who could be the new heroes coming in.
>> Right.
>> And thinking differently about enabling that creativity.
>> In the midst of all that, everything you said is true. IT is a massive place and it always will be. And tools that can come in and help are absolutely going to be (mumbles).
>> This is obvious now. The tension's obviously eased a bit in the sense that there's clear line of sight that top line and bottom line are working together now. You mentioned that earlier. Okay. Adam, take a stab at it. (mumbling)
>> I was just going to-- hey, I know, it's great. I was just going to give an example, I think, that illustrates that point. So you know, one of our customers is Pepsi.
And Pepsi came to us and they said, listen, we work with retailers all over the world, and their reality is that, when they place orders with us, they often get it wrong. And sometimes they order too much, and then they return it, it spoils, and that's bad for us. Or they order too little, and they stock out, and we miss revenue opportunities. So they said, we actually have to be better at demand planning and forecasting than the orders that are literally coming in the door. So how do we do that? Well, we're getting all of the customers to give us their point of sale data. We're combining that with geospatial data, with weather data. We're looking at historical data and industry averages, but as you can see, they were like-- we're stitching together data across a whole variety of sources, and they said the best people to do this are actually the category managers and the people responsible for the brands, 'cause they literally live inside those businesses and they understand it. And so what happened was, the IT organization was saying, look, listen, we don't want to be the people doing the janitorial work on the data. We're going to give that work over to people who understand it, and they're going to be more productive and get to better outcomes with that information, and that frees us up to go find new and interesting sources. And I think that collaborative model that you're starting to see emerge -- where they can now be the data heroes in a different way, by not being the bottleneck on provisioning, but rather going out and figuring out, how do we share the best stuff across the organization? How do we find new sources of information to bring in that people can leverage to make better decisions? -- that's an incredibly powerful place to be, and you know, I think that that model is really what's going to be driving a lot of the thinking at Trifacta and in the industry over the next couple of years.
>> Great. Adam Wilson, CEO of Trifacta. Joe Hellerstein, CTO-- Chief Strategy Officer of Trifacta and also a professor at Berkeley. Great story. Getting the (mumbles) right is hard, but under the hood, stuff's complicated, and again, congratulations on sharing the Ground project. Ground, open source -- open source lab kind of thing at-- in Berkeley. Exciting new stuff. Thanks so much for coming on theCUBE. I appreciate it, great conversation. I'm John Furrier, George Gilbert. You're watching theCUBE here at Big Data SV in conjunction with Strata + Hadoop. Thanks for watching.
>> Great.
>> Thanks guys.
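
One technical detail from the conversation is worth making concrete: the hybrid model Wilson describes for the free Wrangler product, where the data itself never leaves the laptop and only hashed metadata travels to the cloud to serve as training data. Below is a rough sketch of how that kind of privacy-preserving telemetry can work. It is an illustration under stated assumptions, not Trifacta's actual scheme; the salt handling, field names, and payload shape are all invented:

```python
import hashlib
import json

SALT = "per-install-random-salt"  # hypothetical; a real client would generate this once

def hash_token(value: str) -> str:
    # One-way hash so column names and values never leave the laptop in the clear.
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

def wrangle_telemetry(column_name: str, inferred_type: str, transform: str) -> str:
    # Ship only hashed identifiers plus structural facts about the transform.
    payload = {
        "column": hash_token(column_name),  # unreadable on the server side
        "type": inferred_type,              # e.g. "date", "zipcode"
        "transform": transform,             # e.g. "split_on_delimiter"
    }
    return json.dumps(payload)

# The cloud side can learn patterns ("columns inferred as dates usually get
# normalized to ISO-8601") across thousands of users without seeing any data.
print(wrangle_telemetry("order_date", "date", "normalize_to_iso8601"))
```

The design payoff is the one Wilson names: aggregate usage across tens of thousands of users becomes training data for suggestions, while the underlying datasets stay local.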

Published Date : Mar 16 2017


Josh Rogers, Syncsort - Big Data SV 17 - #BigDataSV - #theCUBE


 

>> Announcer: Live from San Jose, California, it's The Cube covering Big Data Silicon Valley 2017. (innovative music)
>> Welcome back, everyone. Live in Silicon Valley, this is The Cube's coverage of Big Data SV, our event in Silicon Valley, in conjunction with our Big Data NYC for New York City. Every year, twice a year, we get our event going around Strata Hadoop in conjunction with those guys. I'm John Furrier with SiliconANGLE, with George Gilbert, our Wikibon (mumbles). Our next guest is Josh Rogers, the CEO of Syncsort -- been on many times, a Cube alumni -- the firm that acquired Trillium, which we talked about yesterday. Welcome back to The Cube, good to see you.
>> Good to see you, how are ya?
>> So Syncsort is just one of those companies that's really interesting. We were talking about this. I want to get your thoughts on this, because I'm not sure if it was in the plan or not, or really ingenious moves by you guys on the management side, but legacy business -- locking down legacy environments, like the mainframe, and then transforming into a modern data company. Was that part of the plan, or kind of on purpose by accident? Or what's--
>> Part of the plan. You think about what we've been doing for the last 40 years. We had specific capabilities around managing data at scale and around helping customers who process that data to get more value out of it through analytics, and we've just continually moved through the various kind of generations of technology to apply that same discipline in new environments, and big data has frankly been a terrific opportunity for us to apply that same technical and talented DNA in that new environment. It's kind of been running the same game plan. (talking over each other)
>> You guys have good execution, but I think one of the things we were pointing out, and this is one of those things where, certainly, I live in Palo Alto in Silicon Valley. We love innovation. We love all the shiny new toys, but you get tempted to go after something really compelling, cool, and relevant, and then go, "Whoa, I forgot about locking down some of the legacy data stuff," and then you're kind of working down, and you guys took a different approach. You're going into the trends from a solid foundation. That's a different execution approach and, like you said, by design, so that's working.
>> Yeah, it's definitely working, and I think it's also kind of focused on an element that maybe is under-reported, which is a lot of these legacy systems aren't going away, and so one of the big challenges--
>> And this is for the record, by the way.
>> Right. (talking over each other) How do I integrate those legacy environments with these next-generation environments? And to do that, you have to have expertise on both sides, and so one of the things I think we've done a good job of is developing that big data expertise and then turning around and saying we can solve that challenge for you, and obviously, the big iron to big data solutions we bring to market are a perfect example of that, but there's additional solutions that we can provide customers, and we'll talk more about those in a few--
>> Talk about the Trillium acquisition. I want to just-- you take a minute to describe-- you bought a company called Trillium. What is it? Just take a minute to explain what it is and why it's relevant.
>> Trillium is a really special company. They are the independent leader in data quality and have been for many years.
They've been in the top-right of the Gartner Magic Quadrant for more than a decade, and really, when you look at large, complex, global enterprises, they are the kind of gold standard in data quality. And when I say data quality, what I mean is an ability to take a dataset, understand the issues with that dataset, and then establish business rules to improve the quality of that data so you can actually trust that data. Obviously that's relevant in a near-adjacency to the data movement and transformation that Syncsort's been known for for so long. What's interesting about it is, you think about the development and the maturity of big data environments, specifically Hadoop: people have a desire to obviously do analytics on that data, and implicit in that is the ability to trust that data, and the way you get there is being able to apply profiling and quality rules in that environment, and that's an underserved market today. When we thought about the Trillium acquisition, it was partly, "Hey, this is a great firm that has so much respect in the space, and so much talented capability, a powerful capability and market-leading data quality talent, but also, we have an ability to apply it in this next generation environment, much like we did in the ETL and data movement space." And I think that the industry is at a point where enterprises are realizing, "I'm going to need to apply the same data management disciplines to make use of my data in my next generation analytics environment that I did in my data warehouse environment." Obviously, there's different technologies involved. There's different types of data involved. But those disciplines don't go away, and being able to improve the quality and be able to kind of build integrity in your datasets is critical, and Trillium has best-in-market capabilities in that respect.
>> Josh, you were telling us earlier about sort of the strategy of knocking down the pins one by one, as, you know, it's become clear that we sort of took, first, the archive from the data warehouse, and then ETL off-loaded, now progressively more of the business intelligence. What are some of the, besides data quality, what are some of the other functions you have to--
>> There's the whole notion of metadata management, right? And that's incredibly important to support a number of key business initiatives that people want to leverage. There's different styles of movement of data, so a thing you'll hear a lot about is change data capture, right? So if I'm moving datasets from source systems into my Hadoop environment, I can move the whole set, but how do I move the incremental changes on an ongoing basis at the speed of business? There's notions of master data management, right? So how do I make sure that I understand and have a kind of gold standard of reference data that I can use to drive my own analytic capabilities? And then of course, there's all the analytics that people want to do, both in terms of visualization and predictive analytics. But you can think about all these as various engines that I need to apply to the data to get maximum value. And it's not so much that these engines aren't important anymore. It's, I can now apply them in a different environment that gives me a lot more flexibility, a lot more scale, a better cost structure, and an ability to kind of harness broader datasets. And so that's really our strategy: bring those engines to this new environment. There's two ways to do that.
One is build it from scratch, which is kind of a long process to get right when you're thinking about complex, global, large enterprise requirements. The other is to take existing, tested, proven, best-in-market engines and integrate them deeply in this environment, and that's the strategy we've taken. We think that offers a much faster time to value for customers to be able to maximize their investments in this next generation analytics infrastructure.
>> So who shares that vision, and sort of where are we in the race?
>> I think we're fairly unique in our approach of taking that approach. There's certainly other large platform players. They have a broad (mumbles) ability, and I think they're working on, "How do I kind of take that architecture and make it relevant?" It ends up creating a code-generation approach. I think that approach has limitations, and I think if you think about taking the core engine and integrating it deeply within the Hadoop ecosystem and Hadoop capabilities, you get a faster time to market and a more manageable solution going forward, and also one that gives you kind of future-proofing from underlying changes that we'll continue to see in the Hadoop components -- sort of the big data components, I guess, is a better articulation.
>> Josh, what's the take on the show this year and the trends? (mumbles) machine learning, and I've seen that. You guys look at your execution plan. What's the landscape happening out there in the show this year? I mean, we're starting to see more business outcome conversations about machine learning and AI. It's really putting pressure on the companies, and certainly IOT and cloud growth as a forcing function. Do you see the same thing? What's your thoughts?
>> So machine learning's a really powerful capability, and I think as it relates to the data integration kind of space, there's a lot of benefit to be had. Think about quality. If I have to establish a set of business rules to improve the quality of my data, wouldn't it be great if those rules could learn as they actually process datasets and see how they change over time? So there's really interesting opportunities there. We're seeing a lot of adoption of cloud. More and more customers are looking at, "How do I live in a world where I've got a piece of my operations on premise, I've got a piece of operations in cloud, manage those together, and gradually probably shift more into cloud over time?" So I'm doing a lot of work in that space. There's some basic fundamental recognitions that have happened, which is, if I stand up a Hadoop cluster, I am going to have to buy a series of tools to get value out of the data in that cluster. That's a good step forward from my perspective, because this notion of, I'm going to stand up a team off-shore and they're just going to build all these things--
>> Cost of ownership goes through the roof.
>> Yeah, so I think the industry's moved past this concept of "I make an investment in Hadoop. I don't need additional solutions."
>> It highlights something that we were talking about at Google Next last week about enterprise-ready, and I want to get your thoughts, 'cause you guys have a lot of experience -- something that's right in your wheelhouse. How you guys have attacked the market has been pretty impressive and not obvious, and on paper, it looks pretty boring, but you're doing great! I mean, you've done the right strategy, it works. Mainframe, locking in the mainframe, system of record. We've talked about this on The Cube.
Lots of videos going back three years, but enterprise-ready is a term now that's forcing people, even the best at Google, to, like, look in the mirror and say, "Wait a minute. We have a blind spot." Best tech doesn't always win. You've got table stakes; you've got SLAs; you've got mission-critical data quality. One piece of bad data that should be clean could really screw up something. So what's your thoughts on enterprise-ready right now?
>> I think that people are recognizing that to get a payoff on a lot of these investments in next generation analytic infrastructure, they're going to need to build and run mission-critical workloads there, and take on mission-critical kind of business initiatives and prove out the value. To do that, you have to be able to manage the environment, achieve the up-times, have the reliability and resiliency that, quite frankly, we've been delivering for 40 years, and so I think that's another kind of point in our value proposition that frankly seems to be so unique, which is, hey, we've been doing this for thousands of customers, the most sophisticated--
>> What are some of the ones that are going to be fatal flaws for people if they don't pay attention?
>> Well, security is huge. I think the manageability, right? So look, if I have to upgrade 25 components in my Hadoop cluster to get to the next version, and I need to upgrade all the tools, I've got to have a way to do that that allows me to not only get to the next level of capability that the vendors are providing, but also to do that in a way that doesn't maybe bring down all these mission-critical workloads that have to be 24 by seven. Those pieces are really important, and having both the experience and understanding of what that means, and also being able to invest the engineering resources to be able to--
>> And don't forget the sales force. You've got the DNA and the people on the streets. Josh, thanks for coming to The Cube, really appreciate it, great insight. You guys have, just to give you a compliment, great strategy, and again, good execution on your side, and you guys are in new territory. Every time we talk to you, you're entering something new, so great to see you. Syncsort here inside The Cube, always back sharing commentary on what's going on in the marketplace: AI, machine learning, with the table stakes in the enterprise, security and whatnot, still critical for execution, and again, IOT is really forcing the function of (mumbles). You've got to focus on the data. Thanks so much. I'm (mumbles). We'll be back with more live coverage after this break. (upbeat innovative music)
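
Of the engines Rogers lists, change data capture is the easiest to make concrete. The naive form of "moving the incremental changes on an ongoing basis" is a high-watermark poll against the source table. Here is a minimal sketch of that pattern -- the table, column names, and SQLite source are invented for illustration, and production CDC tools generally read the database's transaction log instead, precisely because a poll like this misses deletes and depends on a trustworthy timestamp column:

```python
import sqlite3
import time

def poll_changes(conn, last_seen):
    # High-watermark CDC: fetch only rows modified since the last sync.
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme', ?)", (time.time(),))

watermark = 0.0
changes, watermark = poll_changes(conn, watermark)
for row in changes:
    print("ship to Hadoop:", row)  # in practice: append to HDFS or a Kafka topic
```

Each polling cycle then repeats with the advanced watermark, which is what "at the speed of business" amounts to mechanically: the sync interval, not the full reload, sets the data's freshness.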

Published Date : Mar 16 2017


Donna Prlich, Pentaho, Informatica - Big Data SV 17 - #BigDataSV - #theCUBE


 

>> Announcer: Live from San Jose, California, it's theCUBE. Covering Big Data Silicon Valley 2017.
>> Okay, welcome back everyone. Here live in Silicon Valley, this is theCUBE. I'm John Furrier, covering our Big Data SV event, #BigDataSV, our companion event to Big Data NYC, all in conjunction with Strata Hadoop -- the Big Data world comes together. And great to have guests come by. Donna Prlich, who's the senior VP of products and solutions at Pentaho, a Hitachi company we've been following since before Hitachi acquired you guys. But you guys are unique in the sense that you're a company within Hitachi, left alone after the acquisition. You're now running all the products. Congratulations, welcome back, great to see you.
>> Yeah, thank you, good to be back. It's been a little while, but I think you've had some of our other friends on here, as well.
>> Yep, and we'll be at Pentaho World -- you have Orlando, I think, in October.
>> Yeah, October, so I'm excited about that, too, so.
>> I'm sure the agenda is not yet baked for that, because it's early in the year. But what's going on with Hitachi? Give us the update, because your purview is now the product roadmap. The Big Data world -- you guys have been very, very successful taking this approach to big data. It's been different and unique to others.
>> Yeah, so, very exciting, actually. So, we've seen, especially at the show, that the Big Data world -- we all know that it's here. It's monetizable, it's where we actually shifted five years ago, and it's been a lot of what Pentaho's success has been based on. We're excited because the Hitachi acquisition, as you mentioned, sets us up for the next big thing, which is IOT. And I've been hearing non-stop about machine learning, but that's the other component of it that's exciting for us. So, yeah, Hitachi, we're--
>> You guys doing a lot of machine learning, a lot of machine learning?
>> So we announced our own kind of orchestration capabilities that really target -- it's less about building models, and more about how do you enable the data scientists and data preparers to leverage the actual kind of intellectual property that companies have in those models they've built to transform their business. So we have our own, and then the other exciting piece on the Hitachi side is, on the products, we're now at the point where we're running as Pentaho, but we have access to these amazing labs, of which there's about 25 to 50 depending on where you are, whether you're here or in Japan. And those data scientists are working on really interesting things on the R & D side. When you apply those to the kind of use cases we're solving for, that's just like a kid in a candy store with technology, so that's a great--
>> Yeah, you had a built-in customer there. But before I get into Pentaho, focusing on what's unique that's really happening within you guys with the product, especially with machine learning and AI as it starts to really get some great momentum, I want to get your take on what you see happening in the marketplace. Because you've seen the early days, and it's now hitting a whole other step function as we approach machine learning and AI. Autonomous vehicles, sensors, everything's coming. How are enterprises and these new businesses, whether they're people supporting smart cities or a smart home or automotive, autonomous vehicles-- What are the trends you're seeing that are really hitting the pavement here?
>> Yeah, I think what we're seeing is, and it's been kind of Pentaho's focus for a long time now, which is, it's always about the data. You know, what's the data challenge? Some of it's the amounts of data, which everybody talks about, from IOT, and then what's interesting is, it's not about kind of the concepts around AI that have been around forever, but when you start to apply some of those AI concepts to a data pipeline, for instance. We always talk about that data pipeline. The reason it's important is because you're really bringing together the data and the analytics. You can't separate those two things, and that's been kind of not only a Pentaho-specific sort of bent that I've had for years, but a personal one, as well. That, hey, when you start separating it, it makes it really hard to get to any kind of value. So I think what we're doing, and what we're going to be seeing going forward, is applying AI to some of the things that, in a way, will close the gaps between the process and the people, and the data and the analytics, that have been around for years. And we see those gaps closing with some of the tools that are emerging around preparing data. But really, when you start to bring some of that machine learning into that picture, and you start applying math to preparing data, that's where it gets really interesting. And I think we'll see some of that automation start to happen.
>> So I've got to ask you, what is unique about Pentaho? Take a minute to share with the audience some of the unique things that you guys are doing that are different in this sea of people trying to figure out big data. You guys are doing well, and you wrote a blog post that I referenced earlier yesterday, around these gaps. What's unique about Pentaho, and what are you guys doing, with examples that you could share?
>> Yeah, so I think the big thing about Pentaho that's unique is that it's solving that analytics workflow from the data side. Always from the data. We've always believed that those two things go together. When you build a platform that's really flexible, it's based on open source technology, and you go into a world where a customer says, "I not only want to manage and have a data lake available," for instance, "I want to be able to have that thing extend over the years to support different groups of users. I don't want to deliver it to a tool, I want to deliver it to an application, I want to embed analytics." That's where having a complete end-to-end platform that can orchestrate the data and the analytics across the board is really unique. And what's happened is, it's like, the time has come, where all we're hearing is, hey, I used to think it was throw some data over and, "here you go, here's the tools." The tools are really easy, so that's great. Now we have all kinds of people that can do analytics, but who's minding the data? With that end-to-end platform, we've always been able to solve for that. And when you bring in the open source piece, that just makes it much easier when things like Spark emerge, right? Spark's amazing, right? But we know there's other things on the horizon. Flink, Beam -- how are you going to deal with that without being kind of open source? So this is--
>> You guys made a good bet there, and your blog post got my attention because of the title. It wasn't click bait either, it was actually a great article, and I just shared it on Twitter. The Holy Grail of analytics is the value between data and insight.
>> And this is interesting, it's about the data -- it's in bold, data, data, data. Data's the hardest part. I get that. But I've got to ask you, with cloud computing, you can see the trends of commoditization. You're renting stuff, and you've got tools like Kinesis, Redshift on Amazon, and Azure's got tools, so you don't really own that, but the data, you own, right?
>> Yeah, that's your intellectual property, right?
>> But that's the heart of your piece here, isn't it, the Holy Grail.
>> Yes, it is.
>> What is that Holy Grail?
>> Yeah, that Holy Grail is when you can bring those two things together: the analytics and the data, and you've got some governance, you've got the control, but you're allowing the access that lets the business derive value. For instance, we just had a customer -- I think Eric might have mentioned it, but they're a really interesting customer. They're one of the largest community colleges in the country, Ivy Tech, and they won an award, actually, for their data excellence. But what's interesting about them is, they said, we're going to create a data democracy. We want data to be available, because we know that we see students dropping out, we can't be efficient, people can't get the data that they need, we have old school reporting. So they took Pentaho, and they really transformed the way they think about running their organization and their community colleges. Now they're adding predictive to that. So they've got this data democracy, but now they're looking at things like, "Okay, we can see where certain classes are over capacity, but what if we could predict, next year, not only which classes are over capacity, but what's the tendency of a particular student to drop out? What could we do to intervene?" That's where the kind of cool machine learning starts to apply. Well, Pentaho is what enables that data democracy across the board. I think that's where, when I look at it from a customer perspective, it's really kind of, it's only going to get more interesting.
>> And with RFID and smartphones, you could have attendance tracking, too. You know, who's not showing up.
>> Yeah, absolutely. And you bring Hitachi into the picture, and you think about, for instance, from an IOT perspective, you might be capturing data from devices, and you've got a digital twin, right? And then you bring that data in with data that might be in a data lake, and you can set a threshold, and say, "Okay, not only do we want to be able to know where that student is," or whatever, "we want to trigger something back to that device," and say, "hey, here's a workshop for you to log in to right away, so that you don't end up not passing a class." Or whatever it is -- it's a simplistic model, but you can imagine where that starts to really become transformative.
>> So I asked Eric a question yesterday. It was from Dave Valante, who's in Boston, stuck in the snowstorm, but he was watching, and I'll ask you and see how it matches. He wrote it differently on CrowdChat, it was public, but this is in my chat: "HDS is known for mainframes, historically, and storage, but Hitachi is an industrial giant. How is Pentaho leveraging the Hitachi monster?"
>> Yes, that's a great way to put it.
>> Or Godzilla, because it's Japan.
>> We were just comparing notes. We were like, "Well, is it an $88 billion company or $90 billion?" According to the yen today, it's 88. We usually say 90, but close enough, right? But yeah, it's a huge company. They're in every industry. They make all kinds of things.
Pretty much, they've got the OT of the world under their belt. How we're leveraging it is, number one, what that brings to the table in terms of the transformations, from a software perspective, and the data and the expertise that we can bring to the table. The other piece is, we've got a huge opportunity via the Hitachi channel, which is where we're seeing the growth that we've had over the last couple of years. It's been really significant since we were acquired. And then the next piece is, how do we become part of that bigger Hitachi IOT strategy? And what's been starting to happen there is, as I mentioned before, you can kind of probably put the math together without giving anything away. But you think about capturing, being able to capture device data, being able to bring it into the digital twin, all of that. And then you think about, "Okay, and what if I added Pentaho to the mix?" That's pretty exciting. You bring those things together, and then you add a whole bunch of expertise and machine learning, and you're like, okay. You could start to do, you could start to see where the IOT piece of it is where we're really going to--
>> IOT is a forcing function, would you agree?
>> Yes, absolutely.
>> It's really forcing IT to go, "Whoa, this is coming down fast." And AI and machine learning, and cloud, is just forcing everyone.
>> Yeah, exactly. And when we came into the big data market, whatever it was, five years ago, in the early market, it's always hard to kind of get in there. But one of the things that we were able to do, when people were still just talking about BI, we would say, "Have you heard about this stuff called big data? It's going to be hard. You are going to have to take advantage of this." And the same thing is happening with IOT. So the fact that we can be in these environments where customers are starting to see the value of the machine-generated data, that's going to be--
>> And it's transformative for the business, like the community college example.
>> Totally transformative, yeah. The other one was, I think Eric might have mentioned, the IMS, where all of a sudden you're transforming the insurance industry. They're always looking at charts of, "I'm a 17-year-old kid," "Okay, your rate should be this because you're a 17-year-old boy." And now they're starting to track the driving, and say, "Well, actually, maybe not, maybe you get a discount."
>> Time for the self-driving car.
>> Transforming, yeah.
>> Well, Donna, I appreciate it. Give us a quick tease here on Pentaho World coming in October. I know it's super early, but you have a roadmap on the product side, so you can see a little bit around the corner. What are the things that you guys are beavering away at inside the product group?
>> Yeah, I think you're going to see some really cool innovations we're doing. I won't give too much away, but on the Spark side, and with execution engines in general, we're going to have some really interesting kind of innovative stuff coming. More on the machine learning coming out, and if you think about -- if data is, you know, the hard part, just think about applying machine learning to the data, and I think you can imagine some really cool things we're going to come up with.
>> We're going to need algorithms for the algorithms, machine learning for the machine learning, and, of course, humans to be smarter. Donna, thanks so much for sharing here inside theCUBE, appreciate it.
>> Thank you.
>> Pentaho, check them out.
Going to be at Pentaho World in October, as well, in theCUBE, and hopefully we can get some more deep dives in with their analyst group on what's going on with the engines of innovation there. More CUBE coverage live from Silicon Valley for Big Data SV, in conjunction with Strata Hadoop, I'm John Furrier. Be right back with more after this short break. (techno music)
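The predictive piece Donna mentions for Ivy Tech, scoring a student's tendency to drop out so the school can intervene, is at heart a standard classification task. A hedged scikit-learn sketch of that idea; the features, labels, and cutoff below are synthetic stand-ins, not anything from the actual deployment.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented features: [attendance_rate, gpa, credits_attempted]
X = np.array([[0.95, 3.6, 15], [0.40, 2.1, 9], [0.75, 2.9, 12],
              [0.30, 1.8, 6],  [0.88, 3.2, 15], [0.55, 2.4, 9]])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = dropped out (synthetic labels)

model = LogisticRegression().fit(X, y)

# Score a new student and intervene above a chosen risk cutoff.
risk = model.predict_proba([[0.50, 2.3, 9]])[0, 1]
if risk > 0.5:
    print(f"risk={risk:.2f}: trigger an intervention, e.g. a workshop invite")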

Published Date : Mar 16 2017


Murthy Mathiprakasam, Informatica - Big Data SV 17 - #BigDataSV - #theCUBE


 

(electronic music) >> Announcer: Live from San Jose, California, it's The Cube, covering Big Data Silicon Valley 2017. >> Okay, welcome back everyone. We are live in Silicon Valley for Big Data Silicon Valley. Our companion show is Big Data NYC, in conjunction with Strata Hadoop, Big Data Week. Our next guest is Murthy Mathiprakasam, the director of product marketing at Informatica. Did I get it right? >> Murthy: Absolutely (laughing)! >> Okay (laughing), welcome back. Good to see you again. >> Good to see you! >> Informatica, you guys had Amit on earlier yesterday, kicking off our event. It is a data lake world out there, and the show theme has been, obviously beside a ton of machine learning-- >> Murthy: Yep. >> Which has been fantastic. We love that because that's a real trend. And IOT has been a subtext to the conversation and almost a forcing function. Every year the big data world is getting more and more pokes and levers off of Hadoop to a variety of different data sources, so a lot of people are taking a step back, and a protracted view of their landscape inside their own companies, and saying, Okay, where are we? So kind of a checkpoint in the industry. You guys do a lot of work with customers, your history with Informatica, and certainly over the past few years, the change in focus, certainly on the product side, has been kind of interesting. You guys have what looks to be a solid approach, an abstraction layer for data and metadata, to be the keys to the kingdom, but yet not locking it down, making it freely available, yet providing the governance and all that stuff. >> Murthy: Exactly. >> And my interview with Amit laid it all out there. But the question is what are the customers doing? I'd like to dig in, if you could share just some of the best practices. What are you seeing? What are the trends? Are they taking a step back? How is IOT affecting it? What's generally happening? >> Yeah, I know, great question. So it has been really, really exciting. It's been kind of a whirlwind over the last couple years, so many new technologies, and we do get the benefit of working with a lot of very, very innovative organizations. IOT is really interesting because up until now, IOT's always been sort of theoretical, you're like, what's the thing? >> John: Yeah. (laughing) What's this Internet of things? >> But-- >> And IT was always poo-pooing someone else's department (laughing). >> Yeah, exactly. But we actually have customers doing this now, so we've been working with automotive manufacturers on connected vehicle initiatives, pulling sensor data, been working with oil and gas companies, connected meters and connected energy, manufacturing, logistics companies, looking at putting meters on trucks, so they can actually track where all the trucks are going. Huge cost savings and service delivery kind of benefits from all this stuff, so you're absolutely right, IOT I think is finally becoming real. And we have a streaming solution that kind of works on top of all the open source streaming platforms, so we try to simplify everything, just like we have always done. We did that with MapReduce, with Spark, now with all the streaming technologies. You get a graphical approach where you can go in and say, Well, here's the kind of processing we want. You'd lay it out visually and it executes in the Hadoop cluster. >> I know you guys have done a great job with the product, it's been very complimentary you guys, and it's almost as if there's been a transformation within Informatica.
And I know you went private and everything, but a lot of good product shops there. You guys got a lot of good product guys, so I got to ask you the question. I see IOT sometimes as an operational technology component, usually running their own stacks, not even plugged into IT, so that's a whole other story. I'll get to that in a second. But the trend here is you have the batch world, companies that have been in this ecosystem here that are on the show floor, at O'Reilly Media, or talking to us on The Cube. Some have been just pure play batch-related! Then the fashionable streaming technologies have come out, but what's happened with Spark, you're starting to see the collision between batch and realtime-- >> Umm-hmm. >> Called streaming or what not. And at the center of that is the deep learning, it's the IOT, and it's the AI, that's going to be at the intersection of these two colliding forces, so you can't have a one-trick pony here and there. You got to kind of have a blended, more of a holistic, horizontal, scalable approach. >> Murthy: Yes. >> So I want to get your reaction to that. And two, what product gaps and organizational gaps and process gaps emerge from this trend? And what do you guys do? So, three-part question. >> Murthy: Yeah (laughing). >> Go ahead. Go ahead. >> I'll try to cover all three. >> So, first, the collision and your reaction to that trend. >> Murthy: Yeah, yeah. >> And then the gaps. >> Absolutely. So basically, you know Informatica, we've supported every kind of variation of these types of environments, and so we're not really a believer in it's this or that. It's not on premise or cloud, it's not realtime or batch. We want to make it simple, no matter how you want to process the data, or where you want to process it. So customers who use our platform for their realtime or streaming solutions are using the same interface as if they were doing it batched. We just run it differently under the hood. And so, that simplifies and makes a lot of these initiatives more practical, because you might start with a certain latency, and you think maybe it's okay to do it at one speed. Maybe you decide to change. It could be faster or slower, and you don't have to go through code rewrites and just starting completely from scratch. That's the benefit of the abstraction layer, like you were saying. And so, I think that's one way that organizations can shield themselves from the question, because why even pose that question in the first... Why is it either this or that? Why not have a system that you can actually tune, and maybe today you want to start batch, and tomorrow you evolve it to be more streaming and more realtime. Help me on the-- >> John: On the gaps-- >> Yes. >> Always product gaps because, again, you mentioned that you're solving it, and that might be an integration challenge for you guys. >> Yep. >> Or an integration solution for you guys, challenge, opportunity, whatever you guys want to call it. >> Absolutely! >> Organizational gaps, maybe not being set up for it, and then process. >> Right. I think it was interesting that we actually went out to dinner with a couple of customers last night. And they were talking a lot about the organizational stuff because the technology they're using is Informatica, so that part's easy. So, they're like, Okay, it's always the stuff around budgeting, it's around resourcing, skills gap, and we've been talking about this stuff for a long time, right. >> John: Yeah.
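Murthy's claim a moment earlier, one pipeline definition that runs batch today and streaming tomorrow with the engine swapped underneath, is the heart of the abstraction-layer argument. A toy Python sketch of the idea; this is not Informatica's actual API, and the transform and both runners are invented for illustration.

from typing import Callable, Iterable, Iterator

def clean(record: dict) -> dict:
    # The transformation logic is written once, independent of the engine.
    return {**record, "email": record["email"].strip().lower()}

def run_batch(records: Iterable[dict], step: Callable[[dict], dict]) -> list:
    # Batch mode: materialize the whole result set at once.
    return [step(r) for r in records]

def run_streaming(source: Iterator[dict], step: Callable[[dict], dict]) -> None:
    # Streaming mode: handle records as they arrive (in a real engine,
    # this loop would be a Kafka or Kinesis consumer).
    for r in source:
        print(step(r))

data = [{"email": "  A@Example.COM "}, {"email": "b@example.com "}]
print(run_batch(data, clean))      # start batch today...
run_streaming(iter(data), clean)   # ...switch to streaming tomorrow, same logic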
>> But it's fascinating, even in 2017, it's still a persistent issue, and part of their challenge is even the way IT projects have been funded in the past. You have this kind of waterfall-ish type of governance mechanism where you're supposed to say, Oh, what are you going to do over the next 12 months? We're going to allocate money for that. We'll allocate people for that. Like, what big data project takes 12 months? Twelve months from now you're going to have a completely (laughing) different stack that you're going to be working with. And so, their challenge is evolving into a more agile kind of model where they can go justify quick-hit projects that may have very unknown kind of business value, but it's just getting buy-in that... Hey, something might be discovered here? This is kind of an exploration use case, discovery, a lot of this IOT stuff, too. People are bringing back the sensor data, you don't know what's going to come out of that or (laughing)-- >> John: Yeah. >> What insights you're going to get. >> So there's-- >> Frequency, velocity, could be completely dynamic. >> Umm-hmm. Absolutely! >> So I think part of the best practice is being able to set aside this kind of notion of innovation where you have funding available for... Get a small cross-functional team together, so this is part of the other aspect of your question, which is organizationally, this isn't just IT. You got to have the data architects from IT, you got to have the data engineers from IT. You got to have data stewards from the line of business. You got business analysts from the line of business. Whenever you get these guys together-- >> Yeah. >> Small core team, and people have been talking about this, right. >> John: Yeah. >> Agile development and all that. It totally applies to the data world. >> John: And the cloud's right there, too, so they have to go there. >> Murthy: That's right! Exactly. So you-- >> So is the 12-month project model, the waterfall model, however you want... maybe 24 months more like it. But the problem on the fail side there is that when they wake up and ship, the world's changed, so there's kind of a diminishing return. Is that kind of what you're getting at there on that fail side? >> Exactly. It's all about failing fast forward and succeeding very quickly as well. And so, when you look at most of the successful organizations, they have radically faster project lifecycles, and this is all the more reason to be using something like Informatica, which abstracts all the technology away, so you're not mired in code rewrites and long development cycles. You just want to ship as quickly as possible, get the organization buy-in that, Hey, we can make this work! Here's some new insights that we never had before. That gets you the political capital-- >> John: Yeah. >> For the next project, the next project, and you just got to keep doing that over and over again. >> Yeah, yeah. I always call that agile more of a blank check in a safe harbor because, in case you fail forward, (laughing) I'm failing forward. (laughing) You keep your job, but there's some merit to that. But here's the trick question for you: Now let's talk about hybrid. >> Umm-hmm. >> On prem and cloud. Now, that's the real challenge. What are you guys doing there because now I don't want to have a job on prem. I don't want to have a job on the cloud. That's not redundancy, that's inefficient, that's duplicates. >> Yes. >> So that's an issue. So how do you guys tee it up there for the customer?
And what's the playbook for them, and people who are scratching their heads saying, I want on prem. And Oracle got this right. Their earnings came out pretty good, same code on prem, off prem, same code base. So workloads can move depending upon the use cases. >> Yep. >> How do you guys compare? >> Actually that's the exact same approach that we're taking because, again, it's all about that the customer shouldn't have to make the either-or-- >> So for you guys, the interface and code are the same on prem and cloud. >> That's right. So you can run our big data solutions on Amazon, Microsoft, any kind of cloud Hadoop environment. We can connect to data sources that are in the cloud, so different SAAS apps. >> John: Umm-hmm. >> If you want to suck data out of there. We got all the out-of-the-box connectivity to all the major SAAS applications. And we can also actually leverage a lot of these new cloud processing engines, too. So we're trying to be the abstraction layer, so now it's not just about Spark and Spark streaming, there's all these new platforms that are coming out in the cloud. So we're integrating with that, so you can use our interface and then push down the processing to a cloud data processing system. So there's a lot of opportunity here to use cloud, but, again, we don't want to be... We want to make things more flexible. It's all about enabling flexibility for the organization. So if they want to go cloud, great. >> John: Yep. >> There's plenty of organizations that if they don't want to go cloud, that's fine, too. >> So if I get this right, standard interface on prem and cloud for the usability, under the hood it's integration points in clouds, so that data sources, whatever they are and through whatever, could be Kinesis coming off Amazon-- >> Exactly! >> Into you guys, or Azure's got some stuff-- >> Exactly! >> Over there, that all works under the hood. >> Exactly! >> Abstracts from the user. >> That's right! >> Okay, so the next question is, okay, to go that way, that means it's a multicloud world. You probably agree with that. Multicloud meaning, I'm a customer. I might have multiple workloads on multiple clouds. >> That's where it is today. I don't know if that's the endgame? And obviously all this is changing very, very quickly. >> Okay (laughing). >> So I mean, Informatica, we're neutral across multiple vendors and everything. So-- >> You guys are Switzerland. >> We're the Switzerland (laughing), so we work with all the major cloud providers, and there's new ones that we're constantly signing up also, but it's unclear how the market will shake out. >> Umm-hmm. >> There's just so much information out there. I think it's unlikely that you're going to see mass consolidation. We all know who the top players are, and I think that's where a lot of large enterprises are investing, but we'll see how things go in the future, too. >> Where should customers spend their focus, because you're seeing the clouds. I was just commenting about Google yesterday, with Amit, AI, and others. They're trying to be enterprise-ready. You guys are very savvy in the enterprise, there's a lot of table stakes, SLAs to integration points, and so, there's some clouds that aren't ready for prime time, like Google for the enterprise. Some are getting there fast, like Amazon; Azure's super enterprise-friendly. They have their own problems and opportunities. But they are very strong on the enterprise. What do you guys advise customers? What are they looking at right now?
Where should they be spending their time, writing more code, scripts, or tackling the data? How do you guys help them shift their focus? >> Yeah, yeah! >> And where-- >> And definitely not scripts (laughing). >> It's about the worst thing you can do because... And it's all for all the reasons we understand. >> Why is that? >> Well, again, we were talking about being agile. There's nothing agile about manually sitting there, writing Java code. Think about all the developers that were writing MapReduce code three or four years ago (laughing). Those guys, well, they're probably looking for new jobs right now. And the companies who built that code, they're rewriting all of it. So that approach of doing things at the lowest possible level doesn't make engineering sense. That's why the kind of abstraction layer approach makes so much better sense. So where should people be spending their time? It's really... The one thing technology cannot do is it can't substitute for context. So that's business context, understanding if you're in healthcare there's things about the healthcare industry that only that healthcare company could possibly know, and know about their data, and why certain data is structured the way it is. >> John: Yeah. >> Or financial services or retail. So business context is something that only that organization can possibly bring to the table, and organizational context, as you were alluding to before, roles and responsibilities, who should have access to data, who shouldn't have access to data. That's also something that can't be prescribed from the outside. It's something that organizations have to figure out. Everything else under the hood, there's no reason whatsoever to be mired in these long code cycles. >> John: Yeah. >> And then you got to rewrite it-- >> John: Yeah. >> And you got to maintain it. >> So automation is one level. >> Yep. >> Machine learning is a nice bridge, between taking advantage of either vertical data, or especially, data for that context. >> Yep. >> But then the human has to actually synthesize it. >> Right! >> And apply it. That's the interface. Did I get that right, that progression? >> Yeah, yeah. Absolutely! And the reason machine learning is so cool... And I'm glad you segued into that. Is that, so it's all about having the machine learning assist the human, right. So the humans don't go away. We still have to have people who understand-- >> John: Okay. >> The business context and the organizational context. But what machine learning can do is in the world of big data... Inherently, the whole idea of big data is that there's too much data for any human to mentally comprehend. >> John: Yeah. >> Well, you don't have to mentally comprehend it. Let the machine learning go through it, so we've got this unique machine learning technology that will actually scan all the data inside of Hadoop and outside of Hadoop, and it'll identify what the data is-- >> John: Yeah. >> Because it's all just pattern matching and correlations. And most organizations have common patterns to their data. So we figure out all this stuff, and we can say, Oh, you got credit card information here. Maybe you should go look at that, if that's not supposed to be there (laughing). Maybe there's a potential violation there? So we can focus the manual effort onto the places where it matters, so now you're looking at issues, problems, instead of doing the day-to-day stuff.
The day-to-day stuff is fully automated, and that's not what organizations-- >> So the guys that are losing their jobs, those Java developers writing scripts to do the queries, where should they be focusing? Where should they look for jobs? Because I would agree with you that their jobs could be at risk, because the MapReduce guys and all the script guys and the Java guys... Java has always been the bulldozer of programming languages, very functional. >> Murthy: Yep. >> But where do those guys go? What's your advice for... We have a lot of friends, I'm sure you do, too. I know a lot of friends who are Java developers who are awesome programmers. >> Yeah. >> Where should they go? >> Well, so first, I'm not saying that Java's going to go away, obviously (laughing). But I think Java-- >> Well, I mean, Java guys who are doing some of the payload stuff around some of the deep-- >> Exactly! >> In the bowels of big data. >> That's right! Well, there's always things that are unique to the organization-- >> Yeah. >> Custom applications, so all that stuff is fine. What we're talking about is like MapReduce coding-- >> Yeah, what should they do? What should those guys be focusing on? >> So it's just like every other industry you see. You go up the value stack, right. >> John: Right. >> So if you can become more of the data governor, the data steward, look at policy, look at how you should be thinking about organizational context-- >> John: And governance is also a good area. >> And governance, right. Governance jobs are just going to explode here because somebody has to define it, and technology can't do this. Somebody has to tell the technology what data is good, what data is bad, when do you want to get flagged if something is going wrong, when is it okay to send data through. Whoever decides and builds those rules, that's going to be a place where I think there's a lot of opportunities. >> Murthy, final question. We got to break, we're getting the hook sign here, but we got Informatica World coming up soon in May. What's going to be on the agenda? What should we expect to hear? What are some of the themes that you could tease a little bit, get people excited. >> Yeah, yeah. Well, one thing we want to do is really provide a lot of content around the journey to the cloud. And we've been talking today, too, there's so many organizations who are exploring the cloud, but it's not easy, for all the reasons we just talked about. Some organizations want to just kind of break away, take out, rip out everything in IT, move all their data and their applications to the cloud. Some of them are taking more of a progressive journey. So we got customers who've been on the leading front of that, so we'll be having a lot of sessions around how they've done this, best practices that they've learned. So hopefully, it's a great opportunity for both our current audience, who's always looked to us for interesting insights, but also all these kind of emerging folks-- >> Right. >> Who are really trying to figure out this new world of data. >> Murthy, thanks so much for coming on The Cube. Appreciate it. Informatica World coming up. You guys have a great solution, and again, making it easier (laughing) for people to get the data and put those new processes in place. This is The Cube breaking it down for Big Data SV here in conjunction with Strata Hadoop. I'm John Furrier. More live coverage after this short break. (electronic music)
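The scanning Murthy describes, pattern matching across data to flag things like stray credit card numbers, can be approximated with a regular expression plus a Luhn checksum. A generic toy version of that technique, not Informatica's implementation:

import re

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(number: str) -> bool:
    # Standard Luhn checksum, which real card numbers satisfy.
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(d if i % 2 == 0 else (d * 2 - 9 if d * 2 > 9 else d * 2)
                for i, d in enumerate(digits))
    return total % 10 == 0

def flag_card_numbers(values: list) -> list:
    # Keep only values that look like card numbers and pass the checksum.
    return [v for v in values if (m := CARD_RE.search(v)) and luhn_ok(m.group())]

sample = ["order 12345", "4111 1111 1111 1111", "call 555-0199"]
print(flag_card_numbers(sample))  # -> ['4111 1111 1111 1111']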

Published Date : Mar 15 2017


Stephanie McReynolds, Alation & Lee Paries, Think Big Analytics - #BigDataSV - #theCUBE


 

>> Voiceover: San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017. (techno music) >> Hey, welcome back everyone. Live in Silicon Valley for Big Data SV. This is theCUBE coverage in conjunction with Strata + Hadoop. I'm John Furrier with George Gilbert at Wikibon. Two great guests. We have Stephanie McReynolds, Vice President at startup Alation, and Lee Paries, who is the VP of Think Big Analytics. Thanks for coming back. Both been on theCUBE, you have been on theCUBE before, but Think Big has been on many times. Good to see you. What's new, what are you guys up to? >> Yeah, excited to be here and to be here with Lee. Lee and I have a personal relationship that goes back quite a ways in the industry. And then what we're talking about today is the integration between Kylo, which was recently announced as an open source project from Think Big, and Alation's capability to sit on top of Kylo, and together to increase the velocity of data lake initiatives, kind of going from zero to 60 in a pretty short amount of time to get both technical value from Kylo and business value from Alation. >> So talk about Alation's traction, because you guys have been an interesting startup, a lot of great press. George is a big fan. He's going to jump in with some questions, but some good product fit with the market. What's the update? What's the status on the traction in terms of the company and customers and whatnot? >> Yeah, we've been growing pretty rapidly for a startup. We've doubled our production customer count from last time we talked. Some great brand names. Munich Reinsurance this morning was talking about their implementation. So they have 600 users of Alation in their organization. We've entered Europe, not only with Munich Reinsurance but Tesco is a large account of ours in Europe now. And here in the States we've seen broad adoption across a wide range of industries, everyone from Pfizer in the healthcare space to eBay, who's been our longest standing customer. They have about 1,000 weekly users on Alation. So not only a great increase in number of logos, but also organic growth internally at many of these companies across data scientists, data analysts, business analysts, a wide range of users of the product, as well. >> It's been interesting. What I like about your approach, and we talked with Think Big about it before, every guest that has come in so far that's been in the same area is talking about metadata layers, and so this is interesting, there's a metadata addressability, if you will, for lack of a better description, but yet it has to be human-usable, integrating into human processes, whether it's virtualization, or any kind of real time app or anything. So you're seeing this convergence between I need to get the data into an app, whether it's IoT data or something else, really really fast, so really kind of the discovery piece is now the interesting layer. How competitive is it, and what are the different solutions that you guys see in this market? >> Yeah, I think it's interesting, because metadata has kind of had a revival, right? Everyone is talking about the importance of metadata and open integration with metadata. I think really our angle as Alation is that having open transfer of technical metadata is very important for the foundation of analytics, but what really brings that technical metadata to life is also understanding what is the business context of what's happening technically in the system? What's the business context of data?
What's the behavioral context of how that data has been used that might inform me as an analyst? >> And what's your unique approach to that? Because that's like the Holy Grail. It's like translating geek metadata, indexing stuff, into like usable business outcomes. It's been a cliche for years, you know. >> The approach is really based on machine learning and AI technology to make recommendations to business users about what might be interesting to them. So we're at a state in the market where there is so much data that is available and that you can access, either in Hadoop as a data lake or in a data warehouse in a database like Teradata, that today what you need as state of the art is the system to start to recommend to you what might be interesting data for you to use as a data scientist or an analyst, and not just what's the data you could use, but how accurate is that data, how trustworthy is it? I think there's a whole nother theme of governance that's rising that's tied to that metadata discussion, which is it's not enough to just shove bits and bytes between different systems anymore. You really need to understand how has this data been manipulated and used, and how does that influence my security considerations, my privacy considerations, the value I'm going to be able to get out of that data set? >> What's your take on this, 'cause you guys have a relationship. How is Think Big doing? Then talk about the partnership you guys have with Alation. >> Sure, so I mean when you look at what we've done, specifically with an open source project, it's the first one that Teradata has fully sponsored and released based on Apache 2.0, called Kylo. It's really about the enablement of the full data lake platform and the full framework, everywhere from ingest, to securing it, to governing it, and part of that process is collecting the basic technical and business metadata, so later you can hand it over to the user so they could sample, they could profile the data, they can find, they can search in a Google-like manner, and then you can enable the organization with that data. So when you look at it from a standpoint of partnering together, it's really about collecting that data specifically within Hadoop to enable it, yet with the ability then to hand it off to more of an enterprise-wide solution like Alation through API connections that connect to that, and then for them they enrich it in a way that they go about it with the social collaboration and the business to extend it from there. >> So that's the accelerant then. So you're accelerating the open source project through this, with Alation. So you're still going to rock and roll with the open source. >> Very much going to rock and roll with the open source. So it's really been based on five years of Think Big's work in the marketplace over about 150 data lakes. The IP we've built around that to do things repeatedly, consistently, and then releasing that in the last two years, dedicated development based on Apache Spark and NiFi to stand that up. >> Great work by the way. Open source continues to be more relevant. But I got to get your perspective on a meme that's been floating around day one here, and maybe it's because of the election, but someone said, "We got to drain the data swamp, "and make data great again."
And not a play on Trump, but the data lake is going through a transition and saying, "Okay, we've got data lakes," but now this year there's been a focus on making that much more active and cleaner and making sure it doesn't become a swamp, if you will. So there's been a focus on taking data lake content and getting it into real time, and IoT has kind of, I think, been a forcing function. But you guys, do you guys have a perspective on that, on where data lakes are going? Certainly it's been a trending conversation here at the show. >> Yeah, I think IoT has been part of draining that data swamp, but I think also now you have a mass of business analysts that are starting to get access to that data in the lake. These Hadoop implementations are maturing to the stage where you have-- >> John: To value coming out of it. >> Yeah, and people are trying to wring value out of that lake, and sometimes finding that it is harder than they expected because the data hasn't been pre-prepared for them. This old world of IT would pre-prepare the data, and then I got a single metric or I got a couple metrics to choose from, is now turned on its head. People are taking a more exploratory, discovery oriented approach to navigating through their data and finding that the nuances of data really matter when trying to evolve an insight. So the literacy in these organizations and their awareness of some of the challenges of a lake are coming to the forefront, and I think that's a healthy conversation for us all to have. If you're going to have a data driven organization, you have to really understand the nuances of your data to know where to apply it appropriately to decision making. >> So (mumbles) actually going back quite a few years when he started at Microsoft said, Internet software has changed the paradigm so much in that we have this new set of actions where it was discover, learn, try, buy, recommend, and it sounds like as a consumer of data in a data lake we've added or prepended this discovery step. Where in a well curated data warehouse it was learn, you had your X dimensions that were curated and refined, and you don't have that as much with the data lake. I guess I'm wondering, it's almost like if you're going to take, as we were talking to the last team with AtScale and moving OLAP to be something you consume on a data lake the way you consume on a data warehouse, it's almost like Alation and a smart catalog is as much a requirement as a visualization tool is by itself on a data warehouse? >> I think what we're seeing is this notion of data needing to be curated, and including many brains and many different perspectives in that curation process is something that's defining the future of analytics and how people use technical metadata, and what does it mean for the devops organization to get involved in draining that swamp? That means not only looking at the elements of the data that are coming in from a technical perspective, but then collaborating with the business to curate the value on top of that data. >> So in other words it's not just to help the user, the business analyst, navigate, but it's also to help the operational folks do a better job of curating once they find out who's using it, who's using the data and how. >> That's right. They kind of need to know how this data is going to be used in the organization. The volumes are so high that they couldn't possibly curate every bit and byte that is stored in the data lake.
So by looking at how different individuals in the organization and different groups are trying to access that data, that gives an early signal to where we should be spending more time or less time in processing this data, and helping the organization really get to their end goals of usage. >> Lee, I want to ask you a question. On your blog post that was pointed out to me earlier, you guys quote a Gartner stat, which is pretty doom and gloom, which said, "70% of Hadoop deployments in 2017 "will fail to deliver their estimated cost savings "or their predicted revenue." And then it says, "That's a dim view, "but not shared by the Kylo community." How are you guys going to make the Kylo data lake software work well? What are your thoughts on that? Because I think people, that's the number one, again, question that I highlighted earlier is okay, I don't want a swamp, so that's fear, whether they get one or not, so they worry about data cleansing and all these things. So what's Kylo doing that's going to accelerate, or lower that number of fails, in the data lake world? >> Yeah sure, so again, a lot of it's through experience of going out there and seeing what's done. A lot of people have been doing a lot of different things within the data lakes, but when you go in there there's certain things they're not doing, and then when you're doing them it's about doing them consistently and continually improving upon that, and that's what Kylo is. It's really a framework that we keep adding to, and as the community grows and other projects come in that can enhance it, we bring the value. But a lot of times when we go in, it's basically end users can't get to the data, either one because they're not allowed to, because maybe it's not secured and reliable enough to turn it over to them and let them drive with it, or they don't know the data is there, which goes back to collecting the basic metadata and data (mumbles) to know it's there to leverage it. So a lot of times it's going back and looking at and leveraging what we have to build that solid foundation, so IT and operations can feel like they can hand that over in a template format so business users could get to the data and start acting off of that.
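The ingest-time collection Lee describes, capturing basic technical and business metadata as files land so users can later find, sample, and profile the data, can be pictured with a small sketch. The record layout below is hypothetical, not Kylo's actual schema.

import csv, hashlib, os, time

def profile_csv(path, description):
    # Collect basic technical metadata at ingest so the file is
    # searchable later (the "don't know the data is there" problem).
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    with open(path, "rb") as f:
        checksum = hashlib.sha256(f.read()).hexdigest()[:12]
    return {
        "path": path,
        "columns": header,
        "row_count": row_count,
        "size_bytes": os.path.getsize(path),
        "checksum": checksum,
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "description": description,   # the business-metadata half
    }

# Tiny demo file so the sketch is self-contained.
with open("demo.csv", "w") as f:
    f.write("id,amount\n1,10.5\n2,7.25\n")
print(profile_csv("demo.csv", "sample payments extract"))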
We've had a longstanding relationship with Teradata as a partner, so it's a great way to work together. >> Thanks for coming on theCUBE. Really appreciate it, and thank... What do you think of the show you guys so far? What's the current vibe of the show? >> Oh, it's been good so far. I mean, it's one day into it, but very good vibe so far. Different topics and different things-- >> AI machine learning. You couldn't be more happier with that machine learning-- >> Great to see machine learning taking a forefront, people really digging into the details around what it means when you apply it. >> Stephanie, thanks for coming on theCUBE, really appreciate it. More CUBE coverage after the show break. Live from Silicon Valley, I'm John Furrier with George Gilbert. We'll be right back after this short break. (techno music)

Published Date : Mar 15 2017


Ravi Dharnikota, SnapLogic & Katharine Matsumoto, eero - Big Data SV 17 - #BigDataSV - #theCUBE


 

>> Announcer: Live from San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017. (light techno music) >> Hey, welcome back everybody. Jeff Frick here with theCUBE. We're at Big Data SV, wrapping up with two days of wall-to-wall coverage of Big Data SV, which is associated with Strata Conf, which is part of Big Data Week, which always becomes the epicenter of the big data world for a week here in San Jose. We're at the historic Pagoda Lounge, and we're excited to have our next two guests, talking about a little bit different twist on big data that maybe you hadn't thought of. We've got Ravi Dharnikota, he is the Chief Enterprise Architect at SnapLogic, welcome. - Hello. >> Jeff: And he has brought along a customer, Katharine Matsumoto, she is a Data Scientist at eero, welcome. >> Thank you, thanks for having us. >> Jeff: Absolutely, so we had SnapLogic on a little earlier with Gaurav, but tell us a little bit about eero. I've never heard of eero before, for folks that aren't familiar with the company. >> Yeah, so eero is a start-up based in San Francisco. We are sort of driven to increase home connectivity, both the performance and the ease of use, as wifi becomes totally a part of everyday life. To do that, we've created the world's first mesh wifi system. >> Okay. >> So that means you have, for an average home, three different individual units, and you plug one in to replace your router, and then the other three get plugged in throughout the home just to power, and they're able to spread coverage, reliability, speed, throughout your homes. No more buffering, dead zones, in that way-back bedroom. >> Jeff: And it's a consumer product-- >> Yes. >> So you got all the fun and challenges of manufacturing, you've got the fun challenges of distribution, consumer marketing, so a lot of challenges for a start-up. But you guys are doing great. Why SnapLogic? >> Yeah, so in addition to the challenges with the hardware, we're also really strong on software. So, everything is set up via the app. We are not just the backbone to your home's connectivity, but also part of it, so we're sending a lot of information back from our devices to be able to learn and improve the wifi that we're delivering based on the data we get back. So that's a lot of data, a lot of different teams working on different pieces. So when we were looking at launch, how do we integrate all of that information together to make it accessible to business users across different teams, and also how do we handle the scale. I made a checklist (laughs), and SnapLogic was really the only one that seemed to be able to deliver on both of those promises, with a look to the future of like, I don't know what my next SaaS product is, I don't know what our next API point we're going to need to hit is, sort of the flexibility of that, as well as the fact that we have analysts who were able to pick it up, engineers were able to pick it up, and I could still manage all the software written by, or the pipelines written by, each of those different groups without having to read whatever version of code they're writing. >> Right, so Ravi, we heard you guys are like doubling your customer base every year, and lots of big names, Adobe we talked about earlier today. But I don't know that most people would think of SnapLogic really, as a solution to a start-up mesh network company.
>> Yeah, absolutely, so that's a great point though, let me just start off with saying that in this new world, we don't discriminate-- (guest and host laugh) we integrate and we don't discriminate. This new world that I speak about is social media, you know-- >> Jeff: Do you bus? (all laugh) >> So I will get to that. (all laugh) So, social, mobile, analytics, and cloud. And in this world, people have this thing which we fondly call integrators' dilemma. You want to integrate apps, you go to a different tool set. You integrate data, you start thinking about different tool sets. So we want to dispel that and really provide a unified platform for both apps and data. So remember, when we are seeing all the apps move into the cloud and being provided as services, but the data systems are also moving to the cloud. You got your data warehouses, databases, your BI systems, analytical tools, all are being provided to you as services. So, in this world data is data. If it's apps, it's probably schema mapping. If it's data systems, it's transformations moving from one end to the other. So, we're here to solve both those challenges in this new world with a unified platform. And it also helps that our lineage and the brain trust that brings us here, we did this a couple of decades ago and we're here to reinvent that space. >> Well, we expect you to bring Clayton Christensen on next time you come to visit, because he needs a new book, and I think that's a good one. (all laugh) But I think it was a really interesting part of the story though too, is you have such a dynamic product. Right, if you looked at your boxes, I've got the website pulled up, you wouldn't necessarily think of the dynamic nature that you're constantly tweaking and taking the data from the boxes to change the service that you're delivering. It's not just this thing that you made to a spec that you shipped out the door. >> Yeah, and that's really where the auto-update comes in. We did 20 firmware updates last year. We had problems where customers would have the same box for three years, and the technology changes, the chips change, but their wifi service is the same, and we're constantly innovating and being able to push those out, but if you're going to do that many updates, you need a lot of feedback on the updates, because things break when you update sometimes, and we've been able to build systems that catch that, that are able to identify changes that, say, no one person would be able to catch by looking at their own things or just with support. We have leading indicators across all sorts of different stability and performance and different devices, so if Xbox changes their protocols, we can identify that really quickly. And that's sort of the goal of having all the data in one place across customer support and manufacturing. We can easily pinpoint where in the many different complicated factors you can find the problem. >> Have issues. - Yeah. >> So, I've actually got questions for both of you. Ravi, starting with you, it sounds like you're trying to tackle a challenge that in today's tools would have included Kafka at the data integration level, and there it's very much a hub and spoke approach. And I guess it's also, you would think of the application level integration more like the TIBCO and other EAI vendors in a previous generation-- - [Ravi] Yeah. >> Which I don't think was hub and spoke, it was more point to point, and I'm curious how you resolve that, in other words, how you'd tackle both together in a unified architecture?
>> Yeah, that's an excellent question. In fact, one of the integrators' dilemmas that I spoke about: you've got the problem set where you've got the high-latency, high-volume, where you go to ETL tools. And then the low-latency, low-volume, you immediately go to the TIBCOs of the world, and that's the ESB, EAI sort of tool set that you look to solve it with. So what we've done is we've thought about it hard. At one level we've just said, why can integration not be offered as a service? So that's step number one, where the design experience is through the cloud, and then execution can just happen anywhere, behind your firewall or in the cloud, or in a big data system, so it caters to all of that. But then also, the data set itself is changing. You're seeing a lot of the document data model that is being offered by the SaaS services. So the old ETL companies that were built before all of this social, mobile sort of stuff came around, it was all row and column oriented. So how do you deal with the more document oriented JSON sort of stuff? And we built the platform to be able to handle that kind of data. Streaming is an interesting and important question. Pretty much everyone I spoke to last year were, streaming was a big-- let's do streaming, I want everything in real-time. But batch also has its place. So you've got to have a system that does batch as well as real-time, or as near real-time as needed. So we solve for all of those problems. >> Okay, so Katharine, coming to you, each customer has a different, well, every consumer has a different, essentially, an install base. To bring all the telemetry back to make sense out of what's working and what's not working, or how their environment is changing. How do you make sense out of all that, considering that it's not B to B, it's B to C, so I don't know how many customers you have, but it must be in the tens or hundreds. >> I'm sure I'm not allowed to say (laughs). >> No. But it's the distinctness of each customer that I gather makes the support challenge for you. >> Yeah, and part of that's exposing as much information to the different sources, and starting to automate the ways in which we do it. There's certainly a lot, we are very early on as a company. We've hit our year mark for public availability at the end of last month so-- >> Jeff: Congratulations. >> Thank you, it's been a long year. But with that we learn more, constantly, and different people come to different views as different new questions come up. The special-snowflake aspect of each customer, there's a balance between how much actually is special and how much you can find patterns. And that's really where you get into much more interesting things on the statistics and machine learning side, is how do you identify those patterns that you may not even know you're looking for. We are still beginning to understand our customers from a qualitative standpoint. It actually came up this week where I was doing an analysis and I was like, this population looks kind of weird, and with two clicks was able to send out a list over to our CX team. They had access to all the same systems because all of our data is connected, and they could pull up the tickets based on, because through SnapLogic, we're joining all the data together. We use Looker as our BI tool, they were just able to start going into all the tickets and doing a deep dive, and that's being presented later this week as to like, hey, what is this population doing?
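The leading indicators Katharine mentioned a moment ago, fleet-wide stability metrics that surface a bad firmware update or a protocol change quickly, can be caricatured as a before-and-after drift check. All numbers below are invented, and a real system would track many metrics with sturdier statistics.

from statistics import mean, stdev

# Invented disconnects-per-day per device, before and after an update.
before = [0.8, 1.1, 0.9, 1.0, 1.2, 0.7, 1.0, 0.9]
after = [1.0, 2.4, 1.1, 2.9, 1.2, 2.6, 1.0, 2.8]

mu, sigma = mean(before), stdev(before)
z = (mean(after) - mu) / sigma  # how far the fleet drifted, in old-noise units
if z > 3:
    print(f"regression signal: z={z:.1f}, investigate or roll back the update")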
>> So, for you to do this, that must mean you have at least some data that's common to every customer. For you to be able to use something like Looker, I imagine. If every customer was a distinct snowflake, it would be very hard to find patterns across them. >> Well I mean, look at how many people have iPhones, have MacBooks, you know, we are looking at a lot of aggregate-level data in terms of how things are behaving, and always the challenge of any data science project is creating those feature extractions, and so that's where the process we're going through as the analytics team is to start extracting those things and adding them to our central data source. That's one of the areas also where having very integrated analytics and ETL has been helpful, as we're just feeding that information back in to everyone. So once we figure out, oh hey, this is how you differentiate small businesses from homes, because we do see a couple of small businesses using our product, that goes back into the data and now everyone's consuming it. Each of those common features, it's a slow process to create them, but it also increases the value every time you add one to the central group. >> One last question-- >> It's an interesting way to think of the wifi service and the connected devices as an integration challenge, as opposed to just an appliance that kind of works like an old POTS line, which it isn't, clearly at all. (all laugh) With 20 firmware updates a year (laughs). >> Yeah, there's another interesting point, that we were just having the discussion offline. It's that it's a start-up. They obviously don't have the resources of a big IT department to set up these systems. So, as Katharine mentioned, a one-person team initially when they started, and to be able to integrate, who knows which system is going to be next. Maybe they experiment with one cloud service, it perhaps scales to their liking or not, and then they quickly change and go to another one. You cannot change the integration underneath that. You got to be able to adjust to that. So that flexibility, and the other thing is, what they've done with having their business become self-sufficient is another very fascinating thing. It's like, give them the power. Why should IT or that small team become the bottleneck? Don't come to me, I'll just empower you with the right tool set and the patterns, and then from there, you change and put in your business logic and be productive immediately. >> Let me drill down on that, 'cause my understanding, at least in the old world, was that ETL was kind of brittle, and if you're constantly ... Part of actually, the genesis of Hadoop, certainly at Yahoo was, we're going to bring all the data we might ever possibly need into the repository so we don't have to keep re-writing the pipeline. And it sounds like you have the capability to evolve the pipeline rather quickly as you want to bring more data into this sort of central resource. Am I getting that about right?
We've really focused on empowering business users to be self-service, in terms of answering their own questions, and that's freed up our analysts to add more value back into the greater group, as well as answer harder questions that both beget more questions but also feed back insights into that data source, because they have access to their piece of that last business logic. By changing the way that one JSON field maps, or combining two, they've suddenly created an entirely new variable that's accessible to everyone. So it's sort of last-leg business logic versus the full transport layer. We have a whole platform that's designed to transport everything and be much more robust to changes. >> Alright, so let me make sure I understand this, it sounds like the less-trained or more self-sufficient, they go after the central repository, and then the more highly-trained and scarcer resource, they are responsible for owning one or more of the feeds, and they enrich that or make that more flexible and general-purpose so that those who are more self-sufficient can get at it in the center. >> Yeah, and also you're able to make use of the business context. So we have sort of a hybrid model with our analysts that are really closely embedded into the teams, and so they have all that context that you need, whereas if you're relying on, say, a central IT team, you have to go back and forth of like, why are you doing this, what does this mean? They're able to do all that in the logic. And then the goal of our platform team is really to focus on building technologies that complement what we have with SnapLogic, or others that are custom to our data systems, that enable that same sort of level of self-service for creating specific definitions, or are able to do it intelligently based on agreed-upon patterns of extraction. >> George: Okay. >> Heavy science. Alright, well unfortunately we are out of time. I really appreciate the story, I love the site, I'll have to check out the boxes, because I know I have a bunch of dead spots in my house. (all laugh) But Ravi, I want to give you the last word, really about how is it working with a small start-up doing some cool, innovative stuff, but it's not your Adobes, it's not a lot of the huge enterprise clients that you have. What have you taken from it, why does that add value to SnapLogic to work with kind of a cool, fun, small start-up? >> Yeah, so the enterprise is always a retrofit job. You have to sort of go back to the SAPs and the Oracle databases and make sure that we are able to connect the legacy with a new cloud application. Whereas with a start-up, it's all new stuff. But their volumes are constantly changing, they probably have spikes, they have burst volumes, they're thinking about this differently, enabling everyone else, quickly changing and adopting newer technologies. So we have to be able to adjust to that agility along with them. So we're very excited to be sort of partnering with them and going along with them on this journey. And as they start looking at other things, the machine learning and the AI and the IoT space, we're very excited to have that partnership and learn from them and evolve our platform as well. >> Clearly. You're smiling ear-to-ear, Katharine's excited, you're solving problems. So thanks again for taking a few minutes and good luck with your talk tomorrow. Alright, I'm Jeff Frick, he's George Gilbert, you're watching theCUBE from Big Data SV. We'll be back after this short break. Thanks for watching. (light techno music)

Published Date : Mar 15 2017

Darren Chinen, Malwarebytes - Big Data SV 17 - #BigDataSV - #theCUBE


 

>> Announcer: Live from San Jose, California, it's The Cube, covering Big Data Silicon Valley 2017. >> Hey, welcome back everybody. Jeff Frick here with The Cube. We are at Big Data SV in San Jose at the Historic Pagoda Lounge, part of Big Data week, which is associated with Strata + Hadoop. We've been coming here for eight years and we're excited to be back. The innovation and dynamism of big data, and evolutions now with machine learning and artificial intelligence, just continues to roll, and we're really excited to be here talking about one of the nasty aspects of this world, unfortunately, malware. So we're excited to have Darren Chinen. He's the senior director of data science and engineering from Malwarebytes. Darren, welcome. >> Darren: Thank you. >> So for folks that aren't familiar with the company, give us just a little bit of background on Malwarebytes. >> So Malwarebytes is basically next-generation anti-virus software. We started from humble roots, with our founder at 14 years old getting infected with a piece of malware, and he reached out into the community and, at 14 years old, with the help of some people, wrote his first lines of code to remediate a couple of pieces of malware. It grew from there, and I think by the ripe old age of 18 he founded the company. And he's now I want to say 26 or 27 and we're doing quite well. >> It was interesting, before we went live you were talking about his philosophy and how important that is to the company, and how it has now turned into really a strategic asset, that no one should have to suffer from malware, and he decided to really offer a solution for free to help people rid themselves of this bad software. >> Darren: That's right. Yeah, so Malwarebytes was founded under the principle that Marcin believes that everyone has the right to a malware-free existence, and so we've always offered a free version of Malwarebytes that will help you to remediate if your machine does get infected with a piece of malware. And that's actually still going to this day. >> And that's now given you the ability to have a significant amount of endpoint data, transactional data, trend data, that now you can bake back into the solution. >> Darren: That's right. It's turned into a strategic advantage for the company, not something, I don't think, that we could have planned at 18 years old when he was doing this. But we've instrumented it so that we can get some anonymous-level telemetry and we can understand how malware proliferates. For many, many years we've been positioned as a second-opinion scanner, and so we're able to see a lot of things, some trends happening in there, and we can actually now see that in real time. >> So, starting out as a second-opinion scanner, you're basically looking at, you're finding what others have missed. And how can you, what do you have to do to become the first line of defense? >> Well, with our new product Malwarebytes 3.0, I think some of that landscape is changing. We have a very complete and layered offering. I'm not the product manager, so as the data science guy I don't know that I'm qualified to give you the ins and outs, but I think some of that is changing as we have, we've combined a lot of products and we have a much more complete suite of layered protection built into the product.
>> And so, maybe tell us, without giving away all the secret sauce, what sort of platform technologies did you use that enabled you to scale to these hundreds of millions of endpoints, and then to be fast enough at identifying things that were trending that are bad that you had to prioritize? >> Right, so traditionally, I think AV companies, they have these honeypots, right, where they go and collect a piece of virus or a piece of malware, and they'll take the MD5 hash of that and then they'll basically insert that into a definitions database. And that's a very exact way to do it. The problem is that there's so much malware or viruses out there in the wild, it's impossible to get all of them. I think one of the things that we did was we set up telemetry, and we have a phenomenal research team where we're able to actually have our team catch entire families of malware, and that's really the secret sauce to Malwarebytes. There's several other levels, but that's where we're helping out in the immediate term. What we do is we have, internally, we sort of jokingly call it a Lambda Two architecture. We had considered Lambda long ago, and "long ago" I say, it was about a year ago, when we first started this journey. But Lambda is riddled with, as you know, a number of issues. If you've ever talked to Jay Kreps from Confluent, he has a lot of opinions on that, right? And one of the key problems with that is that if you do a traditional Lambda, you have to implement your code in two places, it's very difficult, things get out of sync, you have to have replay frameworks. And these are some of the challenges with Lambda. So we do processing in a number of areas. The first thing that we did was we implemented Kafka to handle all of the streaming data. We use Kafka Streams to do inline stateless transformations, and then we also use Kafka Connect. And we write all of our data both into HBase, we use that, we may swap that out later for something like Redis, and that would be a thin speed layer. And then we also move the data into S3, and we use some ephemeral clusters to do very large-scale batch processing, and that really provides our data lake. >> When you call that Lambda Two, is that because you're still working essentially on two different infrastructures, so your code isn't quite the same? You still have to check the results on either fork. >> That's right, yeah, we didn't feel like it was, we did evaluate doing everything in the stream. But there are certain operations that are difficult to do with purely stream processing, and so we did need a little bit, we did need to have a thin, what we call real-time indicators, a speed layer, to supplement what we were doing in the stream. And so that's the differentiating factor between a traditional Lambda architecture, where you'd want to have everything in the stream and everything in batch, and the batch is really more of a truing mechanism, as opposed to, our real time is really directional. So in the traditional sense, if you look at traditional business intelligence, you'd have KPIs that would allow you to gauge the health of your business. We have RTIs, Real Time Indicators, that allow us to gauge directionally, what is important to look at this day, this hour, this minute? >> This thing is burning up the charts, >> Exactly. >> Therefore it's priority one. >> That's right, you got it. >> Okay.
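Kafka Streams itself is a JVM library, so purely as an illustration of the idea Darren describes, an inline stateless transformation applied to each event as it flows by, here is a rough Python sketch using the confluent-kafka client; the topic names and the normalization step are made up.

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "telemetry-normalizer",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["raw-telemetry"])  # hypothetical input topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Stateless, inline transformation: each record is normalized
        # on its own, with no batch window and no shared state.
        event["family"] = event.get("family", "unknown").lower()
        producer.produce("clean-telemetry", json.dumps(event).encode("utf-8"))
        producer.poll(0)  # serve delivery callbacks
finally:
    consumer.close()
    producer.flush()
```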
And maybe tell us a little more, because everyone I'm sure is familiar with Kafka, but the streams product from them is a little newer, as is Kafka Connect, so it sounds like you've got, it's not just the transport, but you've got some basic analytics, and you've got the ability to do the ETL because you've got Connect that comes from sources and destinations, sources and sinks. Tell us how you've used that. >> Well, the streams product is, it's quite different than something like Spark Streaming. It's not working off micro-batching, it's actually working off the stream. And the second thing is, it's not a separate cluster. It's just a library, effectively a .jar file, right? And so because it works natively with Kafka, it handles certain things there quite well. It handles back pressure, and when you expand the cluster, it's pretty good with things like that. We've found it to be a fairly stable technology. It's just a library, and we've worked very closely with Confluent to develop that. Whereas Kafka Connect is really something that we use to write out to S3. In fact, Confluent just released a direct S3 connector. We were using StreamX, which was a wrapper on top of an HDFS connector, and they rigged that up to write to S3 for us. >> So tell us, as you look out, what sorts of technologies do you see as enabling you to build a platform that's richer, and then how would that show up in the functionality consumers like we would see? >> Darren: With respect to the architecture? >> Yeah. >> Well one of the things that we had to do is we had to evaluate where we wanted to spend our time. We're a very small team, the entire data science and engineering team is less than, I think, 10 months old. So all of us got hired, we've started this platform, we've gone very, very fast. And we had to decide, okay, we've made this big investment, how are we going to get value to our end customer quickly, so that they're not waiting around and you get the traditional big-data story where, we've spent all this money and now we're not getting anything out of it. And so we had to make some of those strategic decisions, and because of the fact that the data was really truly big data in nature, there's just a huge amount of work that has to be done in these open-source technologies. They're not baked, it's not like going out to Oracle and giving them a purchase order and you install it and away you go. There's a tremendous amount of work, and so we've made some strategic decisions on what we're going to do in open source and what we're going to do with a third-party vendor solution. And one of those areas where we decided on a vendor was workload automation. So I just did a talk on this about how Control-M from BMC was really the tool that we chose to handle a lot of the coordination, the sophisticated coordination, and the workload automation on the batch side, and we're about to implement that in a data-quality monitoring framework. And that's turned out to be an incredibly stable solution for us. It's allowed us to not spend time with open-source solutions that do the same thing, like Airflow, which may or may not work well, but there's really no support around that, and focus our efforts on what we believe to be the really, really hard problems to tackle in Kafka, Kafka Streams, Connect, et cetera.
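For a sense of what the Connect half of that can look like in practice, here is a hedged sketch of registering Confluent's S3 sink connector against the Kafka Connect REST API; the connector name, topic, bucket, and flush size are placeholders, not Malwarebytes' actual configuration.

```python
import requests

# Hypothetical sink: stream a cleaned telemetry topic into an S3 data lake.
connector = {
    "name": "telemetry-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "4",
        "topics": "clean-telemetry",
        "s3.bucket.name": "example-data-lake",
        "s3.region": "us-west-2",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "10000",  # records written per S3 object
    },
}

resp = requests.post("http://connect-host:8083/connectors", json=connector)
resp.raise_for_status()
print("registered connector:", resp.json()["name"])
```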
>> Is it fair to say that Kafka plus Kafka Connect solves many of the old ETL problems, or do you still need some sort of orchestration tool on top of it to completely commoditize, essentially, moving and transforming data from an OLTP or operational system to a decision support system? >> I guess the answer to that is, it depends on your use case. I think there's a lot of things that Kafka and the Streams job can solve for you, but I don't think that we're at the point where everything can be streaming. I think that's a ways off. There's legacy systems that really don't natively stream to you anyway, and there's just certain operations that are just more efficient to do in batch. And so that's why, I don't think batch for us is going away any time soon, and that's one of the reasons why workload automation in the batch layer initially was so important, and we've decided to extend that, actually, into building out a data-quality monitoring framework to put a collar around how accurate our data is on the real-time side. >> Cuz it's really horses for courses, it's not one or the other, it's application-specific, what the best solution for that particular application is. >> Yeah, I don't think that there's, if there was a one-size-fits-all, it'd be a company, and there would be no need for architects, so I think that you have to look at your use case, your company, what kind of data, what style of data, what type of analysis you need. Do you really actually need the data in real time, and if you do put in all the work to get it in real time, are you going to be able to take action on it? And I think Malwarebytes was a great candidate. When it came in, I said, "Well, it does look like we can justify "the need for real time data, and the effort "that goes into building out a real-time framework." >> Jeff: Right, right. And we always say, what is real time? In time to do something about it, (all chuckle) and if there's not time to do something about it, depending on how you define real time, really what difference does it make if you can't do anything about it that fast. So as you look out in the future with IoT, all these connected devices, this is a hugely increased attack surface, as we just read in an essay a few weeks back. How does that work into your planning? What do you guys think about the future where there's so many more connected devices out on the edge and various degrees of intelligence and opportunities to hi-jack, if you will? >> Yeah, I think, I don't think I'm qualified to speak about the Malwarebytes product roadmap as far as IoT goes. >> But more philosophically, from a professional point of view, cuz every coin has two sides, there's a lot of good stuff coming from IoT and connected devices, but as we keep hearing over and over, just this massive attack surface expansion. >> Well I think, for us, the key is we're small and we're not operating, like I came from Apple where we operated on a budget of infinity, so we're not-- >> Having to build to infinity, or address infinity (Darren laughs), with an actual budget. >> We're small and we have to make sure that whatever we do creates value. And so what I'm seeing in the future is, as we get more into the IoT space and logs begin to proliferate and data just exponentiates in size, it's really how do we do the same thing, and how are we going to manage that in terms of cost? Generally, big data is very low in information density.
It's not like transactional systems where you get the data, it's effectively an Excel spreadsheet, and you can go run some pivot tables and filters and away you go. I think big data in general requires a tremendous amount of massaging to get to the point where a data scientist or an analyst can actually extract some insight and some value. And the question is, how do you massage that data in a way that's going to be cost-effective as IoT expands and proliferates? So that's the question that we're dealing with. We're, at this point, all in with cloud technologies, we're leveraging quite a few Amazon services, server-less technologies as well. We're just in the process of moving to Athena as an on-demand query service. And we use a lot of ephemeral clusters as well, and that allows us to actually run all of our ETL in about two hours. And so these are some of the things that we're doing to prepare for this explosion of data, and making sure that we're in a position where we're not spending a dollar to gain a penny, if that makes sense. >> That's his business. Well, he makes fun of that business model. >> I think you could do it, you want to drive revenue to sell dollars for 90 cents. >> That's the dot com model, I was there. >> Exactly, and make it up in volume. All right, Darren Chinen, thanks for taking a few minutes out of your day and giving us the story on Malwarebytes, sounds pretty exciting and a great opportunity. >> Thanks, I enjoyed it. >> Absolutely, he's Darren, he's George, I'm Jeff, you're watching The Cube. We're at Big Data SV at the Historic Pagoda Lounge. Thanks for watching, we'll be right back after this short break. (upbeat techno music)

Published Date : Mar 15 2017

Scott Gnau, Hortonworks Big Data SV 17 #BigDataSV #theCUBE


 

>> Narrator: Live from San Jose, California, it's theCUBE covering Big Data Silicon Valley 2017. >> Welcome back everyone. We're here live in Silicon Valley. This is theCUBE's coverage of Big Data Silicon Valley. Our event in conjunction with O'Reilly Strata Hadoop, of course we have our Big Data NYC event and we have our special popup event in New York and Silicon Valley. This is our Silicon Valley version. I'm John Furrier, with my co-host Jeff Frick, and our next guest is Scott Gnau, CTO of Hortonworks. Great to have you on, good to see you again. >> Scott: Thanks for having me. >> You guys have an event coming up in Munich, so I know that there's a slew of new announcements coming up with Hortonworks in April, next month in Munich for your EU event, and you're going to be holding a little bit of that back, but some interesting news this morning. We had Wei Wang yesterday with Microsoft Azure's HDInsight team. That's flowering nicely, a good bet there, but the question has always been, at least from people in the industry, and we've been questioning you guys on, hey, where's your cloud strategy? Because as a distro you guys have been very successful with your always-open approach. Microsoft, as your guy was basically saying, that's why we go with Hortonworks, because of pure open source, committed to that from day one, never wavered. The question is cloud first, AI, machine learning, this is a sweet spot for IoT. You're starting to see the collision between cloud and data, and in the intersection of that is deep learning, IoT, a lot of amazing new stuff going to be really popping out of this. Your thoughts and your cloud strategy. >> Obviously we see cloud as an enabler for these use cases. In many instances the use cases can be ephemeral. They might not be tied immediately to an ROI, so you're going to go to the capital committee and all this kind of stuff, versus let me go prove some value very quickly. It's one of the key enablers, core ingredients, and when we say cloud first, we really mean it. It's something where the solutions work together. At the same time, cloud becomes important. Our cloud strategy, and I think we've talked about this in many different venues, is really twofold. One is we want to give a common experience to our customers across whatever footprint they choose, whether it be they roll their own, they do it on prem, they do it in public cloud, and they have choice of different public cloud vendors. We want to give them a similar experience, a good experience that is enterprise-grade, a platform-level experience, so not a point solution, kind of one function and then get rid of it, but really being able to extend the platform. What I mean by that, of course, is being able to have common security, common governance, common operational management. Being able to have a blueprint of the footprint so that there's compatibility of applications that get written. And those applications can move as they decide to change their mind about where their platform is hosting the data, so our goal really is to give them a great and common experience across all of those footprints, number one. Then number two, to offer a lot of choices across all of those domains as well, whether it be, hey I want to do infrastructure as a service and I know what I want, on one end of the spectrum, to I'm not sure exactly what I want, but I want to spin up a data science cluster really quickly.
Boom, here's a platform as a service offer that runs and is available very easy to consume, comes preconfigured, and kind of everywhere in between. >> By the way, yesterday Wei was pointing out 99.99% SLAs on some of the stuff coming out. >> Are amazing, and obviously in the platform as a service space, you also get the benefit of other cloud services that can plug in that wouldn't necessarily be something you'd expect to be typical of a core Hadoop platform. Getting the SLAs, getting the disaster recovery, getting all of the things that cloud providers can provide behind the scenes is some additional upside obviously as well in those deployment options. Having that common look and feel, making it easy, making it frictionless, are all of the core components of our strategy, and we saw a lot of success with that coming out of year end last year. We see rapid customer adoption. We see rapid customer success, and frankly I would say that 99.9% of customers that I talk to are hybrid, where they have a foot on-prem and they have a foot in cloud, and they may have a foot in multiple clouds. I think that's indicative of what's going on in the world. Think about the gravity of data. Data movement is expensive. Analytics and multi-core chipsets give us the ability to process and crunch numbers at unprecedented rates, but movement of data is actually kind of hard. There's latency, it can be expensive. A lot of data in the future, IoT data, machine data, is going to be created and live its entire lifecycle in the cloud, so the notion of being able to support hybrid with a common look and feel, I think, very strategically positions us to help our customers be successful when they start actually dealing with data that lives its entire lifecycle outside the four walls of the data center. >> You guys really did a good job, I thought, on having that clean positioning of data at rest, but also you had the data in motion, which I think was ahead of its time. You guys really nailed that, and you also had the IoT edge in mind; we talked I think two years ago, and this was really not on everyone's radar, but you guys saw that, so you've made some good bets on the HDInsight, and we talked about that yesterday with Wei on here and Microsoft. So edge analytics and data in motion are very key right now, because that batch and streaming world's coming together, and IoT's flooding it with all this kind of data. We've seen the success in the clouds where analytics have been super successful when powered by the clouds. I got to ask you, with Microsoft as your preferred cloud provider, what's the current status for customers who have data in motion, specifically IoT too. It's the common question we're getting, not necessarily the Microsoft question, but okay, I've got edge coming in strong-- >> Scott: Mm-hmm >> and I'm going to run certainly hybrid in a multi cloud world, but I want to put the cloud stuff for most of the analytics, and how do I deal with the edge? >> Wow, there's a lot there (laughs) >> John: You got 10 seconds, go! (laughs) You have Microsoft as your premier cloud and you have an Amazon relationship with a marketplace and whatnot. You've got a great relationship with Microsoft. >> Yeah. I think it boils down to a bigger macro thing, and hopefully I'll peel into some specifics. I think number one, we as an industry kind of shortchange ourselves talking about Hadoop, Hadoop, Hadoop, Hadoop, Hadoop.
I think it's bigger than Hadoop, not different than, but certainly more than, right, and this is where we started with the whole connected platforms positioning, because traditional Hadoop comes from traditional thinking of data at rest. So I've got some data, I've stored it, and I want to run some analytics and I want to be able to scale it, and all that kind of stuff. Really good stuff, but only part of the issue. The other part of the issue is data that's moving, data that's being created outside of the four walls of the data center. Data that's coming from devices. How do I manage and move and handle all of that? Of course there have been different hype cycles on streaming and streaming analytics and data flow and all those things. What we wanted to do is take a very protracted look at the problem set of the future. We said look, it's really about the entire lifecycle of data, from inception to demise of the data, or the data being deleted, which very infrequently happens these days. >> Or cold storage-- >> Cold storage, whatever. You know, it's created at the edge, it moves through, it moves in different places, it's landed, it's analyzed, there are models built. But as models get deployed back out to the edge, that entire problem set is a problem set that I think we, certainly we at Hortonworks, are looking to address with the solutions. That actually is accelerated by the notion of multiple cloud footprints, because when you think about a customer that may have multiple cloud footprints and trying to tie the data together, it creates a unique opportunity. I think there's a reversal in the way people need to think about the future of compute. Where, having been around for a little bit of time, it's always been let me bring all the data together to the applications and have the applications run, and then I'll send answers back. That is impossible in this new world order, whether it be the cloud or the fog or any of the things in between or the data center; data are going to be distributed, and data movement will become the expensive thing, so it will be very important to be able to have applications that are deployable across a grid, and applications move to the data instead of data moving to the application. Or at least to have a choice and be able to be selective, so that I believe that ultimately, scalability five years from now, ten years from now, it's not going to be about how many exabytes I have in my cloud instance, that will be part of it, it will be about how many edge devices can I have computing and analyzing simultaneously and coordinating with each other this information to optimize customer experience, to optimize the way an autonomous car drives, or anywhere in between. >> It's totally radical, but it's also innovative. You mentioned the cost of moving data will be the issue. >> Scott: Yeah. >> So that's going to change the architecture of the edge. What are you seeing with customers, cuz we're seeing a lot of people taking a protracted view like you were talking about and looking at the architectures, specifically around, okay, there's some pressure, but there's no real gun to the head yet, but there's certainly pressure to do architectural thinking around edge and some of the things you mentioned. Patterns, things you can share, anecdotal stories, customer references. >> You know, the common thing is that customers go, "Yep, that's going to be interesting. "It's not hitting me right now, "but I know it's going to be important.
"How can I ease into it and kind of without the suspenders "how can I prove this is going to work and all that." We've seen a lot of certainly interest in that. What's interesting is we're able to apply some of that futuristic IoT technology in Hortonworks data flow that includes NiFi and MiNiFi out to the edge to traditional problems like, let me get the data from the branches into the central office and have that roundtrip communication to a banker who's talking to a customer and has the benefit of all the analytics at home, but I can guarantee that roundtrip of data and analytics. Things that we thought were solid before, can be solved very easily and efficiently with this technology, which is then also extensible even out further to the edge. In many instances, I've been surprised by customer adoption with them saying, "Yeah, I get that, but gee this helps me "solve a problem that I've had for the last 20 years "and it's very easy and it sets me up "on the right architectural course, "for when I start to add in those edge devices, "I know exactly how I'm going to go do it." It's been actually a really good conversation that's very pragmatic with immediate ROI, but again positioning people for the future that they know is coming. Doing that, by the way, we're also able to prove the security. Think about security is a big issue that everyone's talking about, cyber security and everything. That's typically security about my data center where I've got this huge fence around it and it's very controlled. Think about edge devices are now outside that fence, so security and privacy and provenance become really, really interesting in that world. It's been gratifying to be able to go prove that technology today and again put people on that architectural course that positions them to be able to go out further to the edge as their business demands it. >> That's such great validation when they come back to you with a different solution based on what you just proposed. >> Scott: Yep. >> That means they really start to understand, they really start to see-- >> Scott: Yep. >> How it can provide value to them. >> Absolutely, absolutely. That is all happening and again like I said this I think the notion of the bigger problem set, where it's not just storing data and analyzing data, but how do I have portable applications and portable applications that move further and further out to the edge is going to be the differentiation. The future successful deployments out there because those deployments and folks are able to adopt that kind of technology will have a time to market advantage, they'll have a latency advantage in terms of interaction with a customer, not waiting for that roundtrip of really being able to push out customized, tailored interactions, whether it be again if it's driving your car and stopping on time, which is kind of important, to getting a coupon when you're walking past a store and anywhere in between. >> It's good you guys have certainly been well positioned for being flexible, being an open source has been a great advantage. I got to ask you the final question for the folks watching, I'm sure you guys answer this either to investors or whatnot and customers. A lot's changed in the past five years and a lot's happening right now. You just illustrated it out, the scenario with the edge is very robust, dynamic, changing, but yet value opportunity for businesses. 
What's the biggest thing that's changing right now in the Hortonworks view of the world that's notable, that you think is worth highlighting to people watching that are your customers, investors, or people in the industry? >> I think you brought up a good point, the whole notion of open and the whole groundswell around open source, open community development as a new paradigm for delivering software. I talked a little bit about a new paradigm of the gravity of data and sensors and this new problem set that we've got to go solve; that's kind of one piece of this storm. The other piece of the storm is the adoption and the wave of open, open community collaboration of developers, versus integrated silo stacks of software. That's manifesting itself in two places, and obviously I think we're an example of helping to create that. Open collaboration means quicker time to market and more innovation, and accelerated innovation in an increasingly complex world. That's one requirement slash advantage of being in the open world. I think the other thing that's happening is the generation of workforce. When I think about when I got my first job, I typed a resume with a typewriter. I'm dating myself. >> White out. >> Scott: Yeah, with white out. (laughter) >> I wasn't a good typer. >> A resume today is basically a name and a GitHub address. Here's my body of work, and it's out there for everybody to see, and that's the mentality-- >> And they have their cute videos up there as well, of course. >> Scott: Well yeah, I'm sure. (laughter) >> So it's kind of like that shift to, this is now the new paradigm for software delivery. >> This is important. You've got theCUBE interview, but I mean you're seeing it-- >> Is that the open source? >> In the entertainment too. No, we're seeing people put huge interviews on their LinkedIn, so this notion of collaboration in the software engineering mindset. You go back to when we grew up in software engineering, then it went to open source, and now GitHub is essentially a social network for your body of work. You're starting to see the software development open source concepts, they apply to data engineering; data science is still early days. Media creation, whatnot, so I think that's a really key point, and the data science tools are still in their infancy. >> I think open, and by the way I'm not here to suggest that everything will be open, but I think a majority and-- >> Collaborative, the majority of the problem that we're solving will be collaborative, it will be ecosystem driven, and where there's an extremely large market, open will be the most efficient way to address it. And certainly no one's arguing that data and big data is not a large market. >> Yep. You guys are all in on the cloud now, you got the Microsoft, any other updates that you think are worth sharing with folks? >> You've got to come back and see us in Munich then. >> Alright. We'll be there, theCUBE will be there in Munich in April. We have the Hortonworks coverage going on at DataWorks, the conference is now called DataWorks Summit, in Munich. This is theCUBE here with Scott Gnau, the CTO of Hortonworks. Breaking it down, I'm John Furrier with Jeff Frick. More coverage from Big Data SV in conjunction with Strata Hadoop after the short break. (upbeat music)

Published Date : Mar 15 2017

Holden Karau, IBM Big Data SV 17 #BigDataSV #theCUBE


 

>> Announcer: Big Data Silicon Valley 2017. >> Hey, welcome back, everybody, Jeff Frick here with The Cube. We are live at the historic Pagoda Lounge in San Jose for Big Data SV, which is associated with Strata + Hadoop World across the street, as well as Big Data week, so everything big data is happening in San Jose. We're happy to be here, love the new venue, if you're around, stop by, back of the Fairmont, Pagoda Lounge. We're excited to be joined in this next segment by, who's now become a regular, any time we're at a Big Data event, a Spark event, Holden always stops by. Holden Karau, she's the principal software engineer at IBM. Holden, great to see you. >> Thank you, it's wonderful to be back yet again. >> Absolutely, so the big data meme just keeps rolling, Google Cloud Next was last week, a lot of talk about AI and ML, and of course you're very involved in Spark, so what are you excited about these days? What are you, I'm sure you've got a couple presentations going on across the street. >> Yeah, so my two presentations this week, oh wow, I should remember them. So the one that I'm doing today is with my co-worker Seth Hendrickson, also at IBM, and we're going to be focused on how to use structured streaming for machine learning. And sort of, I think that's really interesting, because streaming machine learning is something a lot of people seem to want to do but aren't yet doing in production, so it's always fun to talk to people before they've built their systems. And then tomorrow I'm going to be talking with Joey on how to debug Spark, which is something that I, you know, a lot of people ask questions about, but I tend to not talk about, because it tends to scare people away, and so I try to keep the happy going. >> Jeff: Bugs are never fun. >> No, no, never fun. >> Just picking up on that structured streaming and machine learning, so there's this issue of, as we move more and more towards the industrial internet of things, like having to process events as they come in, make a decision. There's a range of latency that's required. Where does structured streaming and ML fit today, and where might that go? >> So structured streaming for today, latency wise, is probably not something I would use for something like that right now. It's in the sub-second range. Which is nice, but it's not what you want for like live serving of decisions for your car, right? That's just not going to be feasible. But I think it certainly has the potential to get a lot faster. We've seen a lot of renewed interest in MLlib local, which is really about making it so that we can take the models that we've trained in Spark and really push them out to the edge, and sort of serve them at the edge, and apply our models on end devices. So I'm really excited about where that's going. To be fair, part of my excitement is someone else is doing that work, so I'm very excited that they're doing this work for me. >> Let me clarify on that, just to make sure I understand. So there's a lot of overhead in Spark, because it runs on a cluster, because you have an optimizer, because you have the high availability or the resilience, and so you're saying we can preserve the predict and maybe serve part and carve out all the other overhead for running in a very small environment. >> Right, yeah. So I think for a lot of these IoT devices and stuff like that it actually makes a lot more sense to do the predictions on the device itself, right.
These models generally are megabytes in size, and we don't need a cluster to do predictions on these models, right. We really need the cluster to train them, but I think for a lot of cases, pushing the prediction out to the edge node is actually a pretty reasonable use case. And so I'm really excited that we've got some work going on there. >> Taking that one step further, we've talked to a bunch of people, both like at GE, and at their Minds and Machines show, and IBM's Genius of Things, where you want to be able to train the models up in the cloud, where you're getting data from all the different devices, and then push the retrained model out to the edge. Can that happen in Spark, or do we have to have something else orchestrating all that? >> So actually pushing the model out isn't something that I would do in Spark itself, I think that's better served by other tools. Spark is not really well suited to large amounts of internet traffic, right. But it's really well suited to the training, and I think with MLlib local it'll essentially, we'll be able to provide both sides of it, and the copy part will be left up to whoever it is that's doing the work, right, because like if you're copying over a cell network you need to do something very different than if you're broadcasting over terrestrial XM or something like that, and you need to do something very different for satellite. >> If you're at the edge on a device, would you be actually running, like you were saying earlier, structured streaming, with the prediction? >> Right, I don't think you would use structured streaming per se on the edge device, but essentially there would be a lot of code share between structured streaming and the code that you'd be using on the edge device. And it's being factored out now so that we can have this code sharing in Spark machine learning. And you would use structured streaming maybe on the training side, and then on the serving side you would use your custom local code. >> Okay, so tell us a little more about Spark ML today and how we can democratize machine learning, you know, for a bigger audience. >> Right, I think machine learning is great, but right now you really need a strong statistical background to really be able to apply it effectively. And we probably can't get rid of that for all problems, but I think for a lot of problems, doing things like hyperparameter tuning can actually give really powerful tools to just like regular engineering folks who, they're smart, but maybe they don't have a strong machine learning background. And Spark's ML pipelines make it really easy to sort of construct multiple stages, and then just be like, okay, I don't know what these parameters should be, I want you to do a search over what these different parameters could be for me, and it makes it really easy to do this as just a regular engineer with less of an ML background. >> Would that be like, just for those of us who don't know what hyperparameter tuning is, that would be the knobs, the variables? >> Yeah, it's going to spin the knobs on like our regularization parameter on our regression, and it can also spin some knobs on maybe the n-gram sizes that we're using on the inputs to something else, right. And it can compare how these knobs sort of interact with each other, because often you can tune one knob, but you actually have six different knobs that you want to tune, and you don't know, if you just explore each one individually, you're not going to find the best setting for them working together.
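As a rough sketch of what that knob-spinning looks like with Spark's ML pipelines, assuming a DataFrame `train` with `text` and `label` columns (an invented dataset, not from the interview), a grid search over the regularization parameter and the n-gram size might look like this:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, NGram, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# A multi-stage pipeline: tokenize text, build n-grams, hash to features,
# then fit a logistic regression on top.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
ngram = NGram(inputCol="words", outputCol="ngrams")
tf = HashingTF(inputCol="ngrams", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, ngram, tf, lr])

# Spin several knobs at once and let Spark search the combinations.
grid = (ParamGridBuilder()
        .addGrid(ngram.n, [2, 3])
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
model = cv.fit(train)  # `train` is an assumed DataFrame of text and label
```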
>> So this would make it easier for, as you're saying, someone who's not a data scientist to set up a pipeline that lets you predict. >> I think so, very much. I think it does a lot of the, brings a lot of the benefits from sort of the SciPy world to the big data world. And SciPy is really wonderful about making machine learning really accessible, but it's just not ready for big data, and I think this does a good job of bringing these same concepts, if not the code, but the same concepts, to big data. >> The SciPy, if I understand, is it a notebook that would run essentially on one machine? >> SciPy can be put in a notebook environment, and generally it would run on, yeah, a single machine. >> And so to make that sit on Spark means that you could then run it on a cluster-- >> So this isn't actually taking SciPy and distributing it, this is just like stealing the good concepts from SciPy and making them available for big data people. Because SciPy's done a really good job of making a very intuitive machine learning interface. >> So just to put a fine sort of qualifier on one thing, if you're doing the internet of things and you have Spark at the edge and you're running the model there, it's the programming model, so structured streaming is one way of programming Spark, but if you don't have structured streaming at the edge, would you just be using the core batch Spark programming model? >> So at the edge you'd just be using, you wouldn't even be using batch, right, because you're trying to predict individual events, right, so you'd just be calling predict with every new event that you're getting in. And you might have a queue mechanism of some type. But essentially if we had this batch, we would be adding additional latency, and I think at the edge we really, the reason we're moving the models to the edge is to avoid the latency. >> So just to be clear then, is the programming model, so it wouldn't be structured streaming, and we're taking out all the overhead that forced us to use batch with Spark. So the reason I'm trying to clarify is a lot of people had this question for a long time, which is are we going to have a different programming model at the edge from what we have at the center? >> Yeah, that's a great question. And I don't think the answer is finished yet, but I think the work is being done to try and make it look the same. Of course, you know, trying to make it look the same, this is Boosh, it's not like actually barking at us right now, even though she looks like a dog, she is, there will always be things which are a little bit different from the edge to your cluster, but I think Spark has done a really good job of making things look very similar on single node cases to multi node cases, and I think we can probably bring the same things to ML. >> Okay, so it's almost time, we're coming back, Spark took us from single machine to cluster, and now we have to essentially bring it back for an edge device that's really lightweight. >> Yeah, I think at the end of the day, just from a latency point of view, that's what we have to do for serving. For some models, not for everyone. Like if you're building a website with a recommendation system, you don't need to serve that model like on the edge node, that's fine, but like if you've got a car device we can't depend on cell latency, right, you have to serve that in car.
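To make the serving side concrete, here is one hedged sketch of the idea, not the MLlib local API itself: after training a logistic regression in Spark, only the fitted coefficients travel to the device, and a few lines of plain Python score each event with no cluster in sight. The `lr_model` variable and the feature values are assumptions for illustration.

```python
import math

# On the cluster: lr_model is an already-fitted pyspark
# LogisticRegressionModel. Its learned state is tiny, a coefficient
# vector plus an intercept, so it ships to a device as plain numbers.
weights = lr_model.coefficients.toArray().tolist()  # assumed fitted model
bias = float(lr_model.intercept)

# On the edge device: score one incoming event, no Spark runtime at all.
def predict(features):
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # logistic sigmoid

print(predict([0.3, 1.2, -0.7]))  # made-up feature vector
```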
>> So there's a lot of really exciting things coming out of IBM. And I'm obviously pretty biased. I spend a lot of time focused on Python support in Spark, and one of the most exciting things is coming from my co-worker Brian, I'm not going to say his last name in case I get it wrong, but Brian is amazing, and he's been working on integrating Arrow with Spark, and this can make it so that it's going to be a lot easier to sort of interoperate between JVM languages and Python and R, so I'm really optimistic about the sort of Python and R interfaces improving a lot in Spark and getting a lot faster as well. And we're also, in addition to the Arrow work, we've got some work around making it a lot easier for people in R and Python to get started. The R stuff is mostly actually the Microsoft people, thanks Felix, you're awesome. I don't actually know which camera I should have done that to, but that's okay. >> I think you got it! >> But Felix is amazing, and the other people working on R are too. But I think we've both been pursuing sort of making it so that people who are in the R or Python spaces can just use like pip install, conda install, or whatever tool it is they're used to working with, to just bring Spark into their machine really easily, just like they would sort of any other software package that they're using. Because right now, for someone getting started in Spark, if you're in the Java space it's pretty easy, but if you're in R or Python you have to do sort of a lot of weird setup work, and it's worth it, but like if we can get rid of that friction, I think we can get a lot more people in these communities using Spark. >> Let me see, just as a scenario, the R server is getting fairly well integrated into SQL Server, so would it be, would you be able to use R as the language with a Spark execution engine to somehow integrate it into SQL Server as an execution engine for doing the machine learning and predicting? >> You definitely, well I shouldn't say definitely, you probably could do that. I don't necessarily know if that's a good idea, but that's the kind of stuff that this would enable, right, it'll make it so that people that are making tools in R or Python can just use Spark as another library, right, and it doesn't have to be this really special setup. It can just be this library, and they point it at the cluster and they can do whatever work they want to do. That being said, the SQL Server R integration, if you find yourself using that to do like distributed computing, you should probably take a step back and like rethink what you're doing. >> George: Because it's not really scale out. >> It's not really set up for that. And you might be better off doing this with like, connecting your Spark cluster to your SQL Server instance using like JDBC or a special driver and doing it that way, but you definitely could do it in another inverted sort of way. >> So last question from me, if you look out a couple years, how will we make machine learning accessible to a bigger and bigger audience? And I know you touched on the tuning of the knobs, hyperparameter tuning, what will it look like ultimately? >> I think ML pipelines are probably what things are going to end up looking like.
But I think the other part that we'll sort of see is we'll see a lot more examples of how to work with certain kinds of data, because right now, like, I know what I need to do when I'm ingesting some textual data, but I know that because I spent like a week trying to figure out what the hell I was doing once, right. And I didn't bother to write it down. And it looks like no one else bothered to write it down. So really I think we'll see a lot of tools that look very similar to the tools we have today, they'll have more options and they'll be a bit easier to use, but I think the main thing that we're really lacking right now is good documentation and sort of good books and just good resources for people to figure out how to use these tools. Now of course, I mean, I'm biased, because I work on these tools, so I'm like, yeah, they're pretty great. So there might be other people who are like, Holden, no, you're wrong, we need to rethink everything. But I think this is, we can go very far with the pipeline concept. >> And then that's good, right? The democratization of these things opens it up to more people, you get more creative people solving more different problems, that makes the whole thing go. >> You can like install Spark easily, you can, you know, set up an ML pipeline, you can train your model, you can start doing predictions, you can, people that haven't been able to do machine learning at scale can get started super easily, and build a recommendation system for their small little online shop and be like, hey, you bought this, you might also want to buy Boosh, he's really cute, but you can't have this one. No no no, not this one. >> Such a tease! >> Holden: I'm sorry, I'm sorry. >> Well Holden, that will, we'll say goodbye for now, I'm sure we will see you in June in San Francisco at the Spark Summit, and look forward to the update. >> Holden: I look forward to chatting with you then. >> Absolutely, and break a leg this afternoon at your presentation. >> Holden: Thank you. >> She's Holden Karau, I'm Jeff Frick, he's George Gilbert, you're watching The Cube, we're at Big Data SV, thanks for watching. (upbeat music)
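
[Editor's note: Holden's "pip install plus pipelines" vision is concrete enough to sketch. The snippet below is hypothetical, not code discussed in the interview: it assumes the pip-installable pyspark package, which the community was working toward at the time and which later shipped on PyPI, and all data and column names are invented for illustration.]

```python
# A minimal sketch of the ML-pipelines idea discussed above: install Spark
# like any other Python package ("pip install pyspark"), assemble features,
# train a model, and call it on new records. Data and columns are made up.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Toy training data: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, 3.3, 1), (0.1, 0.2, 0)],
    ["f1", "f2", "label"],
)

# The pipeline chains feature assembly and the estimator into one object,
# so the same preprocessing runs at training time and at prediction time.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

# Scoring in batch here; at the edge you would call the model per event.
new_events = spark.createDataFrame([(1.9, 0.9), (0.2, 0.3)], ["f1", "f2"])
model.transform(new_events).select("f1", "f2", "prediction").show()
```
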

Published Date : Mar 15 2017


Frederick Reiss, IBM STC - Big Data SV 2017 - #BigDataSV - #theCUBE


 

>> Narrator: Live from San Jose, California it's the Cube, covering Big Data Silicon Valley 2017. (upbeat music) >> Big Data SV 2017, day two of our wall to wall coverage of the Strata + Hadoop conference, Big Data SV, really what we call Big Data Week because this is where all the action is going on down in San Jose. We're at the historic Pagoda Lounge in the back of the Fairmont, come on by and say hello, we've got a really cool space and we're excited and never been in this space before, so we're excited to be here. So we got George Gilbert here from Wikibon, we're really excited to have our next guest, he's Fred Reiss, he's the chief architect at IBM Spark Technology Center in San Francisco. Fred, great to see you. >> Thank you, Jeff. >> So I remember when Rob Thomas, we went up and met with him in San Francisco when you guys first opened the Spark Technology Center, a couple of years ago now. Give us an update on what's going on there, I know IBM's putting a lot of investment in this Spark Technology Center in the San Francisco office specifically. Give us kind of an update of what's going on. >> That's right, Jeff. Now we're in the new Watson West building in San Francisco at 505 Howard Street, colocated, we have about a 50 person development organization. Right next to us we have about 25 designers, and on the same floor a lot of developers from Watson doing a lot of data science, and from The Weather Underground, doing weather and data analysis, so it's a really exciting place to be, lots of interesting work in data science going on there. >> And it's really great to see how IBM is taking the core Watson, obviously enabled by Spark and other core open source technology, and now applying it, we're seeing Watson for Health, Watson for autonomous vehicles, Watson for Marketing, Watson for this, and really bringing that type of machine learning power to all the various verticals in which you guys play. >> Absolutely, that's been what Watson has been about from the very beginning, bringing the power of machine learning, the power of artificial intelligence to real world applications. >> Jeff: Excellent. >> So let's tie it back to the Spark community. Most folks understand how Databricks builds out the core, or does most of the core work for, like, the SQL workload, the streaming and machine learning, and I guess graph is still immature. We were talking earlier about IBM's contributions in helping to build up the machine learning side. Help us understand what the Databricks core technology for machine learning is and how IBM is building beyond that. >> So the core technology for machine learning in Apache Spark comes out, actually, of the machine learning department at UC Berkeley as well as a lot of different members from the community. Some of those community members also work for Databricks. We actually at the IBM Spark Technology Center have made a number of contributions to the core Apache Spark and the libraries, for example recent contributions in neural nets. In addition to that, we also work on a project called Apache System ML, which used to be proprietary IBM technology, but the IBM Spark Technology Center has turned System ML into Apache System ML, it's now an open Apache incubating project that's been moving forward out in the open. You can now download the latest release online, and that provides a piece that we saw was missing from Spark and a lot of other similar environments: an optimizer for machine learning algorithms.
So in Spark, you have the Catalyst optimizer for data analysis, DataFrames, SQL: you write your queries in terms of those high level APIs and Catalyst figures out how to make them go fast. In System ML, we have an optimizer for high-level languages like R and Python, where you can write algorithms in terms of linear algebra, in terms of high level operations on matrices and vectors, and have the optimizer take care of making those algorithms run in parallel, run at scale, taking into account the data characteristics. Does the data fit in memory? If so, keep it in memory. Does the data not fit in memory? Stream it from disk. >> Okay, so there was a ton of stuff in there. >> Fred: Yep. >> And if I were to refer to that as so densely packed as to be a black hole, that might come across wrong, so I won't refer to that as a black hole. But let's unpack that, so the, and I meant that in a good way, like high bandwidth, you know. >> Fred: Thanks, George. >> Um, so the traditional Spark, the machine learning that comes with Spark's MLlib, one of its distinguishing characteristics is that the models, the algorithms that are in there, have been built to run on a cluster. >> Fred: That's right. >> And very few have, very few others have built machine learning algorithms to run on a cluster, but as you were saying, you don't really have an optimizer for finding something where a couple of the algorithms would fit optimally to solve a problem. Help us understand, then, how System ML solves a more general problem for, say, ensemble models and for scale out. Help us understand how System ML fits relative to Spark's MLlib and the more general problems it can solve. >> So, MLlib and a lot of other packages, such as Sparkling Water from H2O, for example, provide you with a toolbox of algorithms, and each of those algorithms has been hand tuned for a particular range of problem sizes and problem characteristics. This works great as long as the particular problem you're facing as a data scientist is a good match to that implementation that you have in your toolbox. What System ML provides is less like having a toolbox and more like having a machine shop. You have a lot more flexibility, you have a lot more power, you can write down an algorithm as you would write it down if you were implementing it just to run on your laptop, and then let the System ML optimizer take care of producing a parallel version of that algorithm that is customized to the characteristics of your cluster, customized to the characteristics of your data. >> So let me stop you right there, because I want to use an analogy that others might find easy to relate to, for all the people who understand SQL and scale out SQL. So, the way you were describing it, it sounds like oh, if I were a SQL developer and I wanted to get at some data on my laptop, I would find it pretty easy to write the SQL to do that. Now, let's say I had a bunch of servers, each with its own database, and I wanted to get data from each database. If I didn't have a scale out database, I would have to figure out physically how to go to each server in the cluster to get it. What I'm hearing for System ML is it will take that query that I might have written on my one server and it will transparently figure out how to scale that out, although in this case not queries, machine learning algorithms. >> The database analogy is very apt.
Just like SQL and query optimization, by allowing you to separate that logical description of what you're looking for from the physical description of how to get at it, it lets you have a parallel database with the exact same language as a single machine database. In System ML, because we have an optimizer that separates that logical description of the machine learning algorithm from the physical implementation, we can target a lot of parallel systems, we can also target a large server, and the code, the code that implements the algorithm, stays the same. >> Okay, now let's take that a step further. You refer to matrix math and I think linear algebra and a whole lot of other things that I never quite made it to since I was a humanities major, but when we're talking about those things, my understanding is that those are primitives that Spark doesn't really implement, so that if you wanted to do neural nets, which rely on some of those constructs for high performance, >> Fred: Yes. >> Then, um, that's not built into Spark. Can you get to that capability using System ML? >> Yes. System ML, at its core, provides you as a user with a library of linear algebra primitives, just like a language like R or a library like NumPy gives you matrices and vectors and all of the operations you can do on top of those primitives. And just to be clear, linear algebra really is the language of machine learning. If you pick up a paper about an advanced machine learning algorithm, chances are the specification for what that algorithm does and how that algorithm works is going to be written in the paper literally in linear algebra, and the implementation that was used in that paper is probably written in a language where linear algebra is built in, like R, like NumPy. >> So it sounds to me like Spark has done the work of sort of the blocking and tackling of machine learning to run in parallel. And that's, I mean, to be clear, since we haven't really talked about it, that's important when you're handling data at scale and you want to train, you know, models on very, very large data sets. But it sounds like when we want to go to some of the more advanced machine learning capabilities, the ones that today are making all the noise with, you know, speech to text, text to speech, natural language understanding, those neural network based capabilities are not built into the core Spark MLlib, that, would it be fair to say you could start getting at them through System ML? >> Yes, System ML is a much better way to do scalable linear algebra on top of Spark than the very limited linear algebra that's built into Spark. >> So alright, let's take the next step. Can System ML be grafted onto Spark in some way, or would it have to be an entirely new API that doesn't integrate with all the other Spark APIs? In a way, that has differentiated Spark, where each API is sort of accessible from every other. Can you tie System ML in, or do the Spark guys have to build more primitives into their own sort of engine first? >> A lot of the work that we've done with the Spark Technology Center as part of bringing System ML into the Apache ecosystem has been to build a nice, tight integration with Apache Spark, so you can pass Spark DataFrames directly into System ML and you can get DataFrames back. Your System ML algorithm, once you've written it in terms of one of System ML's scripting languages, just plugs into Spark like all the algorithms that are built into Spark.
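
[Editor's note: the integration Fred describes can be sketched briefly. The snippet below is hypothetical and assumes the Apache SystemML Python bindings of that era (the `systemml` package, `pip install systemml`); constructor and result-conversion details varied across releases, and the data is made up. The algorithm, ordinary least squares via the normal equations, is written as pure linear algebra, and the optimizer decides how to execute it.]

```python
# Hand Spark DataFrames to System ML, express the algorithm in linear
# algebra (DML, an R-like scripting language), and get results back.
from pyspark.sql import SparkSession
from systemml import MLContext, dml  # Apache SystemML Python bindings

spark = SparkSession.builder.appName("systemml-sketch").getOrCreate()
ml = MLContext(spark)

# Features X and labels y as Spark DataFrames of doubles (toy data).
X_df = spark.createDataFrame([(1.0, 2.0), (2.0, 1.0), (3.0, 4.0)], ["c1", "c2"])
y_df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["label"])

# Ordinary least squares via the normal equations: solve (X'X) w = X'y.
script = (
    dml("w = solve(t(X) %*% X, t(X) %*% y)")
    .input(X=X_df, y=y_df)
    .output("w")
)

# The System ML optimizer plans execution against the cluster; the script
# itself never mentions partitioning or parallelism.
w = ml.execute(script).get("w")  # matrix result; conversion helpers vary by release
print(w)
```
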
>> Okay, so that's, that would keep Spark competitive with more advanced machine learning frameworks for a longer period of time, in other words, it wouldn't hit the wall the way it would if it encountered TensorFlow from Google, for Google's way of doing deep learning. Spark wouldn't hit the wall once it needed, like, a TensorFlow, as long as it had System ML so deeply integrated the way you're doing it. >> Right, with a system like System ML, you can quickly move into new domains of machine learning. So for example, this afternoon I'm going to give a talk with one of our machine learning developers, Mike Dusenberry, about our recent efforts to implement deep learning in System ML, like full scale, convolutional neural nets running on a cluster in parallel, processing many gigabytes of images, and we implemented that with very little effort because we have this optimizer underneath that takes care of a lot of the details of how you get that data into the processing, how you get the data spread across the cluster, how you get the processing moved to the data or vice versa. All those decisions are taken care of in the optimizer, you just write down the linear algebra parts and let the system take care of it. That let us implement deep learning much more quickly than we would have if we had done it from scratch. >> So it's just this ongoing cadence of basically removing the infrastructure management from the data scientists and enabling them to concentrate really where their value is, on the algorithms themselves, so they don't have to worry about how many clusters it's running on, and that configuration, kind of the typical dev ops that we see on the regular development side, but now you're really bringing that into the machine learning space. >> That's right, Jeff. Personally, I find all the minutiae of making a parallel algorithm work really fascinating, but a lot of people working in data science really see parallelism as a tool. They want to solve the data science problem, and System ML lets you focus on solving the data science problem because the system takes care of the parallelism. >> You guys could go on in the weeds for probably three hours, but we don't have enough coffee and we're going to set up a follow up time because you're both in San Francisco. But before we let you go, Fred, as you look forward into 2017, kind of the advances that you guys have done there at the IBM Spark Center in the city, what's kind of the next couple great hurdles that you're looking to cross, new challenges that are getting you up every morning that you're excited to come back a year from now and be able to say wow, these are the one or two things that we were able to take down in 2017? >> We're moving forward on several different fronts this year. On one front, we're helping to get the notebook experience with Spark notebooks consistent across the entire IBM product portfolio. We helped a lot with the rollout of notebooks on Data Science Experience on z, for example, and we're working actively with the Data Science Experience and with the Watson Data Platform. On the other hand, we're contributing to Spark 2.2. There are some exciting features, particularly in SQL, that we're hoping to get into that release, as well as some new improvements to MLlib. We're moving forward with Apache System ML, we just cut version 0.13 of that. We're talking right now on the mailing list about getting System ML out of incubation, making it a full, top level project.
And we're also continuing to help with the adoption of Apache Spark technology in the enterprise. Our latest focus has been on deep learning on Spark. >> Well, I think we found him! Smartest guy in the room. (laughter) Thanks for stopping by, and good luck on your talk this afternoon. >> Thank you, Jeff. >> Absolutely. Alright, he's Fred Reiss, he's George Gilbert, and I'm Jeff Frick, you're watching the Cube from Big Data SV, part of Big Data Week in San Jose, California. (upbeat music) (mellow music) >> Hi, I'm John Furrier, the cofounder of SiliconANGLE Media and cohost of the Cube. I've been in the tech business since I was 19, first programming on minicomputers.

Published Date : Mar 15 2017


Basil Faruqui, BMC Software - BigData SV 2017 - #BigDataSV - #theCUBE


 

(upbeat music) >> Announcer: Live from San Jose, California, it's theCUBE covering Big Data Silicon Valley 2017. >> Welcome back everyone. We are here live in Silicon Valley for theCUBE's Big Data coverage. Our event, Big Data Silicon Valley, also called Big Data SV. A companion event to our Big Data NYC event, where we have our unique program in conjunction with Strata + Hadoop. I'm John Furrier with George Gilbert, our Wikibon big data analyst. And we have Basil Faruqui, who is the Solutions Marketing Manager at BMC Software. Welcome to theCUBE. >> Thank you, great to be here. >> We've been hearing a lot on theCUBE about schedulers and automation, and machine learning is the hottest trend happening in big data. We're thinking that this is going to help move the needle on some things. Your thoughts on this, on the world we're living in right now, and what BMC is doing at the show. >> Absolutely. So, scheduling and workflow automation is absolutely critical to the success of big data projects. This is not something new. Hadoop is only about 10 years old, but other technologies that have come before Hadoop have relied on this foundation for driving success. If we look at the Hadoop world, what gets all the press is all the real-time stuff, but what powers all of that underneath it is a very important layer of batch. If you think about some of the most common use cases for big data, if you think of a bank, they're talking about fraud detection and things like that. Let's just take the fraud detection example. Detecting an anomaly in how somebody is spending: if somebody's credit card is used in a way that doesn't match their spending habits, the bank detects that and they'll maybe close the card down or contact somebody. But if you think about it, everything else that has happened before that is something that has happened in batch mode. For them to collect the history of how that card has been used, then match it with how all the other card members use their cards, and when cards are stolen, what those patterns are: all that stuff is being powered by what's today known as workload automation. In the past, it's been known by names such as job scheduling and batch processing. >> In the systems business everyone knows what schedulers, compilers, all this computer science stuff is. But this is interesting. Now that the data lake has become so swampy, and people call it the data swamp, people are looking at moving data out of data lakes into real time, as you mention, but it requires management. So, there's a lot of coordination going on. This seems to be where most enterprises are now focusing their attention on, is to make that data available. >> Absolutely. >> Hence the notion of scheduling and workloads. Because their use cases are different. Am I getting it right? >> Yeah, absolutely. And if we look at what companies are doing, in every boardroom and for every CEO there's a charter for digital transformation. And it's no longer about taking one or two use cases around big data and driving success. Data and intelligence is now at the center of everything a company does, whether it's building new customer engagement models, whether it's building new ecosystems with their partners and suppliers, or back-office optimization. So, when CIOs and data architects think about having to build a system like that, they are faced with a number of challenges. It has to become enterprise ready. It has to take into account governance, security, and others.
But, if you peel the onion just a little bit, what architects and CIOs are faced with is okay, you've got a web of complex technologies, legacy applications, modern applications that hold a lot of the corporate data today. And then you have new sources of data like social media, devices, sensors, which have a tendency to produce a lot more data. First things first, you've got an ecosystem like Hadoop, which is supposed to be kind of the nerve center of the new digital platform. You've got to start ingesting all this data into Hadoop, and this has to be in an automated fashion for it to be able to scale. >> But this is the combination of streaming and batch. >> Correct. >> Now this seems to be the management holy grail right now. Nailing those two. Did I get that? >> Absolutely. So, people talk about, in technical terms, the speed layer and the batch layer. And both have to converge for them to be able to deliver the intelligence and insight that the business users are looking for. >> Would it be fair to say it's not just the convergence of the speed layer and batch layer in Hadoop, but what BMC brings to town is the non-Hadoop parts of those workloads? Whether it's batch outside Hadoop, or if there's streaming, which sort of pre-Hadoop was more nichey. But we need this over-arching control, which, if it's not a Hadoop-centric architecture... >> Absolutely. So, I've said this for a long time, that Hadoop is never going to live on an island on its own in the enterprise. And with the maturation of the market, Hadoop has to now play with the other technologies in the stack. So if you think about, just take data ingestion for an example, you've got ERPs, you've got CRMs, you've got middleware, you've got data warehouses, and you have to ingest a lot of that in. Where Control-M brings a lot of value and speeds up time to market is that we have out-of-the-box integrations with a lot of the systems that already exist in the enterprise, such as ERP solutions and others. Virtually any application that can expose itself through an API or a web service, Control-M has the ability to automate that ingestion piece. But this is only step one of the journey. So, you've brought all this data into Hadoop, and now you've got to process it. The number of tools available for processing this is growing at an unprecedented rate. You've got, you know, MapReduce was a hot thing just two years ago and now Spark has taken over. So Control-M, about four years ago we started building very deep native capabilities in this new ecosystem. So, you've got ingestion that's automated, then you can seamlessly automate the actual processing of the data using things like Spark, Hive, Pig, and others. And the last mile of the journey, the most important one, is making this refined data available to systems and users that can analyze it. Often Hadoop is not the repository that analytic systems sit on top of. It's another layer where all of this has to be moved. So, if you zoom out and take a look at it, this is a monumental task. And if you use a siloed approach to automating this, this becomes unscalable. And that's where a lot of the Hadoop projects often >> Crash and burn. >> Crash and burn, yes, sustainability. >> Let's just say it, they crash and burn. >> So, Control-M has been around for 30 years. >> By the way, just to add to the crash-and-burn piece, the data lake gets stalled there, that's why the swamp happens, because they're like, now how do I operationalize this and scale it out?
Right, if you're storing a lot of data and not making it available for processing and analysis, then it's of no use. And that's exactly our value proposition. This is not a problem we are solving for the first time. We did this as we have seen these waves of automation come through: from the mainframe time, when it was called batch processing, then it evolved into distributed client server, when it was known more as job scheduling. And now. >> So BMC has seen this movie before. >> Absolutely. >> Alright, so let's take a step back. Zoom out, step back, go hang out in the big trees, look down on the market. Data practitioners, big data practitioners out there right now are wrestling with this issue. You've got streaming, real-time stuff, you got batch, it's all coming together. What is Control-M doing great right now with practitioners that you guys are solving? Because there are a zillion tools out there, but people are human. Every hammer looks for a nail. >> Sure. So, you have a lot of change happening at the same time, and yet all these tools. What is Control-M doing to really win? Where are you guys winning? >> Where we are adding a lot of value for our customers is helping them speed up the time to market in delivering these big data projects, and delivering them at scale and quality. >> Give me an example of a project. >> Malwarebytes is a Silicon Valley-based company. They are using this to ingest and analyze data from thousands of end-points from their end users. >> That's their Lambda architecture, right? >> In Lambda architecture, I won't steal their thunder, they're presenting tomorrow at eleven. >> Okay. >> Eleven-thirty tomorrow. Another example is a company called Navistar. Now here's a company that's been around for 200 years. They manufacture heavy-duty trucks, 18-wheelers, school buses. And they recently came up with a service called OnCommand. They have a fleet of 160,000 trucks that are fitted with sensors. They're sending telematics data back to their data centers, and in between it stops in the cloud. So it gets to the cloud. >> So they're going up to the cloud for upload and backhaul, basically, right? >> Correct. So, it goes to the cloud. From there it is ingested inside their Hadoop systems. And they're looking for trends to make sure none of the trucks break down, because a truck that's carrying freight breaking down hits the bottom line right away. But that's not where they're stopping. In real time they can triangulate the position of the truck, figure out where the nearest dealership is. Do they have the parts? When to schedule the service. But, if you think about it, the warranty information, the parts information is not sitting in Hadoop. That's sitting in their mainframes, SAP systems, and others. And Control-M is orchestrating this across the board, from mainframe to ERP and into Hadoop, for them to be able to marry all this data together. >> How do you get back into the legacy? That's because you have the experience there? Is that part of the product portfolio? >> That is absolutely a part of the product portfolio. We started our journey back in the mainframe days, and as the world has evolved, to client server, to web, and now mobile and virtualized and software-defined infrastructures, we have kept pace with that. >> You guys have a nice end-to-end view right now going on. And certainly that example with the trucks highlights IOT right there. >> Exactly. You have a clear line of sight on IOT? >> Yup.
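
[Editor's note: the Navistar flow described above, sensor telemetry landing in Hadoop and then joined against warranty and parts data replicated from mainframe and ERP systems, is easy to picture as a Spark job. The sketch below is hypothetical: table names, columns, and thresholds are invented, and the scheduling of these steps is the part a workload-automation tool like Control-M would own.]

```python
# A rough sketch of the batch side of the truck-telemetry use case: flag
# trucks whose readings look risky, then attach warranty data pulled from
# systems of record. All names and thresholds are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("telematics-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

telemetry = spark.table("telemetry_events")   # ingested from the cloud feed
warranty = spark.table("warranty_records")    # replicated from ERP/mainframe

# Trucks with repeated high engine-temperature readings.
at_risk = (
    telemetry
    .where(F.col("engine_temp_c") > 110)
    .groupBy("truck_id")
    .agg(F.count("*").alias("hot_readings"))
    .where(F.col("hot_readings") >= 3)
)

# Marry the telemetry signal with warranty status so service can be
# scheduled at the nearest dealership.
actionable = at_risk.join(warranty, "truck_id", "left")
actionable.write.mode("overwrite").saveAsTable("service_candidates")
```
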
>> That would be the best measure of your maturity, the breadth of your integrations. >> Absolutely. And we don't stop at what we provide just out of the box. We realize that we have 30 to 35 out-of-the-box integrations, but there are a lot more applications than that. We have architected Control-M in a way where it can automate data loads on any application and any database that can expose itself through an API. That is huge, because if you think about the open-source world, by the time this conference is over, there's going to be a dozen new tools and projects that come online. And that's a big challenge for companies too. How do you keep pace with this and how do you (drowned out) all this? >> Well, I think people are starting to squint past the fashion aspect of open source, which I love by the way, but it does create more diversity. But, you know, some things become fashionable and then get big-time trashed. Look at Spark. Spark was beautiful. That one came out of the woodwork. George, you're tracking all the fashion. What's the hottest thing right now in open source? >> It seems to me that we've spent five-plus years building data lakes and now we're trying to take that data and apply the insights from it to applications. And, really, Control-M's value add, my understanding is, we have to go beyond Hadoop, because Hadoop was an island, you know, an island or a data lake, but now the insights have to be enacted on applications that go outside that ecosystem. And that's where Control-M comes in. >> Yeah, absolutely. We are that overarching layer that helps you connect your legacy systems and modern systems and bring it all into Hadoop. The story I tell when I'm explaining this to somebody is that you've installed Hadoop day one, great, guess what, it has no data in it. You've got to ingest data, and you have to be able to take a strategic approach to that, because you can use some point solutions and do scripting for the first couple of use cases, but as soon as the business gives you the green light and says, you know what, we really like what we've seen, now let's scale up, that's where you really need to take a strategic approach, and that's where Control-M comes in. >> So, let me ask then, if the bleeding edge right now is trying to operationalize the machine learning models that people are beginning to experiment with, just the way they were experimenting with data lakes five years ago, what role can Control-M play today in helping people take a trained model and embed it in an application so it produces useful actions and recommendations, and how much custom integration does that take? >> If we take the example of machine learning, if you peel the onion of machine learning, you've got data that needs to be moved, that needs to be constantly evaluated, and then the algorithms have to be run against it to provide the insights. So this in itself is exactly what Control-M allows you to do: ingest the data, process the data, let the algorithms process it, and then of course move it to a layer where people and other systems, it's not just about people anymore, it's other systems that'll analyze the data. And the important piece here is that we're allowing you to do this from a single pane of glass, and being able to see this picture end to end. All of this work is being done to drive business results, generating new revenue models, like in the case of Navistar.
Allowing you to capture all of this and then tie it to business SLAs, that is one of the most highly-rated capabilities of Control-M from our customers. >> This is the cloud equation we were talking about last week at Google Next. A combination of enterprise readiness across the board. The end-to-end is the picture, and you guys are in a good position. Congratulations, and thanks for coming on theCUBE. Really appreciate it. >> Absolutely, great to be here. >> It's theCUBE breaking it down here at Big Data World. This is the trend. It's an operating system world in the cloud. Big data with IOT, AI, machine learning. Big themes breaking out early on at Big Data SV, in conjunction with Strata + Hadoop. More right after this short break.
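
[Editor's note: the recurring pattern in this interview, ingest into Hadoop, process with engines like Spark or Hive, then publish refined data to the analytics layer, can be sketched as a simple dependency chain. The code below is deliberately generic Python for that pattern; it is not Control-M's actual job-definition syntax, which is declarative and layers scheduling, retries, and SLA management on top.]

```python
# The three-stage flow described above, as a plain dependency chain.
# Each step body is a stub; a real deployment would call ingestion APIs,
# submit Spark/Hive jobs, and export tables to the BI layer.
from typing import Callable, Dict, List

def ingest_from_sources() -> None:
    ...  # e.g. pull from ERP/CRM via their APIs, land files in HDFS

def process_with_spark() -> None:
    ...  # e.g. submit a Spark or Hive job to refine the raw data

def publish_to_analytics() -> None:
    ...  # e.g. move refined tables to the layer analysts query

steps: Dict[str, Callable[[], None]] = {
    "ingest": ingest_from_sources,
    "process": process_with_spark,
    "publish": publish_to_analytics,
}
deps: Dict[str, List[str]] = {
    "ingest": [],
    "process": ["ingest"],
    "publish": ["process"],
}

done: set = set()

def run(name: str) -> None:
    # Run prerequisites first, each step at most once.
    for d in deps[name]:
        if d not in done:
            run(d)
    steps[name]()
    done.add(name)

run("publish")  # pulls the whole chain through in order
```
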

Published Date : Mar 15 2017


Oliver Chiu, Microsoft & Wei Wang, Hortonworks | BigData SV 2017


 

>> Narrator: Live from San Jose, California, it's the CUBE, covering Big Data Silicon Valley 2017. >> Okay welcome back everyone, live in Silicon Valley, this is the CUBE coverage of Big Data Week, Big Data Silicon Valley, our event, in conjunction with Strata + Hadoop. This is the CUBE for two days of wall-to-wall coverage. I'm John Furrier with George Gilbert, our Wikibon big data analyst, as well as Peter Burris, covering all of the angles. And our next guests are Wei Wang, Senior Director of Product Marketing at Hortonworks, a CUBE alumni, and Oliver Chiu, Senior Product Marketing Manager for Big Data on Microsoft Azure. Guys, welcome to the CUBE, good to see you again. >> Yes. >> John: On the CUBE, appreciate you coming on. >> Thank you very much. >> So Microsoft and Hortonworks, you guys are no strangers. We have covered you guys many times on the CUBE, on HD Insight. You have some stuff happening here, and I was just talking about you guys this morning on another segment, like, saying hey, you know the distros need a Cloud strategy. So you have something happening tomorrow. Blog post going out. >> Wei: Yep. >> What's the news with Microsoft? >> So essentially I think that we are truly adopting cloud first. And you know that we have been really acquiring a lot of customers in the Cloud. We have announced in our earnings that more than a quarter of our customers actually already have a Cloud strategy. I want to give out a few statistics that Gartner told us actually last year: inquiries from their end users went up 57%, just on Hadoop and Microsoft Azure. So what we're here for is to talk about the next generation. We're putting out our latest and greatest innovation, which comes in the package of the release of HDP 2.6, that's our latest release. I think our last conversation was on 2.5. So 2.6 has our latest and newest innovations, put out cloud first, hence our partner here, Microsoft. We're going to put it on Microsoft HD Insight. >> That's super exciting. And, you know, Oliver, one of the things that we've been really fascinated with and covering for multiple years now is the transformation of Microsoft. Even prior to Satya, who's a CUBE alumni by the way, been on the CUBE, when we were at the XL event at Stanford. So, CEO of Microsoft, CUBE alumni, good to have that. But, it's interesting, right? I mean, the Open Compute Project. They donated a boatload of IP into the open-source. Heavily now open-source, Brendan Burns works for Microsoft. We're seeing a huge transformation of Microsoft. You've been working with Hortonworks for a while. Now, it's kind of coming together, and one of the things that's interesting is the trend that's teasing out on the CUBE all the time now is integration. We're seeing this flash point where okay, I've got some Hadoop, I've got a bunch of other stuff in the enterprise equation that's kind of coming together. And you know, things like IOT and AI are all around the corner as well. How are you guys getting this all packaged together? 'Cause this kind of highlights some of the things that are now integrated in with the tools you have. Give us an update. >> Yeah, absolutely. So for sure, just to kind of respond to the trend, Microsoft kind of made that transformation of being cloud first, you know, many years ago. And it's been great to partner with someone like Hortonworks, actually for the last four years, in bringing HD Insight as a first party Microsoft Cloud service.
And because of that, as we're building other Cloud services in Azure, we have over 60 services. Think about that. That's 60 PaaS and IaaS services in Microsoft, part of the Azure ecosystem. All of this is starting to get completely integrated with all of our other services. So HD Insight, as an example, is integrated with all of our relational investments, our BI investments, our machine learning investments, our data science investments. And so it's really just becoming part of the fabric of the Azure Cloud. And so that's a testament to the great partnership that we're having with Hortonworks. >> So the inquiry comment from Gartner, and we're seeing similar things on the Wikibon site on our research team, is that there's now the legitimacy of seeing how Hadoop fits into the bigger picture, not just Hadoop being the pure-play Big Data platform, which many people were doing. But now they're seeing a bigger picture where I can have Hadoop, and I can have some other stuff, all integrating. Is that all kind of where this is going from you guys' perspective? >> So yeah, again, some statistics: we have done a TechValidate survey, and our customers are telling us that 43% of the respondents are actually using that integrated approach, the hybrid. They're using the Cloud. They're using our stuff on-premise to actually provide integrated end-to-end processing workloads. I think people think about it less now; a couple years ago, people probably thought a little bit about what kind of data they want to put in the Cloud, what kind of workload they want to actually execute in the Cloud versus on their own premises. I think what we see is that line starting to blur a little bit. And given the partnership we have with Microsoft, the kind of enterprise-ready functionalities, and we talked about that for a long time last time I was here: talk about security, talk about governance, talk about just an integrated layer to manage the entire thing, either on-premise or in the Cloud. I think those are some of the functionalities, or some of the innovations, that make people a lot more at ease with the idea of putting entire mission-critical applications in the Cloud. And I want to mention that, especially with our blog going out tomorrow, we will actually announce Spark 2.1, on which, in Microsoft Azure HD Insight, we're actually going to guarantee a 99.9% SLA. Right, so it's for enterprise customers. For the many customers we have together, that is truly an assurance: people not only feel at ease about their data and where it's going to be located, either in the Cloud or within their data center, but also about the kind of speed, response, and reliability. >> Oliver, I want to queue off something you said which was interesting, that you have 60 services, and that they're increasingly integrated with each other. The idea that Hadoop itself is made up of many projects or services, and I think in some amount of time, we won't look at it as a discrete project or product, but something that's integrated together to make a pipeline, a mix-and-match. I'm curious if you can share with us a vision of how you see Hadoop fitting in with a richer set of Microsoft services, where it might be SQL Server, it might be streaming analytics, what that looks like, and so the issue of sort of a mix-and-match toolkit fades into a more seamless set of services. >> Yeah, absolutely.
And you're right, Hadoop, and Wei will definitely reiterate this, is that Hadoop is a platform, right, and certainly there are multiple different workloads and projects on that platform that do a lot of different things. You have Spark that can do machine learning and streaming and SQL-like queries, and you have Hadoop itself that can do batch, interactive, and streaming as well. So you see kind of a lot of workloads being built on open-source Hadoop. And as you bring it to the Cloud, it's really, for customers, what we found, and kind of this new Microsoft that is often talked about, is it's all about choice and flexibility for our customers. And so, some customers want to be 100% open-source Apache Hadoop, and if they want that, HD Insight is the right offering, and what we can do is we can surround it with other capabilities that are outside of maybe core Hadoop-type capabilities. Like if you want media services, all the way down to, you know, other technologies not related specifically to data and analytics. And so they can combine that with the Hadoop offering, and blend it into a combined offering. And there are some customers that will blend open-source Hadoop with some of our Azure data services as well, because it offers something unique or different. But it's really a choice for our customers. Whatever they're open to, whatever their kind of strategy is for their organization. >> Is there, just to kind of then compare it with other philosophies, do you see that notion that Hadoop now becomes a set of services that might or might not be mixed and matched with native services? Is that different from how Amazon or Google, you know, you perceive them to be integrating Hadoop into their sort of pipelines and services? >> Yeah, it's different, because I see Amazon and Google, like, for instance, Google kind of is starting to change their philosophy a little bit with the introduction of Dataproc. But before, you can see them as an organization that was really focused on bringing some of the internal learnings of Google into the marketplace with their own, you can say proprietary-type, services with some of the offerings that they have. But now they're kind of realizing the value that the Apache Hadoop ecosystem brings. And so, with that comes the introduction of their own managed service. And for AWS, their roots are IaaS, so to speak, that's kind of the roots of their Cloud, and they're starting to bring in kind of other systems, very similar to, I would say, Microsoft's strategy. For us, we are all about making things enterprise-ready. So that's the unique differentiator, and kind of what you alluded to. And so for Microsoft, all of our data services are backed by a 99.9% service-level agreement, including our relationship with Hortonworks. So that's kind of one, >> Just say that again, one more time. >> 99.9% uptime, and if, >> SLA. >> Oliver: SLA, and so that's a guarantee to our customers. So if anything we're, >> It's a guarantee to our customers. >> No, this is important. SLA, I mean Google Next, their Cloud event last week, didn't talk much about that. They talked about speeds and feeds, >> Exactly >> Not a lot of SLAs. This is a mandate for the enterprise. They care more about SLAs, so, not that they don't care about price, but they'd much rather have solid, bulletproof SLAs than the best price. 'Cause of the total cost of ownership. >> Right.
And that's really the heritage of where Microsoft comes from: we have been serving our on-premises customers for so long, we understand what they want and need and require for a mission-critical, enterprise-ready deployment. And so, our relationship with Hortonworks, absolutely, a 99.9% service-level agreement that we will guarantee to our customers, and across all of the Hadoop workloads, whether it be Hive, whether it be Spark, whether it be Kafka, any of the workloads that we have on HD Insight, it is enterprise-ready, mission-critical, built in, all that stuff that you would expect. >> Yeah, you guys certainly have a great track record with the enterprise. No debate about that, 100%. Um, back to you guys, I want to take a step back and look at some things we've been observing kicking off this week at Strata + Hadoop. This is our eighth year covering it; the Hadoop world now has evolved into a whole huge thing, with Big Data SV and Big Data NYC that we run as well. The bets that were made. And so, I've been intrigued by HD Insight from day one. >> Yep. >> Especially the relationship with Microsoft. Got our attention right away, because of where we saw the dots connecting, which is kind of where we are now. That's a good bet. We're looking at what bets were made and who's making which bets when, and how they're panning out, so I want to just connect the dots. Bets that you guys have made, and the bets that you guys have made that are now paying off, and certainly, as we've covered on camera before, Revolution Analytics. Obviously, now, looking real good, middle of the fairway as they say. Bets you guys have made that hey, that was a good call. >> Right, and we think that, first and foremost, there is our work to support machine learning; we don't call it AI, but we are probably the first one to put Spark, right, in Hadoop. I know that Spark has gained a lot of traction, but I remember that in the early days, we are the ones that, as a distro, not only just verbally talked about support of Spark, but truly put it in our distribution as one of the components. We actually now, in the last version, also allow flexibility. You know Spark, how often they change. Every six weeks they have a new version. And that kind of runs into a paradox with what enterprise-ready actually is. Within six weeks, they can't even roll out an entire process, right? If they have a workload, they probably can't even get everyone to adopt that yet, within six weeks. So what we did, actually, in the last version, and which we will continue to do, is to essentially support multiple versions of Spark, right, and we've talked about that. And the other bet we have made is about Hive. We truly made that kind of an initiative behind the Stinger project, and Hive now also ties in with LLAP. We made the effort to join in with all the other open-source developers to get behind this project, to make sure that SQL is truly available for our customers, right. Not only just affordable, but also having the most comprehensive coverage for SQL, including SQL 2011. But also now having that almost sub-second interactive query. So I think that's the kind of bet we made. >> Yeah, I guess the compatibility of SQL, then you got the performance advantage going on, and this database piece, whether it's in memory or on SSD. That seems to be the action. >> Wei: Yeah. >> Oliver, you guys made some good bets. So, let's go down the list. >> So let's go down memory lane.
I always kind of want to go back to our partnership with Hortonworks. We partnered with Hortonworks really early on, in the early days of Hortonworks' existence. And the reason we made that bet was because of Hortonworks' strategy of being completely open. Right, and so that was a key decision criterion for Microsoft. We wanted to partner with someone whose entire philosophy was open-source, committing everything back to the Apache ecosystem. And so that was a very strategic bet that we made. >> John: It was bold at the time, too. >> It was very bold, at the time, yeah. Because Hortonworks at that time was a much smaller company than they are today. But we kind of understood where the ecosystem was going, and we wanted to partner with people who were committing code back into the ecosystem. So that, I would argue, is definitely one really big bet that was a very successful one, and it continues to play out even today. Other bets that we've made, as we've talked about prior, is our acquisition of Revolution Analytics a couple years ago, and that's, >> R just keeps on rolling, it keeps on rolling, rolling, rolling. Awesome. >> Absolutely. Yeah. >> Alright, final words. Why don't we get updated on the data science experiences you guys have. Is there any update there? What's going on, what seems to be, the data science tools are accelerating fast. And, in fact, some are saying it looks like the software tools of years and years ago. A lot more work to do. So what's happening with the data science experience? >> Yeah, absolutely, and just tying back to that original comment around R, Revolution Analytics. That has become Microsoft R Server. And we're offering that, available on-premises and in the Cloud. So on-premises, it's completely integrated with SQL Server. So all SQL Server customers will now be able to do in-database analytics with R built into the core database. And that we see as a major win for us, and a differentiator in the marketplace. But in the Cloud, in conjunction with our partnership with Hortonworks, we're making Microsoft R Server available as part of our integration with Azure HD Insight. So we're kind of just tying back all that integration that we talked about. And so that's built in, and so any customer can take R, and parallelize that across any number of Hadoop and Spark nodes in a managed service within minutes. Clusters will spin up, and they can just run all their data science models and train them across any number of Hadoop and Spark nodes. And so that is, >> John: That takes the heavy lifting away on the cluster management side, so they can focus on their jobs. >> Oliver: Absolutely. >> Awesome. Well guys, thanks for coming on. We really appreciate Wei Wang with Hortonworks, and we have Oliver Chiu from Microsoft. Great to get the update, and tomorrow at 10:30 the cloud-first news hits. Cloud first, Hortonworks with Azure, great news, congratulations, good Cloud play for Hortonworks. This is theCUBE, I'm John Furrier with George Gilbert. More coverage live in Silicon Valley after this short break.
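
[Editor's note: one concrete form of the HD Insight integration discussed above is a Spark job on the cluster reading directly from a relational store such as SQL Server over JDBC. The sketch below is hypothetical: hostnames, credentials, tables, and storage paths are placeholders, and it assumes the Microsoft SQL Server JDBC driver jar is on Spark's classpath.]

```python
# Blend a SQL Server extract with data already stored alongside the
# HD Insight cluster. All connection details and names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-spark").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.example.com:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "spark_reader")
    .option("password", "********")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

# Clickstream data landed in the cluster's default Azure Blob storage.
clicks = spark.read.parquet("wasb:///data/clickstream/")

# Join the relational extract with the big-data side and aggregate.
orders.join(clicks, "customer_id").groupBy("region").count().show()
```
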

Published Date : Mar 15 2017


SENTIMENT ANALYSIS:

ENTITIES

Entity | Category | Confidence
John | PERSON | 0.99+
Microsoft | ORGANIZATION | 0.99+
George Gilbert | PERSON | 0.99+
Hortonworks | ORGANIZATION | 0.99+
Oliver | PERSON | 0.99+
Google | ORGANIZATION | 0.99+
Amazon | ORGANIZATION | 0.99+
Satya | PERSON | 0.99+
John Furrier | PERSON | 0.99+
Oliver Chiu | PERSON | 0.99+
Peter Buress | PERSON | 0.99+
43% | QUANTITY | 0.99+
AWS | ORGANIZATION | 0.99+
99.9% | QUANTITY | 0.99+
60 services | QUANTITY | 0.99+
IBM | ORGANIZATION | 0.99+
100% | QUANTITY | 0.99+
eighth year | QUANTITY | 0.99+
Silicon Valley | LOCATION | 0.99+
San Jose, California | LOCATION | 0.99+
Hadoop | TITLE | 0.99+
CUBE | ORGANIZATION | 0.99+
tomorrow 10:30 | DATE | 0.99+
Brendan Burns | PERSON | 0.99+
Hortonworks' | ORGANIZATION | 0.99+
last year | DATE | 0.99+
last week | DATE | 0.99+
SQL | TITLE | 0.99+
Spark | TITLE | 0.99+
57% | QUANTITY | 0.99+
tomorrow | DATE | 0.99+
Big Data Week | EVENT | 0.99+
two days | QUANTITY | 0.99+
Wei Wang | PERSON | 0.99+
Big Data | ORGANIZATION | 0.99+
Gardner | PERSON | 0.98+

Yuanhao Sun, Transwarp Technology - BigData SV 2017 - #BigDataSV - #theCUBE


 

>> Announcer: Live from San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017. (upbeat percussion music) >> Okay, welcome back everyone. Live here in Silicon Valley, San Jose, is the Big Data SV, Big Data Silicon Valley, in conjunction with Strata + Hadoop; this is theCUBE's exclusive coverage. Over the next two days, we've got wall-to-wall interviews with thought leaders and experts breaking down the future of big data, the future of analytics, the future of the cloud. I'm John Furrier with my co-host George Gilbert with Wikibon. Our next guest is Yuanhao Sun, who's the co-founder and CTO of Transwarp Technologies. Welcome to theCUBE. You were on 166 days ago, I noticed, on theCUBE, previously. But now you've got some news. So let's get the news out of the way. What are you guys announcing here, this week? >> Yes, so we are announcing 5.0, the latest version of Transwarp Data Hub. So in this version, we would call it probably a revolutionary product, because the first thing is we embedded Kubernetes in our product, so we allow people to isolate different kinds of workloads using Docker and containers, and we also provide a scheduler to better support mixed workloads. And the second is, we are building a set of tools that allow people to build their data warehouse, and then migrate from an existing or traditional data warehouse to Hadoop. And we are also providing people the capability to build a data mart, actually. It allows you to interactively query data. So we built a column store in memory and on SSD, and we wrote the whole SQL engine ourselves. That is a very tiny SQL engine that allows people to query data very quickly. And so today that tiny SQL engine is about five to ten times faster than Spark 2.0. And we also allow people to build cubes on top of Hadoop. And then, once the cube is built, the SQL performance, like the TPC-H performance, is about 100 times faster than an existing database or existing Spark 2.0. So it's super fast. And actually we found a Paralect customer, so they replaced their data warehouse software to build a data mart. And we already migrated, say, 100 reports from their data to our product. So the promise is very good. And the third one is we are providing tools for people to build machine learning pipelines, and we are leveraging TensorFlow, MXNet, and also Spark, for people to visualize the pipeline and to build the data mining workflows. So these are kind of like data science tools; it's very easy for people to use. >> John: Okay, so take a minute to explain, 'cus that was great, you got the performance there, that's the news out of the way. Take a minute to explain Transwarp, your value proposition, and when people engage you as a customer. >> Yuanhao: Yeah so, people choose our product, and the major reason is our compatibility with Oracle, DB2, and Teradata SQL syntax, because you know, they have built a lot of applications on those databases, so when they migrate to Hadoop, they don't want to rewrite the whole program; our compatibility, SQL compatibility, is a big advantage to them, so this is the first one. And we also support full ACID and distributed transactions on Hadoop, so that a lot of applications can be migrated to our product with few modifications or without any changes. So this is our first advantage. The second is because we are providing an event-based streaming engine that is actually derived from Spark. So we apply this technology to IoT applications.
You know, in IoT pretty soon they need very low latency, but they also need very complicated models on top of streams. So that's why we are providing full SQL support and machine learning support on top of streaming events. And we are also using event-driven technology to reduce the latency to five to ten milliseconds. So this is the second reason people choose our product. And then today we are announcing 5.0, and I think people will find more reasons to choose our product. >> So you have the SQL compatibility, you have the tooling, and now you have the performance. So kind of the triple threat there. So what's the customer saying? When you go out and talk with your customers, what's the view of the current landscape for customers? What are they solving right now, what are the key challenges and pain points that customers have today? >> We have customers in more than 12 vertical segments, and in different verticals they have different pain points, actually. Take one example: in financial services, the main pain point for them is to migrate existing legacy applications to Hadoop. You know, they have accumulated a lot of data, and the performance is very bad using legacy databases, so they need high-performance Hadoop and Spark to speed up the performance, like reports. But in another vertical, like in logistics and transportation and IoT, the pain point is to find a very low latency streaming engine. At the same time, they need a very complicated programming model to write their applications. And another example, like in the public sector: they actually need a very complicated and large-scale search engine. They need to build analytical capability on top of the search engine, so they can search the results and analyze the results at the same time. >> George: Yuanhao, as always, whenever we get to interview you on theCUBE, you toss out these gems, sort of like, you know, diamonds, like big rocks that under millions of years and incredible pressure have been squeezed down into these incredibly valuable, sort of, minerals with lots of goodness in them, so I need you to unpack that diamond back into something that we can make sense out of, or I should say, that's more accessible. You've done something that none of the Hadoop distro guys have managed to do, which is to build databases that are not just decision support, but can handle OLTP, can handle operational applications. You've done the streaming; you've done what even Databricks can't do without even trying any of the other stuff, which is getting the streaming down to an event at a time. Let's step back from all these amazing things, and tell us: what was the secret sauce that let you build a platform this advanced? >> So actually, we are driven by our customers, and we do see the trends; people are looking for better solutions. You know, there is a lot of pain in setting up a Hadoop cluster to use the Hadoop technology. So that's why we found it very meaningful and also very necessary for us to build a SQL database on top of Hadoop. Quite a lot of customers on the FSI side ask us to provide ACID, so that transactions can be put on top of Hadoop, because they have to guarantee the consistency of their data. Otherwise they cannot use the technology. >> At the risk of interrupting, maybe you can tell us why others have built the analytic databases on top of Hadoop, to give the familiar SQL access, and obviously have a desire also to have transactions next to it, so you can inform a transaction decision with the analytics.
One of the questions is, how did you combine the two capabilities? I mean it only took Oracle like 40 years. >> Right, so. Actually our transaction capability is only for analytics, you know, so this OLTP capability is not for short-term transactional applications; it's for data warehouse kinds of workloads. >> George: Okay, so when you're ingesting. >> Yes, when you're ingesting, when you modify your data in batch, you have to guarantee the consistency. So that's the OLTP capability. But we are also building another distributed storage and distributed database that will provide OLTP capability. That means you can do concurrent transactions on that database, but we are still developing that software right now. Today our product provides the distributed transaction capability for people to actually build their data warehouse. You know, quite a lot of people believe a data warehouse does not need transaction capability, but we found a lot of people modify their data in the data warehouse. You know, they are loading their data continuously to the data warehouse, like the CRM tables; customer information can be changed over time. So every day people need to update or change the data; that's why we have to provide transaction capability in the data warehouse. >> George: Okay, and then so then well tell us also, 'cus the streaming problem is, you know, we're told that roughly two thirds of Spark deployments use streaming as a workload. And the biggest knock on Spark is that it can't process one event at a time; you've got to do a little batch. Tell us some of the use cases that can take advantage of doing one event at a time, and how you solved that problem? >> Yuanhao: Yeah, so the first use case we encountered is the anti-fraud, or fraud detection, application in FSI. So whenever you swipe your credit card, the bank needs to tell you if the transaction is a fraud or not in a few milliseconds. But if you are using Spark Streaming, it will usually take 500 milliseconds, so the latency is too high for such kinds of applications. And that's why we have to provide event-at-a-time, meaning event-driven processing, to detect the fraud, so that we can interrupt the transaction in a few milliseconds. So that's one kind of application. The other comes from IoT applications; we already put our streaming framework in a large manufacturing factory. So they have to detect a malfunction of their equipment in a very short time, otherwise it may explode. So if you... So if you are using Spark Streaming, probably when you submit your application, it will take you hundreds of milliseconds, and when you finish your detection, it usually takes a few seconds, so that will be too long for such kinds of applications. And that's why we need a low latency streaming engine. But you can see it is okay to use Storm or Flink, right? And the problem is, we found: they need a very complicated programming model. If they are going to solve equations on the streaming events, they need to do FFT transformations, and they are also asking to run some linear regression or some neural network on top of events. So that's why we have to provide a SQL interface, and we are also embedding CEP capability into our streaming engine, so that you can use patterns to match the events and to send alerts. >> George: So, SQL to get a set of events, and maybe join some in the complex event processing, CEP, to say: does this fit a pattern I'm looking for? >> Yuanhao: Yes.
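A generic sketch, not Transwarp's actual engine, of the event-driven idea Yuanhao contrasts with micro-batching: each transaction is scored the moment it arrives, so the per-event latency budget can be enforced in milliseconds instead of waiting for a batch window to fill. The scoring rule and field names are invented.

```python
# Illustrative event-at-a-time processing: one callback per event,
# no batch window. A real CEP engine would match far richer patterns.
import time

def score(txn: dict) -> float:
    # Stand-in for a trained fraud model.
    return 1.0 if txn["amount"] > 10_000 else 0.0

def on_event(txn: dict) -> None:
    start = time.perf_counter()
    if score(txn) > 0.5:
        print(f"ALERT: possible fraud on card {txn['card']}")
    latency_ms = (time.perf_counter() - start) * 1000
    print(f"handled in {latency_ms:.3f} ms")

# Events arrive one at a time, not as a micro-batch.
for event in [{"card": "1234", "amount": 25_000},
              {"card": "5678", "amount": 40}]:
    on_event(event)
```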
>> Okay, and so, and then with the lightweight OLTP, that and any other new projects you're looking at, tell us perhaps the new use cases it would be appropriate for. >> Yuanhao: Yeah, so that's actually our next product, where we are going to solve the problem of large-scale OLTP transactions. You know, in China there is such a large population; in the public sector or in banks, they need to build highly scalable transaction systems so that they can support very high concurrent transactions at the same time. So that's why we are building such kind of technology. You know, in the past, people just divided transactions across multiple databases, like multiple Oracle instances or multiple MySQL instances. But the problem is: if the application is simple, you can very easily divide a transaction over the multiple instances of databases. But if the application is very complicated, especially when the ISV already wrote the applications based on Oracle or a traditional database, they already depend on the transaction systems. So that's why we have to build the same kind of transaction systems, so that we can support their legacy applications, but they can scale to hundreds of nodes, and they can scale to millions of transactions per second. >> George: On the transactional stuff? >> Yuanhao: Yes. >> Just correct me if I'm wrong, I know we're running out of time, but I thought Oracle only scales out when you're doing decision support work, not when you're doing OLTP; not that it, that it can only, that it can maybe stretch to ten nodes or something like that, am I mistaken? >> Yuanhao: Yes, they can scale to 16, or at most 32, nodes. >> George: For transactional work? >> For transaction work, but that's the theoretical limit. You know, like Google F1 and Google Spanner, they can scale to hundreds of nodes. But you know, the latency is higher than Oracle, because you have to use a distributed protocol to communicate with multiple nodes, so the latency is higher. >> On Google? >> Yes. >> On Google. The latency is higher on the Google? >> 'Cus it has to go like all the way to Europe and back. >> Oracle or Google latency, you said? >> Google, because if you are using a two-phase commit protocol you have to broadcast your request to multiple nodes, and then wait for the feedback, so that means you have a much higher latency, but it's necessary to maintain the consistency. So in distributed OLTP databases, the latency is usually higher, but the concurrency is also much higher, and scalability is much better. >> George: So that's a problem you've stretched beyond what Oracle's done. >> Yuanhao: Yes, so because customers can tolerate the higher latency, but they need to scale to millions of transactions per second, that's why we have to build a distributed database. >> George: Okay, for this reason we're going to have to have you back for like maybe five or ten consecutive segments, you know, maybe starting tomorrow. >> We're going to have to get you back for sure. Final question for you: What are you excited about, from a technology standpoint, in the landscape? As you look at open source, you're working with Spark, you mentioned Kubernetes, you have microservices, all the cloud. What are you most excited about right now in terms of new technology that's going to help simplify and scale, with low latency, the databases, the software? 'Cus you got IoT, you got autonomous vehicles, you have all this data. What are you excited about?
>> So actually, this technology, we already solve these problems, actually, but I think the most exciting thing is we found... There are two trends. The first trend is: we found it's very exciting to see more computation frameworks coming out, like the AI frameworks, like TensorFlow and MXNet, Torch, and tons of such machine learning frameworks are coming out. So they are solving different kinds of problems, like facial recognition from video and images, like human-computer interaction using voice, using audio. So it's very exciting, I think. And also we found it's very exciting that we are embedding these, we are combining these technologies together, and that's why we are using containers, you know. We didn't use YARN, because it cannot support TensorFlow or other frameworks, but you know, if you are using containers and if you have a good scheduler, you can schedule any kind of computation framework. So we found it's very interesting to have these new frameworks, and we can combine them together to solve different kinds of problems. >> John: Thanks so much for coming onto theCUBE. It's an operating system world we're living in now; it's a great time to be a technologist. Certainly the opportunities are out there, and we're breaking it down here inside theCUBE, live in Silicon Valley, with the best tech executives, best thought leaders and experts here inside theCUBE. I'm John Furrier with George Gilbert. We'll be right back with more after this short break. (upbeat percussive music)
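A rough sketch of the container-scheduling idea Yuanhao closes with: where YARN cannot run frameworks like TensorFlow, a container scheduler can run anything that ships as an image. This uses the official kubernetes Python client; the image, namespace, and job names are invented.

```python
# Hypothetical submission of a containerized TensorFlow training job to
# Kubernetes, standing in for any computation framework the same
# scheduler could run (Spark, MXNet, serverless functions, ...).
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="tf-train"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="tensorflow",
                    image="registry.example.com/tf-train:latest",
                    command=["python", "/app/train.py"],
                )],
            )
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="analytics", body=job)
```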

Published Date : Mar 14 2017


SENTIMENT ANALYSIS:

ENTITIES

Entity | Category | Confidence
George Gilbert | PERSON | 0.99+
George | PERSON | 0.99+
John | PERSON | 0.99+
John Furrier | PERSON | 0.99+
China | LOCATION | 0.99+
five | QUANTITY | 0.99+
Europe | LOCATION | 0.99+
Transwarp Technologies | ORGANIZATION | 0.99+
40 years | QUANTITY | 0.99+
500 milliseconds | QUANTITY | 0.99+
Silicon Valley | LOCATION | 0.99+
San Jose, California | LOCATION | 0.99+
hundreds of nodes | QUANTITY | 0.99+
Hadoop | TITLE | 0.99+
Today | DATE | 0.99+
ten nodes | QUANTITY | 0.99+
first | QUANTITY | 0.99+
Oracle | ORGANIZATION | 0.99+
100 reports | QUANTITY | 0.99+
tomorrow | DATE | 0.99+
second | QUANTITY | 0.99+
first one | QUANTITY | 0.99+
Yuanhao Sun | PERSON | 0.99+
second reason | QUANTITY | 0.99+
Spark 2.0 | TITLE | 0.99+
today | DATE | 0.99+
this week | DATE | 0.99+
ten times | QUANTITY | 0.99+
16 | QUANTITY | 0.99+
two trends | QUANTITY | 0.99+
Yuanhao | PERSON | 0.99+
SQL | TITLE | 0.99+
Spark | TITLE | 0.99+
first trend | QUANTITY | 0.99+
two capabilities | QUANTITY | 0.98+
Silicon Valley, San Jose | LOCATION | 0.98+
TensorFlow | TITLE | 0.98+
one event | QUANTITY | 0.98+
32 nodes | QUANTITY | 0.98+
theCUBE | ORGANIZATION | 0.98+
Torch | TITLE | 0.98+
166 days ago | DATE | 0.98+
one example | QUANTITY | 0.98+
more than 12 vertical segments | QUANTITY | 0.97+
ten milliseconds | QUANTITY | 0.97+
hundreds of milliseconds | QUANTITY | 0.97+
two thirds | QUANTITY | 0.97+
MXNet | TITLE | 0.97+
Databricks | ORGANIZATION | 0.96+
Google | ORGANIZATION | 0.96+
ten consecutive segments | QUANTITY | 0.95+
first use | QUANTITY | 0.95+
Wikibon | ORGANIZATION | 0.95+
Big Data Silicon Valley | ORGANIZATION | 0.95+
Strata Hadoop | ORGANIZATION | 0.95+
about 100 times | QUANTITY | 0.94+
Big Data SV | ORGANIZATION | 0.94+
One of | QUANTITY | 0.94+

Arik Pelkey, Pentaho - BigData SV 2017 - #BigDataSV - #theCUBE


 

>> Announcer: Live from San Jose, California, it's the Cube covering Big Data Silicon Valley 2017. >> Welcome back, everyone. We're here live in Silicon Valley in San Jose for Big Data SV, in conjunction with Strata + Hadoop. Three days of coverage here in Silicon Valley and Big Data. It's our eighth year covering Hadoop and the Hadoop ecosystem, now expanding beyond just Hadoop into AI, machine learning, IoT, and cloud computing, where all this compute is really making it happen. I'm John Furrier with my co-host George Gilbert. Our next guest is Arik Pelkey, who is the senior director of product marketing at Pentaho, which we've covered many times, and we covered their event at PentahoWorld. Thanks for joining us. >> Thank you for having me. >> So, in following you guys, I see Pentaho was once an independent company, bought by Hitachi, but still an independent group within Hitachi. >> That's right, very much so. >> Okay, so you guys have some news. Let's just jump into the news. You guys announced some of the machine learning. >> Exactly, yeah. So, Arik Pelkey, Pentaho. We are a data integration and analytics software company. You mentioned you've been doing this for eight years. We have been at Big Data for the past eight years as well. In fact, we're one of the first vendors to support Hadoop back in the day, so we've been along for the journey ever since then. What we're announcing today is really exciting. It's a set of machine learning orchestration capabilities, which allows data scientists, data engineers, and data analysts to really streamline their data science processes. Everything from ingesting new data sources, through data preparation and feature engineering, which is where a lot of data scientists spend their time, through tuning their models, which can still be programmed in R, in Weka, in Python, and any other kind of data science tool of choice. What we do is we help them deploy those models as a step inside of Pentaho, and then we help them update those models as time goes on. So, really what this is doing is it's streamlining. It's making them more productive so that they can focus their time on things like model building rather than data preparation and feature engineering. >> You know, it's interesting. The market is really active right now around machine learning, and even just last week at Google Next, which is their cloud event, they made the acquisition of Kaggle, which is kind of an open data science community. You mentioned the three categories: data engineer, data scientist, data analyst, almost on a progression, super geek to business facing, and there's different approaches. One of the comments from the CEO of Kaggle on the acquisition, in our write-up at SiliconANGLE, was, and I found this fascinating, I want to get your commentary and reaction to it: he said the data science tools are where software tools were generations ago, meaning that all the advances in open source and tooling for software development are far along, but data science is still at that early stage and is going to get better. So, what's your reaction to that? Because this is really the demand we're seeing: a lot of heavy lifting going on in the data science world, yet there's a lot of runway of more stuff to do. What is that more stuff? >> Right. Yeah, we're seeing the same thing. Last week I was at the Gartner Data and Analytics conference, and that was kind of the take there; from one of their lead machine learning analysts, this is still really early days for data science software.
So, there's a lot of Apache projects out there. There's a lot of other open source activity going on, but there are very few vendors that bring to the table an integrated, kind of full-platform approach to the data science workflow, and that's what we're bringing to market today. Let me be clear: we're not trying to replace R, or Python, or MLlib, because those are the tools of the data scientists. They're not going anywhere. They spent eight years in their PhD programs working with these tools. We're not trying to change that. >> They're fluent with those tools. >> Very much so. They're also spending a lot of time doing feature engineering; some research reports say between 70 and 80% of their time. What we bring to the table is a visual drag-and-drop environment to do feature engineering in a much faster, more efficient way than before. So, there's a lot of different kind of disparate, siloed applications out there that all do interesting things on their own, but what we're doing is we're trying to bring all of those together. >> And the trends are: reduce the time it takes to do stuff, and take away some of those tasks that you can use machine learning for. What unique capabilities do you guys have? Talk about that for a minute, just what Pentaho is doing that's unique and added value to those guys. >> So, the big thing is I keep going back to the data preparation part. I mean, that's 80% of the time, and that's still a really big challenge. There's other vendors out there that focus on just the data science kind of workflow, but where we're really unique is around being able to accommodate very complex data environments, and being able to onboard data. >> Give me an example of those environments. >> Geospatial data combined with data from your ERP or your CRM system, in all kinds of different formats. So, there might be 15 different data formats that need to be blended together and standardized before any of that can really happen. That's the complexity in the data. So Pentaho, very consistent with everything else that we do outside of machine learning, is all about helping our customers solve those very complex data challenges before doing any kind of machine learning. One example is one customer called Caterpillar Marine Asset Intelligence. So, they're doing predictive maintenance onboard container ships and on ferries. So, they're taking data from hundreds and hundreds of sensors onboard these ships, combining that kind of operational sensor data together with geospatial data, and then they're serving up predictive maintenance alerts, if you will, or giving signals when it's time to replace an engine or replace a compressor or something like that. >> Versus waiting for it to break. >> Versus waiting for it to break, exactly. That's one of the real differentiators: that very complex data environment. And then I was starting to move toward the other differentiator, which is our end-to-end platform, which allows customers to deliver these analytics in an embedded fashion. So, kind of full circle, being able to send that signal, but not to an operational system, which is sometimes a challenge because you might have to rewrite the code. Deploying models is a really big challenge; within Pentaho, because it is this fully integrated application, you can deploy the models within Pentaho and not have to jump out into a mainframe environment or something like that.
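Pentaho does this with visual drag-and-drop steps rather than code, but as a rough sketch of the blending and feature engineering just described, here is the same idea in plain Python: sensor readings joined with geospatial records, then rolled up into features a maintenance model could consume. The files and column names are invented.

```python
# Hypothetical blend of two of the "15 different formats": CSV sensor
# readings and JSON position reports, standardized into one frame.
import pandas as pd

sensors = pd.read_csv("engine_sensors.csv")      # ship_id, ts, vibration, temp
positions = pd.read_json("ship_positions.json")  # ship_id, ts, lat, lon

# Align each reading with the nearest position report in time.
blended = pd.merge_asof(
    sensors.sort_values("ts"),
    positions.sort_values("ts"),
    on="ts", by="ship_id", direction="nearest",
)

# Feature engineering: rolling averages per ship that a predictive
# maintenance model could score against.
features = (blended
            .sort_values(["ship_id", "ts"])
            .groupby("ship_id")[["vibration", "temp"]]
            .rolling(window=24, min_periods=1).mean()
            .reset_index())
print(features.head())
```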
So, I'd say the differentiators are very complex data environments, and then this end-to-end approach where deploying models is much easier than ever before. >> Perhaps, let's talk about alternatives that customers might see. You have a tool suite, and others might have to put together a suite of tools. Maybe tell us some of those; the geeky version would be the impedance mismatch, you know, like the chasms you'd find between each tool where you have to glue them together. So what are some of those pitfalls? >> One of the challenges is, you have these data scientists working in silos oftentimes. You have data analysts working in silos; you might have data engineers working in silos. One of the big pitfalls is not really collaborating enough to the point where they can do all of this together. So, that's a really big area where we see pitfalls. >> Is it binary not collaborating, or is it that the round trip takes so long that the quality or number of collaborations is so drastically reduced that the output is of lower quality? >> I think it's probably a little bit of both. I think they want to collaborate, but one person might sit in Dearborn, Michigan, and the other person might sit in Silicon Valley, so there's just a location challenge as well. The other challenge is, some of the data analysts might sit in IT and some of the data scientists might sit in an analytics department somewhere, so it kind of cuts across both location and functional area too. >> So let me ask from the point of view of, you know, we've been doing these shows for a number of years, and most people have their first data lakes up and running and their first maybe one or two use cases in production. Very sophisticated customers have done more, but what seems to be clear is the highest value coming from those projects isn't to put a BI tool in front of them so much as to do advanced analytics on that data, and apply those analytics to inform a decision, whether by a person or a machine. >> That's exactly right. >> So, how do you help customers over that hump, and what are some other examples that you can share? >> Yeah, so speaking of transformative: I mean, that's what machine learning is all about. It helps companies transform their businesses. We like to talk about that at Pentaho. One customer kind of industry example that I'll share is a company called IMS. IMS is in the business of providing data and analytics to insurance companies so that the insurance companies can price insurance policies based on usage. So, it's a usage model. IMS has a technology platform where they put sensors in a car, and then, using your mobile phone, they can track your driving behavior. Then, your insurance premium that month reflects the driving behavior that you had during that month. In terms of transformative, this is completely upending the insurance industry, which has always had a very fixed approach to pricing risk. Now, they understand everything about your behavior. You know, are you turning too fast? Are you braking too fast? And they're taking it further than that too. They're able to now do kind of a retroactive look at an accident. So, after an accident, they can go back and kind of decompose what happened in the accident and determine whether or not it was your fault, or whether it was in fact the ice on the street. So, transformative? I mean, this is just changing things in a really big way. >> I want to get your thoughts on this. I'm just looking at some of the research. You know, we always have the good data, but there's also other data out there.
In your news, 92% of organizations plan to deploy more predictive analytics; however, 50% of organizations have difficulty integrating predictive analytics into their information architectures, which is what the research shows. So my question to you is: there's a huge gap between the technology landscapes of front-end BI tools and complex data integration tools. That seems to be the sweet spot where the value's created. So, you have the demand, and front-end BI is kind of sexy and cool: wow, I could power my business. But the complexity is really hard in the backend. Who's accessing it? What are the data sources? What's the governance? All these things are complicated, so how do you guys reconcile the front-end BI tools and the backend complexity integrations? >> Our story from the beginning has always been this one integrated platform, both for complex data integration challenges together with visualizations, and that's very similar to what this announcement is all about for the data science market. We're very much in line with that. >> So, is it the cart before the horse? Is it like the BI tools are really driven by the data? I mean, it makes sense that the data has to be key. Front-end BI could be easy if you have one data set. >> It's funny you say that. I presented at the Gartner conference last week, and my topic was: this just in, it's not about analytics. Kind of in jest, but it drew a really big crowd. So, it's about the data, right? It's about solving the data problem before you solve the analytics problem, whether it's a simple visualization or it's a complex fraud machine learning problem. It's about solving the data problem first. To that quote, I think one of the things they were referencing was the challenging information architectures into which companies are trying to deploy models, and so part of that is: when you build a machine learning model, you use R and Python and all these other ones we're familiar with. In order to deploy that into a mainframe environment, someone has to then recode it in C++ or COBOL or something else. That can take a really long time. With our integrated approach, once you've done the feature engineering and the data preparation using our drag-and-drop environment, what's really interesting is that you're like 90% of the way there in terms of making that model production-ready. So, you don't have to go back and change all that code; it's already there because you used it in Pentaho. >> So obviously for those two technology groups I just mentioned, I think you had a good story there, but it creates problems. You've got product gaps, you've got organizational gaps, you have process gaps between the two. Are you guys going to solve that, or are you currently solving that today? There's a lot of little questions in there, but that seems to be the disconnect. You know, I can do this, I can do that; do I do them together? >> I mean, sticking to my story of one integrated approach to being able to do the entire data science workflow, from beginning to end, that's where we've really excelled; to the extent that more and more data engineers and data analysts and data scientists can get on this one platform, even if they're using R and Weka and Python. >> You guys want to close those gaps down, that's what you guys are doing, right? >> We want to make the process more collaborative and more efficient. >> So Dave Vellante has a question on CrowdChat for you. Dave Vellante was in the snowstorm in Boston.
Dave, good to see you, hope you're doing well shoveling out the driveway. Thanks for coming in digitally. His question is: HDS has been known for mainframes and storage, but Hitachi is an industrial giant. How is Pentaho leveraging Hitachi's IoT chops? >> Great question, thanks for asking. Hitachi acquired Pentaho about two years ago; this is before my time. I've been with Pentaho for about ten months. One of the reasons that they acquired Pentaho is because of a platform that they've announced, which is called Lumada, their IoT platform. So what Pentaho is, is the analytics engine that drives that IoT platform, Lumada. So, Lumada is about solving more of the hardware sensor side, bringing data from the edge in, to being able to do the analytics. So, it's an incredibly great partnership between Lumada and Pentaho. >> Makes an internal customer too. >> It's a 90 billion dollar conglomerate, so yeah, the acquisition's been great, and we're still very much an independent company going to market on our own, but we now have a much larger channel through Hitachi's reps around the world. >> You've got IoT's use case right there in front of you. >> Exactly. >> But you are leveraging it big time, that's what you're saying? >> Oh yeah, absolutely. We're a very big part of their IoT strategy. It's the analytics. Both of the examples that I shared with you are in fact IoT, not by design, but it's because there's a lot of demand. >> You guys seeing a lot of IoT right now? >> Oh yeah. We're seeing a lot of companies coming to us who have just hired a director or vice president of IoT to go out and figure out the IoT strategy. A lot of these are manufacturing companies, or coming from industries that are inefficient. >> Digitizing the business model. >> So the other point about Hitachi that I'll make, as it relates to data science: as a 90 billion dollar manufacturing and otherwise giant, we have a very deep bench of PhD data scientists that we can go to when there are very complex data science problems to solve at customer sites. So, if a customer's struggling with some of the basics, like how do I get up and running doing machine learning, we can bring our bench of data scientists at Hitachi to bear in those engagements, and that's a really big differentiator for us. >> Just to be clear, and one last point: you've talked about how you handle the entire life cycle of modeling, from acquiring the data and prepping it, all the way through to building a model, deploying it, and updating it, which is a continuous process. I think as we've talked about before, data scientists, or just the DevOps community, have had trouble operationalizing the end of the model life cycle, where you deploy it and update it. Tell us how Pentaho helps with that. >> Yeah, it's a really big problem, and it's a very simple solution inside of Pentaho. It's basically a step inside of Pentaho. So, in the case of fraud, let's say, for example, a prediction might say fraud, not fraud, fraud, not fraud, whatever it is. We can then bring that kind of full lifecycle back into the data workflow at the beginning. It's a simple drag-and-drop step inside of Pentaho to say which were right and which were wrong, and feed that back into the next prediction. We could also take it one step further, where there has to be a manual part of this too, where it goes to the customer service center; they investigate, and they say yes fraud, no fraud, and then that gets funneled back into the next prediction.
So yeah, it's a big challenge, and it's something that's relatively easy for us to do, just as part of the data science workflow inside of Pentaho. >> Well Arik, thanks for coming on theCUBE. We really appreciate it; good luck with the rest of the week here. >> Yeah, very exciting. Thank you for having me. >> You're watching theCUBE here live in Silicon Valley, covering Strata + Hadoop, and of course our Big Data SV event. We also have a companion event called Big Data NYC. We program with O'Reilly Strata + Hadoop, and of course have been covering Hadoop really since it was founded. This is theCUBE, I'm John Furrier, with George Gilbert. We'll be back with more live coverage today for the next three days here inside theCUBE after this short break.
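Inside Pentaho the feedback Arik describes is a drag-and-drop step; as a generic code sketch of the same loop, scikit-learn's partial_fit can stand in for "feed the verified outcomes back into the next prediction." The features and labels below are invented.

```python
# Illustrative model-update loop: score transactions, collect the
# fraud team's verified labels, and fold them back into the model.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
X_train = np.array([[50.0, 1], [9000.0, 0], [20.0, 1], [7500.0, 0]])
y_train = np.array([0, 1, 0, 1])  # features: [amount, known_merchant]; 1 = fraud
model.partial_fit(X_train, y_train, classes=[0, 1])

# Score new transactions.
X_new = np.array([[8200.0, 0], [15.0, 1]])
print(model.predict(X_new))

# The customer service center investigates and returns verified labels,
# which update the model so the next prediction reflects the outcome.
verified = np.array([1, 0])
model.partial_fit(X_new, verified)
```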

Published Date : Mar 14 2017


SENTIMENT ANALYSIS:

ENTITIES

Entity | Category | Confidence
George Gilbert | PERSON | 0.99+
Hitachi | ORGANIZATION | 0.99+
Dave Alonte | PERSON | 0.99+
Pentaho | ORGANIZATION | 0.99+
Dave | PERSON | 0.99+
90% | QUANTITY | 0.99+
Arik Pelkey | PERSON | 0.99+
Boston | LOCATION | 0.99+
Silicon Valley | LOCATION | 0.99+
Hitatchi | ORGANIZATION | 0.99+
John Furrier | PERSON | 0.99+
one | QUANTITY | 0.99+
50% | QUANTITY | 0.99+
eight years | QUANTITY | 0.99+
Arick | PERSON | 0.99+
One | QUANTITY | 0.99+
Lumata | ORGANIZATION | 0.99+
Last week | DATE | 0.99+
two technologies | QUANTITY | 0.99+
15 different data formats | QUANTITY | 0.99+
first | QUANTITY | 0.99+
92% | QUANTITY | 0.99+
One example | QUANTITY | 0.99+
Both | QUANTITY | 0.99+
Three days | QUANTITY | 0.99+
Python | TITLE | 0.99+
Kaggle | ORGANIZATION | 0.99+
one customer | QUANTITY | 0.99+
today | DATE | 0.99+
eighth year | QUANTITY | 0.99+
last week | DATE | 0.99+
Santa Fe, California | LOCATION | 0.99+
two | QUANTITY | 0.99+
each tool | QUANTITY | 0.99+
90 billion dollar | QUANTITY | 0.99+
80% | QUANTITY | 0.99+
Caterpillar | ORGANIZATION | 0.98+
both | QUANTITY | 0.98+
NYC | LOCATION | 0.98+
first data | QUANTITY | 0.98+
Pentaho | LOCATION | 0.98+
San Jose | LOCATION | 0.98+
The Cube | TITLE | 0.98+
Big Data SV | EVENT | 0.97+
COBOL | TITLE | 0.97+
70 | QUANTITY | 0.97+
C++ | TITLE | 0.97+
IMS | TITLE | 0.96+
MLlib | TITLE | 0.96+
one person | QUANTITY | 0.95+
R | TITLE | 0.95+
Big Data | EVENT | 0.95+
Gardener Data and Analytics | EVENT | 0.94+
Gardner | EVENT | 0.94+
Strata Hadoop | TITLE | 0.93+

Yaron Haviv | BigData SV 2017


 

>> Announcer: Live from San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017. (upbeat synthesizer music) >> Live with theCUBE coverage of Big Data Silicon Valley, or Big Data SV, #BigDataSV, in conjunction with Strata + Hadoop. I'm John Furrier with theCUBE, and my co-host George Gilbert, analyst at Wikibon. I'm excited to have our next guest, Yaron Haviv, who's the founder and CTO of iguazio, and just wrote a post up on SiliconANGLE; check it out. Welcome to theCUBE. >> Thanks, John. >> Great to see you. You're in a guest blog this week on SiliconANGLE, and always great on Twitter, 'cause Dave Vellante always likes to bring you into the contentious conversations. >> Yaron: I like the controversial ones, yes. (laughter) >> And you add a lot of good color on that. So let's just get right into it. So your company's doing some really innovative things. We were just talking before we came on camera here, about some of the amazing performance improvements you guys have on many different levels. But first take a step back, and let's talk about what this continuous analytics platform is, because it's unique, it's different, and it's got impact. Take a minute to explain. >> Sure, so first a few words on iguazio. We're developing a data platform which is unified, so basically it can ingest data through many different APIs, and it's more like a cloud service. It is for on-prem and edge locations and co-location, but it's managed more like a cloud platform, so a very similar experience to Amazon. >> John: It's software? >> It's software. We do integrate a lot with hardware in order to achieve our performance, which is really about 10 to 100 times faster than what exists today. We've talked to a lot of customers, and what we really want to focus on with customers is solving business problems, because I think a lot of the Hadoop camp started with more solving IT problems. So IT goes kicking tires, and eventually failing, based on your statistics and Gartner statistics. So what we really wanted to solve is big business problems. We figured out that this notion of pipeline architecture, where you ingest data, and then curate it, and fix it, et cetera, was very good for the early days of Hadoop; if you think about how Hadoop started, it was page ranking from Google. There was no time sensitivity. You could take days to calculate it and recalibrate your search engine. Based on new requirements, everyone is now looking for real-time insights. So there is sensory data from (mumbles), there's stock data from exchanges, there is fraud data from banks, and you need to act very quickly. So this notion of, and I can give you examples from customers, this notion of taking data, creating Parquet files and log files, and storing them in S3, and then taking Redshift and analyzing them, and then maybe a few hours later having an insight, this is not going to work. And what you need to fix is, you have to put some structure into the data. Because if you need to update a single record, you cannot just create a huge file of 10 gigabytes and then analyze it. So what we did is, basically, a mechanism where you ingest data. As you ingest the data, you can run multiple different processes on the same thing. And you can also serve the data immediately, okay?
And two examples that we demonstrate here at the show: one is video surveillance, a very nice movie-style example, where you, basically, ingest pictures through the S3 API, the object API; you analyze the picture to detect faces, to detect scenery, to extract geolocation from pictures and all that, all through different processes. TensorFlow doing one; serverless functions that we have do other, simpler tasks. And at the same time, you can have dashboards that just show everything, and you can have Spark, which basically does queries of: where was this guy last seen? Or who was he with? You know, or think about the Boston Bomber example. You could just do it in real time, because you don't need this notion of pipeline. And this solves very hard business problems for some of the customers we work with. >> So that's the key innovation: there's no pipelining. And what's the secret sauce? >> So first, our system does about a couple of million transactions per second. And we are a multi-modal database. So, basically, you can ingest data as a stream, and exactly the same data could be read by Spark as a table. So you could, basically, issue a query on the same data: give me everything that has a certain pattern or something. And it could also be served immediately through RESTful APIs to a dashboard running AngularJS or something like that. So that's the secret sauce: by having this integration, and this unique data model, it allows all those things to work together. There are other aspects, like we have transactional semantics. One of the challenges is how you make sure that a bunch of processes don't collide when they update the same data. So first you need very fine granularity, 'cause each one may update a different field. Like this example that I gave with GeoData: the serverless function that does the GeoData extraction only updates the GeoData fields within the records, and maybe TensorFlow updates information about the image in a different location in the record or, potentially, a different record. So you have to have that, along with transaction safety, along with security. We have very tight security at the field level, identity level. So that's re-thinking the entire architecture. And I think what many of the companies you'll see at the show, they'll say: okay, Hadoop is a given, let's build some sort of convenience tools around it, let's do some scripting, let's do automation. But still, the underlying thing, I won't use dirty words, but it is not well-equipped for the new challenges of real time. We basically restructured everything. We took the notions of cloud-native architectures, we took the notions of Flash and the latest Flash technologies, a lot of parallelism on CPUs. We didn't take anything for granted on the underlying architecture. >> So when you founded the company, take a personal story here: what was the itch you were scratching? Why did you get into this? Obviously, you have a huge tech advantage, which is, we'll double down on with the research piece, and George will have some questions. What got you going with the company? You got a unique approach; people would love to do away with the pipeline, that sounds great. And the performance, you said, about 100x. So how did you get here? (laughs) Tell the story.
So if you know my background, I ran all the data center activities at Mellanox, and you know Mellanox, I know Kevin was here. And my role was to take Mellanox technology, which is 100-gig networking and silicon, and fit it into the different applications. So I worked with SAP HANA, I worked with Teradata, I worked on Oracle Exadata, I worked with all the cloud service providers on building their own object storage and NoSQL and other solutions. I also owned all the open source activities around Hadoop and Ceph and all those projects, and my role was to fix many of those. If a customer says, I don't need 100 gig, it's too fast for me, how do I respond? My role was to convince him that yes, I can open up all the bottlenecks all the way up your stack so you can leverage those new technologies. And through that we basically saw the inefficiencies in those stacks. >> So you had a good purview of the marketplace. >> Yaron: Yes. >> You had open source on one hand, and then all the-- >> All the storage players, >> vendors, network. >> all the database players and all the cloud service providers were my customers. So you're at a very unique point where you see the trajectory of cloud, doing things totally differently, and sometimes I see the trajectory of enterprise storage, SAN, NAS, you know, all Flash, all that, legacy technologies, where cloud providers are all about object, key value, NoSQL. And you're trying to convince those guys that maybe they were going the wrong way. But it's pretty hard. >> Are they going the wrong way? >> I think they are going the wrong way. Everyone, for example, is running to do NVMe over Fabrics now; that's the new fashion. Okay, I did the first implementation of NVMe over Fabrics, in my team at Mellanox. And I really loved it, at that time, but databases cannot run on top of storage area networks, because there are serialization problems. Okay, if you use a storage area network, that means that every node in the cluster has to go and serialize an operation against the shared media. And that's not how Google and Amazon work. >> There's a lot more databases out there too, and a lot more data sources. You've got the Edge. >> Yeah, but all the new databases, all the modern databases, they basically shard the data across the different nodes, so there are no serialization problems. So that's why Oracle doesn't scale, or scales to 10 nodes at best, with a lot of RDMA as a backplane, to allow that. And that's why Amazon can scale to a thousand nodes, or Google-- >> That's the horizontally-scalable piece that's happening. >> Yeah, because, basically, the distribution has to move into the higher layers of the data, and not the lower layers of the data. And that's really the trajectory where the traditional legacy storage and system vendors are going, and we sort of followed the way the cloud guys went, just with our knowledge of the infrastructure; we sort of did it better than what the cloud guys did. 'Cause the cloud guys focused more on the higher levels of the implementation, the algorithms, the Paxos, and all that. Their implementation is not that efficient. And we did both sides extremely efficiently. >> How about the Edge? 'Cause Edge is now part of cloud, and cloud has got the compute, all the benefits, you were saying, and still they have their own consumption opportunities and challenges that everyone else does. But Edge is now exploding. The combination of those things coming together, at the intersection of that, is deep learning, machine learning, which is powering the AI hype. So how is the Edge factoring into your plan and overall architectures for the cloud?
>> Yeah, so I wrote a bunch of posts that are not published yet about the Edge. But my analysis, along with your analysis and Peter Levine's analysis, is that cloud has to start to distribute more. Because if you're looking at the trends: five gig, 5G, in wireless networking is going to be gigabit traffic. Gigabit to the home, which you can buy from Google for 70 bucks a month. It's going to push a lot more bandwidth to the Edge. At the same time, cloud providers, in order to lower costs and deal with energy problems, are going to rural areas. The traditional way we solved cloud problems was to put CDNs, so every time you download a picture or video, you go to a CDN. When you go to Netflix, you don't really go to Amazon, you go to a Netflix pop, one of 250 locations. The new workloads are different, because they're no longer pictures that need to be cached. First, there is a lot of data going up: sensory data, uploaded files, et cetera. Data is becoming a lot more structured. Sensor data is structured. All this car information will be structured. And you want to (mumbles) digest or summarize the data. So you need technologies like machine learning and AI and all those things. You need something which is like CDNs: just a mini version of the cloud that sits somewhere in between the Edge and the cloud. And this is our approach. And now, because we can basically shrink the mini cloud, the mini Amazon, into a way more dense footprint, this is a play that we're going to take. We have a very good partnership with Equinix, which has 170-something locations, with very good relations. >> So you're, essentially, going to disrupt the CDN. It's something that I've been writing about and tweeting about. CDNs were based on the old Yahoo days: caching images. You mentioned, give me 1999 back, please. That's old school by today's standards. So it's a whole new architecture, because of how things are stored. >> You have to be a lot more distributed. >> What is the architecture? >> In our innovation, we, actually, have three main innovations. One is on the lower layers of what we discussed. The other one is the security layer, where we classify everything: Layer 7, at 100-gig traffic rates. And the third one is all this notion of distributed systems. We can, actually, run multiple systems in multiple locations and manage them as one logical entity through high-level semantics, high-level policies. >> Okay, so when we take theCUBE global, we're going to have you guys on every pop. This is a legit question. >> No, it's going to take time for us. We're not going to do everything in one day, and we're starting with the local problems. >> Yeah, but this is digital transmission. Stay with me for a second. Stay with this scenario. So video, like Netflix, is pretty much one dimension; it's video. They use CDNs now, but when you start thinking in different content types, so I'm going to have a video with, maybe, just CGI overlaid, or social graph data coming in from tweets at the same time, with Instagram pictures, I might be accessing multiple data sources everywhere to watch a movie or something. That would require beyond-CDN thinking. >> And you have to run continuous analytics, because it cannot afford batch. It cannot afford a pipeline. Because you ingest picture data, you may need to add some subtext with the data and feed it, directly, to the consumer.
So you have to move to those two elements: moving more stuff into the Edge, and running continuous analytics versus a batch pipeline. >> So you think, based on that scenario I just said, that there's going to be an opportunity for somebody to take over the media landscape for sure? >> Yeah, and I think if you're also looking at the statistics: I've seen a nice article, I told George about it, analyzing the Intel chip distribution. What you see is that there is 30% growth in Intel chips going into cloud, which is faster than what most analysts anticipate in terms of cloud growth. That means, actually, that cloud is going to cannibalize Enterprise faster than most think. Enterprise is shrinking about 7%. There is another place which is growing: it's Telcos. It's not growing like cloud, but part of it is because of this move towards the Edge, and the move of Telcos buying white boxes. >> And 5G and access over the top too. >> Yeah, but that's server chips. >> Okay. >> There's going to be more and more computation in the different Telco locations. >> John: Oh, you're talking about compute, okay. >> This is an opportunity that we can capitalize on if we run fast enough. >> It sounds as though, because you've implemented these industry-standard APIs that come largely from the open source ecosystem, you can propagate those to areas on the network that the vendors who are behind those APIs can't, necessarily, reach: into the Telcos, towards the Edge. And, I assume, part of that is because of the density and the simplicity. So, essentially, your footprint's smaller in terms of hardware, and the operational simplicity is greater. Is that a fair assessment? >> Yes, and also, we support a lot of Amazon-compatible APIs, which are RESTful, typically HTTP-based, very convenient to work with in a cloud environment. Another thing is, because we're taking all the state on ourselves, the different forms of state, whether it's a message queue or a table or an object, et cetera, that makes the computation layer very simple. So one of the things that we are also demonstrating is the integration we have with Kubernetes that, basically, now simplifies Kubernetes. 'Cause you don't have to build all those different data services for cloud-native infrastructure. You just run Kubernetes. We're the volume driver, we're the database, we're the message queues; we're everything underneath Kubernetes, and then you just run Spark or TensorFlow or a serverless function as a Kubernetes microservice. That allows you now, elastically, to increase the number of Spark jobs that you need, or, maybe, you have another tenant; you just spin up a Spark job. YARN has some of those attributes, but YARN is very limited, very confined to the Hadoop ecosystem. TensorFlow is not a Hadoop player, and a bunch of those new tools are not Hadoop players, and everyone is now adopting a new way of doing streaming, and they just call it serverless. Serverless and streaming are very similar technologies. The advantage of serverless is all this pre-packaging and all this automation of CI/CD, the continuous integration, the continuous delivery. So we're thinking, in order to simplify the developer and operations aspects, we're trying to integrate more and more with a cloud-native approach around CI/CD and integration with Kubernetes and cloud-native technologies.
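Not iguazio's actual SDK, just a sketch of the multi-model idea Yaron describes: the same platform exposes data through an Amazon S3-compatible object API while serving it to Spark as a queryable table, with no pipeline in between. The endpoint, bucket, credentials, and paths below are invented.

```python
# Ingest through a hypothetical S3-compatible endpoint using boto3...
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://data-platform.example.com",
    aws_access_key_id="KEY",
    aws_secret_access_key="SECRET",
)
with open("frame.jpg", "rb") as f:
    s3.put_object(Bucket="surveillance", Key="cam1/frame-0001.jpg", Body=f)

# ...while the same platform serves the extracted metadata to Spark
# as a table that can be queried immediately.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("faces").getOrCreate()
frames = spark.read.parquet("s3a://surveillance/metadata/")  # hypothetical path
frames.createOrReplaceTempView("frames")
spark.sql(
    "SELECT camera, person_id FROM frames WHERE person_id IS NOT NULL"
).show()
```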
>> Would it be fair to say that, from a developer or admin point of view, you're pushing out from the cloud towards the Edge faster than the existing implementations, say, the Apache Ecosystem or the AWS Ecosystem, where AWS has something on the edge. I forgot whether it's Snowball or Greengrass or whatever. Where they at least get the lambda function. >> They're fielding it, by the way, and it's interesting to see. One of the things is they allowed lambda functions in their CDN, which is going in the direction I mentioned, just with minimal functionality. Another thing is they have those boxes where they have a single VM and they can run a lambda function as well. But I think their ability to run computation is very limited and also, their focus is on shipping the boxes through the mail, and we want it to be always connected. >> Our final question for you, just to get your thoughts. Great setup, by the way. This is very informative. Maybe we should do a follow up on Skype in our studio for the Silicon Friday show. Google Next was interesting. They're serious about the Enterprise but you can see that they're not yet there. What is the Enterprise readiness from your perspective? Cause Google has the tech and they try to flaunt the tech. We're great, we're Google, look at us, therefore, you should buy us. It's not that easy in the Enterprise. How would you size up the different players? Because they're all not like Amazon, although Amazon is winning. You got Amazon, Azure and Google. Your thoughts on the cloud players. >> The way we attack Enterprise, we don't attack it from an Enterprise perspective or IT perspective, we take it from a business use case perspective. Especially because we're small and we have to run fast. You need to identify a real critical business problem. We're working with stock exchanges and they have a lot of issues around monitoring the daily trade activities in real time. If you compare what we do with them on this continuous analytics notion to how they work with Excels and Hadoops, it's totally different and now, they can do things which are way different. I think that if Google wants to succeed against Amazon, they have to find a way to approach those business owners and say here's a problem, Mr. Customer, here's a business challenge, here's what I'm going to solve. If they're just going to say, you know what? My VMs are cheaper than Amazon's, it's not going to be a-- >> Also, they're doing the whole, they're calling it lift and shift, which is code word for rip and replace in the Enterprise. So that's, essentially, I guess, a good opportunity if you can get people to do that, but not everyone's ripping and replacing and lifting and shifting. >> But Google has a lot of advantages around areas of AI and things like that, so they should try and leverage those. If you think about Amazon's approach to AI, it's fund the university to build a project and then say it's ours, where Google created TensorFlow and created a lot of other IP, Dataflow and all those solutions, and contributed it to the community. I really love Google's approach of contributing Kubernetes, of contributing TensorFlow. And this way, they're planting the seeds so the new generation that is going to work with Kubernetes and TensorFlow are going to say, "You know what? Why would I mess with this thing on (mumbles), just go and-- >> Regular cloud, do multi-cloud. >> Right to the cloud. But I think a lot of criticism about Google is that they're too research oriented.
They don't know how to monetize and approach the-- >> Enterprise is just a whole different drum beat, and I think that's the only complaint I have with them, they got to get that knowledge and/or buy companies. Give us a quick final point on Spanner, or any analysis of Spanner, which went pretty quickly from paper to product. >> So before we started iguazio, I studied Spanner quite a bit. All the publications were there, on Spanner and all the other things like it. Spanner has an underlying layer called Colossus. And our data layer is very similar to how Colossus works. So we're very familiar. We took a lot of concepts from Spanner into our platform. >> And you like Spanner, it's legit? >> Yes, again. >> Cause you copied it. (laughs) >> Yaron: We haven't copied-- >> You borrowed some best practices. >> I think I cited about 300 research papers before we did the architecture. But we, basically, took the best of each one of them. Cause there's still a lot of issues. Most of those technologies, by the way, are designed for mechanical disks and we can talk about it in a different-- >> And you have Flash. Alright, Yaron, we have gone over time here. Great segment. We're here, live in Silicon Valley, breaking it down, getting under the hood. Looking at 10X, 100X performance advantages. Keep an eye on iguazio, they're looking like they got some great products. Check them out. This is the CUBE. I'm John Furrier with George Gilbert. We'll be back with more after this short break. (upbeat synthesizer music)

Published Date : Mar 14 2017

Tendü Yogurtçu | BigData SV 2017


 

>> Announcer: Live from San Jose, California. It's The Cube, covering Big Data Silicon Valley 2017. (upbeat electronic music) >> California, Silicon Valley, at the heart of the big data world, this is The Cube's coverage of Big Data Silicon Valley in conjunction with Strata Hadoop. Well of course we've been here for multiple years, covering Hadoop World for now our eighth year, now it's Strata Hadoop, but we do our own event, Big Data SV here in Silicon Valley, and Big Data NYC in New York City. I'm John Furrier, my cohost George Gilbert, analyst at Wikibon. Our next guest is Tendü Yogurtçu with Syncsort, general manager of big data, did I get that right? >> Yes, you got it right. It's always a pleasure to be at The Cube. >> (laughs) I love your name. That's so hard for me to get, but I think I was close enough there. Welcome back. >> Thank you. >> Great to see you. You know, one of the things I'm excited about with Syncsort is we've been following you guys, we talk to you guys every year, and it just seems that every year, more and more announcements happen. You guys are unstoppable. You're like what Amazon does, just more and more announcements, but the theme seems to be integration. Give us the latest update. You had an update, you bought Trillium, you got a big deal with Hortonworks, you got integrated with Spark, you got big news here, what's the news here this year? >> Sure. Thank you for having me. Yes, it's very exciting times at Syncsort and I probably say that every time I appear, because every time it's more exciting than the previous, which is great. We bought Trillium Software, and Trillium Software has been leading in data quality for over a decade in many of the enterprises. It's very complementary to our data integration, data management portfolio because we are helping our customers to access all of their enterprise data, not just the new emerging sources in connected devices and mobile and streaming, but also leveraging reference data, the mainframe legacy systems and the legacy enterprise data warehouse. While we are doing that, accessing data, the data lake is now actually, in some cases, turning into a data swamp. That was a term Dave Vellante used a couple of years back in one of the crowd chats and it's becoming real. So, data-- >> Real being the data swamps, data lakes are turning into swamps because they're not being leveraged properly? >> Exactly, exactly. Because it's also about having access to the right data, and data quality is very complementary, because Trillium has delivered trusted, right data to enterprise customers in the traditional environments, so now we are looking forward to bringing that enterprise trust of data quality into the data lake. In terms of data integration, data integration has always been very critical to any organization. It's even more critical now that data gravity is shifting and the amount of data organizations have is growing. What we have been delivering in very large enterprise production environments for the last three years, we are hearing our competitors make announcements in those areas only very recently, which is a validation, because we are already running in very large production environments.
We are offering value by saying "Create your applications for integrating your data," whether it's originating in the cloud or originating on the mainframes, whether it's on the legacy data warehouse, and you can deploy the same exact application, without any recompilation, without any changes, on your standalone Windows laptop, or in Hadoop MapReduce, or Spark in the cloud. So this design once and deploy anywhere is becoming more and more critical, with data originating in many different places, and cloud is definitely one of them. Our data warehouse optimization solution with Hortonworks and AtScale is a special package to accelerate this adoption. It's basically helping organizations to offload the workload from the existing Teradata or Netezza data warehouse and deploy it in Hadoop. We provide a single button to automatically map the metadata, create the metadata in Hive on Hadoop, and also make the data accessible in the new environment, and AtScale provides fast BI on top of that. >> Wow, that's amazing. I want to ask you a question, because this is a theme, so I just did a tweetup just now while you were talking, saying the theme this year is cleaning up the data lakes, or data swamps, AKA data lakes. The other theme is integration. Can you just lay out your premise on how enterprises should be looking at integration now? Because it's the multi-vendor world, it's the multi-cloud world, the multi-data-type-and-source-with-metadata world. How do you advise customers that have this plethora of action coming at them? IOT, you've got cloud, you've got big data, I've got Hadoop here, I got Spark over here, what's the integration formula? >> First thing is identify your business use cases. What's your business challenge, what are your business goals, because that should be the real driver. We've seen in some organizations, they start with the intention "we would like to create a data lake" without having a very clear understanding of, what is it that I'm trying to solve with this data lake? Data as a service is really becoming a theme across multiple organizations, whether it's on the enterprise side or at some of the online retail organizations, for example. As part of that data as a service, organizations really need to adopt tools that are going to enable them to take advantage of the technology stack. The technology stack is evolving very rapidly. The skill sets are rare, and skill sets are rare because you need to keep making adjustments. Am I hiring Ph.D students who can program Scala in the most optimized way, or should I hire Java developers, or should I hire Python developers? The names of the tools in the stack, Spark 1 versus Spark 2 APIs, change. It's really evolving very rapidly. >> It's hard to find Scala developers, I mean, once you go outside Silicon Valley. >> Exactly. So as an organization, our advice is that you really need to find tools that are going to fit those business use cases and provide a single software environment. That data integration might be happening on premise now, with some of the legacy enterprise data warehouse, and it might happen in a hybrid, on premise and cloud environment in the near future, and perhaps completely in the cloud. >> So standard tools, tools that have some standard software behind it, so you don't get stuck in the personnel hiring problem. Some unique domain expertise that's hard to hire.
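As a rough illustration of the "design once and deploy anywhere" idea, here is a sketch in plain PySpark: the business logic is written once and the execution target, a laptop or a cluster, is chosen purely by configuration. This is only an analogy under stated assumptions, not Syncsort's product, which applies the principle across engines like MapReduce and Spark without recompilation; the file path and column name here are hypothetical.

```python
# Illustrative only: the same transformation logic targeted at a laptop or a
# cluster purely by configuration. File path and column name are hypothetical.
import sys
from pyspark.sql import SparkSession

def build_pipeline(spark: SparkSession, path: str):
    """The business logic is defined once, independent of where it runs."""
    df = spark.read.option("header", True).csv(path)
    return df.groupBy("region").count()

if __name__ == "__main__":
    # "local[*]" for a standalone laptop; e.g. "yarn" for a Hadoop cluster.
    master = sys.argv[1] if len(sys.argv) > 1 else "local[*]"
    spark = (SparkSession.builder
             .master(master)
             .appName("design-once")
             .getOrCreate())
    build_pipeline(spark, "sales.csv").show()
    spark.stop()
```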
>> Yes, skill set is one problem. The second problem is the fact that the applications need to be recompiled because the stack is evolving and the APIs are not compatible with the previous version, so there's the maintenance cost to keep up with things, to be able to catch up with the new versions of the stack. That's another area where the tools really help, because you want to be able to develop the application once and deploy it anywhere, on any compute platform. >> So Tendü, if I hear you properly, what you're saying is integration sounds great on paper, it's important, but there are some hidden costs there, and that is the skill set and then there's the stack recompiling, if I'm hearing you right. Okay, that's awesome. >> The tools help with that. >> Take a step back and zoom out and talk about Syncsort's positioning, because you guys have been changing with the stacks as well, I mean you guys have been doing very well with the announcements, you've been just coming on the market all the time. What is the current value proposition for Syncsort today? >> The current value proposition is really that we help organizations to create the next generation modern data architecture by accessing and liberating all enterprise data and delivering that data at the right time and with the right quality. It's liberate, integrate, with integrity. That's our value proposition. How do we do that? We provide that single software environment. You can have batch legacy data and streaming data sources integrated in the same exact environment, and it enables you to adapt to Spark 2 or Flink or whichever compute framework is going to help. That has been our value proposition and it is proven in many production deployments. >> What's interesting too is the way you guys have approached the market. You've locked down the legacy, so you have, we talk about the mainframe and well beyond that now, you guys have and understand the legacy, so you kind of lock that down, protect it, make it secure, security-wise, but you also make sure it works, because the data is still there, because legacy systems are really critical in the hybrid. >> Mainframe expertise and heritage that we have is a critical part of our offering. We will continue to focus on innovation on the mainframe side as well as on the distributed side. One of the announcements that we made since our last conversation was our partnership with Compuware: we now bring in more data types about application failures, like abend data, to Splunk for operational intelligence. We will continue to also support more delivery types. We have batch delivery, we have streaming delivery, and replication into Hadoop has been a challenge, so our focus is now replication from DB2 on mainframe and VSAM on mainframe to Hadoop environments. That's what we will continue to focus on, mainframe, because we have heritage there and it's also part of the big enterprise data lake. You cannot make sense of the customer data that you are getting from mobile if you don't reference the critical data sets that are on the mainframe. With the Trillium acquisition, it's very exciting because now we are at a kind of pivotal point in the market, where we can bring the superior data validation, cleansing, and matching capabilities we have to the big data environments. One of the things-- >> So when you get in low latency, you guys do the whole low latency thing too? You bring it in fast?
>> Yes, we bring it, that's our current value proposition, and as we are accessing this data and integrating it as part of the data lake, now we have capabilities with Trillium where we can profile that data, get statistics, and start using machine learning to automate the data steward's job. Data stewards are still spending 75% of their time trying to clean the data. So if we can-- >> Lots of manual labor there, and modeling too, by the way, the modeling and the cleaning, cleaning and modeling kind of go hand in hand. >> Exactly. If we can automate any of these steps to derive the business rules automatically and provide the right data on the data lake, that would be very valuable. This is what we are hearing from our customers as well. >> We've heard for probably five years about the data lake as the center of gravity of big data, but we're hearing at least a bifurcation, maybe more, where now we want to take that data and apply it, operationalize it in making decisions with machine learning, predictive analytics, but at the same time we're trying to square this strange circle of data, the data lake where you didn't say up front what you wanted it to look like but now we want ever richer metadata to make sense out of it, a layer that you're putting on it, the data prep layer, and others are trying to put different metadata on top of it. What do you see that metadata layer looking like over the next three to five years? >> Governance is a very key topic, and for those organizations who are ahead of the game in big data and who already established that data lake, data governance and even analytics governance becomes important. What we are delivering here with Trillium, we will have generally available by end of Q1. We are basically bringing business rules to the data. Instead of bringing data to the business rules, we are taking the business rules and deploying them where the data exists. That will be key because of the data gravity you mentioned, because the data might be in the Hadoop environment, it might be in, like I said, an enterprise data warehouse, and it might be originating in the cloud, and you don't want to move the data to the business rules. You want to move the business rules to where the data exists. Cloud is an area that we see more and more of our customers moving toward. The two main use cases around our integration are, one, the data originating in cloud, and the second one, archiving data to cloud. We announced, actually, tighter integration with Cloudera Director earlier this week for this event. We have been in cloud deployments, we have actually been offering on Elastic MapReduce already and on EC2 for a couple of years now, and also on Google cloud storage, but this announcement is primarily making deployments even easier by leveraging the director's elasticity for growing and shrinking the deployment. Now our integration jobs will also take advantage of that elasticity. >> Tendü, it's great to have you on The Cube because you have an engineering mind but you're also now general manager of the business, and your business is changing.
You're in the center of the action, so I want to get your expertise and insight into the enterprise readiness concept. We saw last week at Google Cloud 2017, you know, Google going down the path of being enterprise ready, or taking steps. I don't think they're fully ready, but they're certainly serious about the cloud in the enterprise, and that's clear from Diane Greene, who knows the enterprise. It sparked the conversation last week around what enterprise readiness means for cloud players, because there are so many details in between the lines, if you will, of what products are, the integration, certification, SLAs. What's your take on the notion of cloud readiness? Vis-a-vis Google and others that are bringing cloud compute, a lot of resources, with an IOT market that's now booming, big data evolving very, very fast, lots of realtime, lots of analytics, lots of innovation happening. What does the enterprise picture look like from a readiness standpoint? How do these guys get ready? >> From a big picture view, for the enterprise there are a couple of things that cannot be afterthoughts: security, metadata lineage as part of data governance, and being able to have flexibility in the architecture, so that they will not be recreating the jobs that they might have already deployed in on premise environments, right? To be able to have the same application running from on premise to cloud will be critical because it gives flexibility for adaptation in the enterprise. An enterprise may have some MapReduce jobs running on premise with Spark jobs in the cloud, because they are really doing some predictive analytics, graph analytics on those. They want to be able to have that flexible architecture, where we hear this concept of a hybrid environment. You don't want to be deploying a completely different product in the cloud and redoing your jobs. That flexibility of architecture, flexibility-- >> So having different code bases in the cloud versus on prem requires two jobs to do the same thing. >> Two jobs for maintaining, two jobs for standardizing, and two different skill sets of people, potentially. So security, governance, and being able to access data easily and have applications move between environments will be very critical. >> So seamless integration between clouds and on prem first, and then potentially multi-cloud. That's table stakes in your mind. >> They are absolutely table stakes. A lot of vendors are trying to focus on that, definitely Hadoop vendors are also focusing on that. Also, one of the things, like when people talk about governance, the requirements are changing. We have been talking about single view and customer 360 for a while now, right? Do we have it right yet? The enrichment is becoming key. With Trillium we made the recent announcement around precise enriching. It's not just the address that you want to deliver and make sure is correct, it's also the email address, and the phone number, is it a mobile number, is it a landline? It's enriched data sets that we have to be really dealing with, and there's a lot of opportunity, and we are really excited because data quality, discovery and integration are coming together and we have a good-- >> Well Tendü, thank you for joining us, and congratulations as Syncsort broadens its scope to being a modern data platform solution provider for companies, congratulations. >> Thank you. >> Thanks for coming. >> Thank you for having me.
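To ground the profiling and rule-automation idea Tendü describes, here is a toy sketch in pandas: profile a dataset, derive simple validation rules from the profile, and flag the records that fail, so a data steward reviews exceptions instead of cleaning everything by hand. The columns, rules, and thresholds are hypothetical illustrations, not Trillium's actual engine, which does far richer matching and cleansing.

```python
# Toy sketch of rule-driven data quality: profile, derive rules, flag exceptions.
# Columns, rules, and thresholds are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "email": ["user%d@example.com" % i for i in range(99)] + ["bad-address"],
    "amount": np.append(rng.normal(100.0, 10.0, 99), 10_000.0),  # injected outlier
})

# Profile the data: the statistics a steward would otherwise gather by hand.
mean, std = df["amount"].mean(), df["amount"].std()
print({"rows": len(df), "amount_mean": round(mean, 2), "amount_std": round(std, 2)})

# Rules derived from the profile: syntactic validity plus a 3-sigma bound.
email_ok = df["email"].fillna("").str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
amount_ok = df["amount"] <= mean + 3 * std

# Route only the exceptions to a human, instead of hand-checking every row.
df["needs_review"] = ~(email_ok & amount_ok)
print(df[df["needs_review"]])
```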
>> This is The Cube here live in Silicon Valley and San Jose, I'm John Furrier, George Gilbert, you're watching our coverage of Big Data Silicon Valley in conjunction with Strata Hadoop. This is Silicon Angles, The Cube, we'll be right back with more live coverage. We've got two days of wall to wall coverage with experts and pros talking about big data, the transformations here inside The Cube. We'll be right back. (upbeat electronic music)

Published Date : Mar 14 2017
