Basil Faruqui, BMC Software - BigData SV 2017 - #BigDataSV - #theCUBE
(upbeat music) >> Announcer: Live from San Jose, California, it's theCUBE covering Big Data Silicon Valley 2017. >> Welcome back everyone. We are here live in Silicon Valley for theCUBE's Big Data coverage. Our event, Big Data Silicon Valley, also called Big Data SV, is a companion event to our Big Data NYC event, where we have our unique program in conjunction with Strata Hadoop. I'm John Furrier with George Gilbert, our Wikibon big data analyst. And we have Basil Faruqui, who is the Solutions Marketing Manager at BMC Software. Welcome to theCUBE. >> Thank you, great to be here. >> We've been hearing a lot on theCUBE about schedulers and automation, and machine learning is the hottest trend happening in big data. We're thinking that this is going to help move the needle on some things. Your thoughts on this, on the world we're living in right now, and what BMC is doing at the show. >> Absolutely. So, scheduling and workflow automation is absolutely critical to the success of big data projects. This is not something new. Hadoop is only about 10 years old, but other technologies that came before Hadoop have relied on this foundation for driving success. If we look at the Hadoop world, what gets all the press is the real-time stuff, but what powers all of that underneath is a very important layer of batch. If you think about some of the most common use cases for big data, think of a bank: they're talking about fraud detection and things like that. Let's just take the fraud detection example. Detecting an anomaly in how somebody is spending: if somebody's credit card is used in a way that doesn't match their spending habits, the bank detects that and will maybe close the card down or contact somebody. But everything that happened before that point happened in batch mode: collecting the history of how that card has been used, matching it with how all the other cardholders use their cards, and learning what the patterns look like when cards are stolen. All of that is powered by what's today known as workload automation. In the past, it's been known by names such as job scheduling and batch processing. >> In the systems business, everyone knows about schedulers, compilers, all this computer science stuff. But this is interesting. Now that the data lake has become so swampy, and people call it the data swamp, people are looking at moving data out of data lakes into real time, as you mention, but it requires management. So, there's a lot of coordination going on. This seems to be where most enterprises are now focusing their attention: making that data available. >> Absolutely. >> Hence the notion of scheduling and workloads. Because their use cases are different. Am I getting it right? >> Yeah, absolutely. And if we look at what companies are doing, for every CEO and in every boardroom, there's a charter for digital transformation. It's no longer about taking one or two use cases around big data and driving success. Data and intelligence is now at the center of everything a company does, whether it's building new customer engagement models, building new ecosystems with their partners and suppliers, or back-office optimization. So, when CIOs and data architects think about having to build a system like that, they are faced with a number of challenges. It has to become enterprise-ready. It has to take into account governance, security, and others.
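To make the batch-versus-speed split concrete, here is a minimal Python sketch of the pattern just described. It is not BMC's implementation; the field names and the three-sigma threshold are invented assumptions. A batch job condenses each card's history into a profile, and a cheap per-transaction check compares a new charge against it.

```python
from statistics import mean, stdev

# Historical transactions per card, e.g. refreshed by a nightly batch job.
history = {"card-123": [42.0, 55.0, 38.0, 61.0, 47.0]}

def build_profiles(history):
    """Batch layer: summarize each card's spending pattern."""
    return {card: {"mean": mean(a), "stdev": stdev(a)}
            for card, a in history.items()}

def is_anomalous(profiles, card, amount, sigmas=3.0):
    """Speed layer: flag a charge far outside the batch-built profile."""
    p = profiles.get(card)
    if p is None or p["stdev"] == 0:
        return False  # no usable history yet
    return abs(amount - p["mean"]) / p["stdev"] > sigmas

profiles = build_profiles(history)                # runs on a schedule
print(is_anomalous(profiles, "card-123", 950.0))  # True: far from habit
```

The expensive history crunching runs on a schedule; the per-transaction check stays lightweight.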
If you peel the onion just a little bit, what architects and CIOs are faced with is, okay, you've got a web of complex technologies: legacy applications and modern applications that hold a lot of the corporate data today. And then you have new sources of data like social media, devices, and sensors, which have a tendency to produce a lot more data. First things first, you've got an ecosystem like Hadoop, which is supposed to be the nerve center of the new digital platform, and you've got to start ingesting all this data into Hadoop. This has to happen in an automated fashion for it to be scalable. >> But this is the combination of streaming and batch. >> Correct. >> Now this seems to be the management holy grail right now. Nailing those two. Did I get that? >> Absolutely. So, people talk about, in technical terms, the speed layer and the batch layer. And both have to converge to deliver the intelligence and insight that the business users are looking for. >> Would it be fair to say it's not just the convergence of the speed layer and batch layer in Hadoop, but what BMC brings to town is the non-Hadoop parts of those workloads? Whether it's batch outside Hadoop, or streaming, which pre-Hadoop was more niche. We need this overarching control, even when it's not a Hadoop-centric architecture. >> Absolutely. So, I've said this for a long time: Hadoop is never going to live on an island on its own in the enterprise. And with the maturation of the market, Hadoop has to play with the other technologies in the stack. Just take data ingestion as an example: you've got ERPs, you've got CRMs, you've got middleware, you've got data warehouses, and you have to ingest a lot of that in. Where Control-M brings a lot of value and speeds up time to market is that we have out-of-the-box integrations with a lot of the systems that already exist in the enterprise, such as ERP solutions and others. Virtually any application that can expose itself through an API or a web service, Control-M has the ability to automate that ingestion piece. But this is only step one of the journey. So, you've brought all this data into Hadoop, and now you've got to process it. The number of tools available for processing this is growing at an unprecedented rate. You know, MapReduce was the hot thing just two years ago, and now Spark has taken over. So with Control-M, about four years ago we started building very deep native capabilities in this new ecosystem. You've got ingestion that's automated, then you can seamlessly automate the actual processing of the data using things like Spark, Hive, Pig, and others. And the last mile of the journey, the most important one, is making this refined data available to systems and users that can analyze it. Often Hadoop is not the repository that analytic systems sit on top of; it's another layer all of this has to be moved to. So, if you zoom out and take a look at it, this is a monumental task. And if you use a siloed approach to automating this, it becomes unscalable. And that's where a lot of the Hadoop projects often >> Crash and burn. >> Crash and burn, yes, sustainability. >> Let's just say it, they crash and burn. >> So, Control-M has been around for 30 years. >> By the way, just to add to the crash-and-burn piece, the data lake gets stalled there, that's why the swamp happens, because they're like, now how do I operationalize this and scale it out?
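The ingest, process, publish journey Faruqui walks through is, at bottom, a dependency-ordered workflow. A toy Python sketch of that scheduling pattern follows; it is a schematic of what a workload automation layer manages, not Control-M's actual interface, and the task names are made up.

```python
# Each task lists the tasks it depends on; the runner executes everything
# in dependency order, the way a scheduler sequences ingest -> process -> publish.
tasks = {
    "ingest_erp":    {"deps": [], "run": lambda: print("pull ERP extracts")},
    "ingest_stream": {"deps": [], "run": lambda: print("land sensor feed")},
    "process":       {"deps": ["ingest_erp", "ingest_stream"],
                      "run": lambda: print("run Spark/Hive jobs")},
    "publish":       {"deps": ["process"],
                      "run": lambda: print("move refined data to analytics")},
}

def run_workflow(tasks):
    done = set()
    while len(done) < len(tasks):
        # Find every task whose dependencies are all satisfied.
        ready = [n for n, t in tasks.items()
                 if n not in done and all(d in done for d in t["deps"])]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        for name in ready:
            tasks[name]["run"]()
            done.add(name)

run_workflow(tasks)
```

A production scheduler layers calendars, retries, monitoring, and SLAs on top of this core ordering problem.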
>> Right, if you're storing a lot of data and not making it available for processing and analysis, then it's of no use. And that's exactly our value proposition. This is not a problem we're solving for the first time. We've done this as we have seen these waves of automation come through: from the mainframe days, when it was called batch processing, it evolved into distributed client-server, when it was known more as job scheduling. And now. >> So BMC has seen this movie before. >> Absolutely. >> Alright, so let's take a step back. Zoom out, step back, go hang out in the big trees, look down on the market. Big data practitioners out there right now are wrestling with this issue. You've got streaming, real-time stuff, you've got batch, it's all coming together. What is Control-M doing great right now with practitioners that you guys are solving? Because there are a zillion tools out there, but people are human. Every hammer looks for a nail. >> Sure. >> So, you have a lot of change happening at the same time, and yet all these tools. What is Control-M doing to really win? Where are you guys winning? >> Where we are adding a lot of value for our customers is helping them speed up the time to market in delivering these big data projects, and delivering them at scale and with quality. >> Give me an example of a project. >> Malwarebytes is a Silicon Valley-based company. They are using this to ingest and analyze data from thousands of endpoints from their end users. >> That's their Lambda architecture, right? >> It's a Lambda architecture; I won't steal their thunder, they're presenting tomorrow at eleven. >> Okay. >> Eleven-thirty tomorrow. Another example is a company called Navistar. Now here's a company that's been around for 200 years. They manufacture heavy-duty trucks, 18-wheelers, school buses. And they recently came up with a service called OnCommand. They have a fleet of 160,000 trucks that are fitted with sensors. They're sending telematics data back to their data centers, and in between it stops in the cloud. So it gets to the cloud. >> So they're going up to the cloud for upload and backhaul, basically, right? >> Correct. So, it goes to the cloud. From there it is ingested into their Hadoop systems. And they're looking for trends to make sure none of the trucks break down, because a truck that's carrying freight breaking down hits the bottom line right away. But that's not where they're stopping. In real time they can triangulate the position of the truck and figure out where the nearest dealership is. Do they have the parts? When to schedule the service? But, if you think about it, the warranty information and the parts information are not sitting in Hadoop. They're sitting in their mainframes, SAP systems, and others. And Control-M is orchestrating this across the board, from mainframe to ERP and into Hadoop, for them to be able to marry all this data together. >> How do you get back into the legacy? That's because you have the experience there? Is that part of the product portfolio? >> That is absolutely a part of the product portfolio. We started our journey back in the mainframe days, and as the world has evolved, to client-server, to web, and now to mobile and virtualized and software-defined infrastructures, we have kept pace with that. >> You guys have a nice end-to-end view right now going on. And certainly that example with the trucks highlights IOT right there. >> Exactly. >> You have a clear line of sight on IOT? >> Yup.
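The Navistar flow, joining a live sensor reading against warranty and parts data that live outside Hadoop, can be sketched in a few lines of Python. Everything here is hypothetical: the field names, the fixed temperature limit, and the stand-in dealer lookup.

```python
# Stream side: one telematics reading from a truck's sensors.
reading = {"truck": "NV-4821", "engine_temp_c": 121, "lat": 37.33, "lon": -121.89}

# Reference side: data that sits in mainframe/ERP systems, not Hadoop.
warranty = {"NV-4821": {"covered": True}}
dealer_stock = {"San Jose": {"coolant_pump": 3}}

def service_decision(reading, warranty, dealer_stock, temp_limit=115):
    """Join an over-temperature alert with warranty and parts data."""
    if reading["engine_temp_c"] <= temp_limit:
        return None  # nothing to do
    nearest = "San Jose"  # stand-in for a real geo lookup on lat/lon
    return {
        "truck": reading["truck"],
        "dealer": nearest,
        "part_in_stock": dealer_stock.get(nearest, {}).get("coolant_pump", 0) > 0,
        "under_warranty": warranty.get(reading["truck"], {}).get("covered", False),
    }

print(service_decision(reading, warranty, dealer_stock))
```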
>> That would be the best measure of your maturity: the breadth of your integrations. >> Absolutely. And we don't stop at what we provide out of the box. We realize that we have 30 to 35 out-of-the-box integrations, but there are a lot more applications than that. We have architected Control-M in a way where it can automate data loads on any application and any database that can expose itself through an API. That is huge, because if you think about the open-source world, by the time this conference is over, there are going to be a dozen new tools and projects that come online. And that's a big challenge for companies too. How do you keep pace with this and how do you (drowned out) all this? >> Well, I think people are starting to squint past the fashion aspect of open source, which I love by the way, but it does create more diversity. But, you know, some things become fashionable and then get big-time trashed. Look at Spark. Spark was beautiful. That one came out of the woodwork. George, you're tracking all the fashion. What's the hottest thing right now in open source? >> It seems to me that we've spent five-plus years building data lakes, and now we're trying to take that data and apply the insights from it to applications. And really, Control-M's value add, my understanding is, is that we have to go beyond Hadoop, because Hadoop was an island, you know, an island or a data lake, but now the insights have to be enacted on applications that go outside that ecosystem. And that's where Control-M comes in. >> Yeah, absolutely. We are that overarching layer that helps you connect your legacy systems and modern systems and bring it all into Hadoop. The story I tell when I'm explaining this to somebody is that you've installed Hadoop on day one, great, guess what, it has no data in it. You've got to ingest data, and you have to be able to take a strategic approach to that. You can use some point solutions and do scripting for the first couple of use cases, but as soon as the business gives you the green light and says, you know what, we really like what we've seen, now let's scale up, that's where you really need to take a strategic approach, and that's where Control-M comes in. >> So, let me ask then, if the bleeding edge right now is trying to operationalize the machine learning models that people are beginning to experiment with, just the way they were experimenting with data lakes five years ago, what role can Control-M play today in helping people take a trained model and embed it in an application so it produces useful actions and recommendations, and how much custom integration does that take? >> If you peel the onion of machine learning, you've got data that needs to be moved, that needs to be constantly evaluated, and then the algorithms have to be run against it to provide the insights. So this is exactly what Control-M allows you to do: ingest the data, process the data, let the algorithms process it, and then of course move it to a layer where people and other systems, it's not just about people anymore, it's other systems that'll analyze the data. And the important piece here is that we're allowing you to do this from a single pane of glass, and to see this picture end to end. All of this work is being done to drive business results, generating new revenue models, like in the case of Navistar.
Allowing you to capture all of this and then tie it to business SLAs, that is one of the most highly rated capabilities of Control-M from our customers. >> This is the cloud equation we were talking about last week at Google Next. A combination of enterprise readiness across the board. The end-to-end is the picture, and you guys are in a good position. Congratulations, and thanks for coming on theCUBE. Really appreciate it. >> Absolutely, great to be here. >> It's theCUBE breaking it down here at Big Data World. This is the trend. It's an operating system world in the cloud. Big data with IOT, AI, machine learning. Big themes breaking out early on at Big Data SV in conjunction with Strata Hadoop. More right after this short break.
Oliver Chiu, Microsoft & Wei Wang, Hortonworks | BigData SV 2017
>> Narrator: Live from San Jose, California, it's the CUBE, covering Big Data Silicon Valley 2017. >> Okay welcome back everyone, live in Silicon Valley, this is the CUBE's coverage of Big Data Week, Big Data Silicon Valley, our event, in conjunction with Strata Hadoop. This is the CUBE with two days of wall-to-wall coverage. I'm John Furrier with analysts from Wikibon: George Gilbert, our big data analyst, as well as Peter Burris, covering all of the angles. And our next guest is Wei Wang, Senior Director of Product Marketing at Hortonworks, a CUBE alumni, and Oliver Chiu, Senior Product Marketing Manager for Big Data and Microsoft Cloud at Azure. Guys, welcome to the CUBE, good to see you again. >> Yes. >> John: On the CUBE, appreciate you coming on. >> Thank you very much. >> So Microsoft and Hortonworks, you guys are no strangers. We have covered you guys many times on the CUBE, on HD Insight. You have some stuff happening here, and I was just talking about you guys this morning on another segment, saying hey, you know the distros need a Cloud strategy. So you have something happening tomorrow. Blog post going out. >> Wei: Yep. >> What's the news with Microsoft? >> So essentially, we are truly adopting CloudFirst. And you know that we have been acquiring a lot of customers in the Cloud. We announced in our earnings that more than a quarter of our customers actually already have a Cloud strategy. I want to give out a few statistics that Gartner told us last year: end-user inquiries went up 57%, just on Hadoop and Microsoft Azure. So what we're here to talk about is the next generation. We're putting out our latest and greatest innovation, which comes in the package of the release of HDP 2.6, our latest release. I think our last conversation was on 2.5. So 2.6 has our latest and newest innovations, to put on CloudFirst, hence our partner here, Microsoft. We're going to put it on Microsoft HD Insight. >> That's super exciting. And, you know, Oliver, one of the things that we've been really fascinated with, and covering for multiple years now, is the transformation of Microsoft. Even prior to Satya, who's a CUBE alumni by the way; he's been on the CUBE, when we were at the XL event at Stanford. So, CEO of Microsoft, CUBE alumni, good to have that. But, it's interesting, right? I mean, the Open Compute Project. They donated a boatload of IP into open source. Heavily now open source; Brendan Burns works for Microsoft. We're seeing a huge transformation of Microsoft. You've been working with Hortonworks for a while. Now, it's kind of coming together, and one of the things that's interesting, a trend that keeps teasing out on the CUBE, is integration. We're seeing this flash point where, okay, I've got some Hadoop, I've got a bunch of other stuff in the enterprise equation that's kind of coming together. And you know, things like IOT and AI are all around the corner as well. How are you guys getting this all packaged together? 'Cause this kind of highlights some of the things that are now integrated in with the tools you have. Give us an update. >> Yeah, absolutely. So for sure, just to kind of respond to the trend, Microsoft made that transformation of being CloudFirst many years ago. And it's been great to partner with someone like Hortonworks, actually for the last four years, bringing HD Insight as a first-party Microsoft Cloud service.
And because of that, as we're building out other Cloud services in Azure, we have over 60 services. Think about that. That's 60 PaaS and IaaS services in Microsoft, part of the Azure ecosystem. All of this is starting to get completely integrated with all of our other services. So HD Insight, as an example, is integrated with all of our relational investments, our BI investments, our machine learning investments, our data science investments. And so, it's really just becoming part of the fabric of the Azure Cloud. And that's a testament to the great partnership that we're having with Hortonworks. >> So the inquiry comment from Gartner, and we're seeing similar things on the Wikibon site from our research team, is that there's now legitimacy in seeing how Hadoop fits into the bigger picture, not just Hadoop being the pure-play big data platform, which many people were doing. But now they're seeing a bigger picture where I can have Hadoop, and I can have some other stuff, all integrating. Is that kind of where this is going from you guys' perspective? >> So yeah, again, from a TechValidate survey we have done, our customers are telling us that 43% of the respondents are actually using that integrated approach, the hybrid. They're using the Cloud, and they're using our stuff on-premise, to provide integrated end-to-end processing workloads. A couple of years ago, people probably thought a bit about what kind of data they want to put in the Cloud, what kind of workload they want to actually execute in the Cloud versus on their own premises. What we see is that line starting to blur a little bit. And given the partnership we have with Microsoft, and the enterprise-ready functionalities, which we talked about at length last time I was here, security, governance, an integrated layer to manage the entire thing, either on-premise or in the Cloud, I think those are some of the innovations that make people a lot more at ease with the idea of putting entire mission-critical applications in the Cloud. And I want to mention, especially with our blog going out tomorrow, that we will actually announce Spark 2.1, and that in Microsoft Azure HD Insight we're actually going to guarantee a 99.9% SLA. For enterprise customers, that is truly an assurance, so that people not only feel at ease about their data and where it's going to be located, either in the Cloud or within their data center, but also about the speed, response, and reliability. >> Oliver, I want to queue off something you said which was interesting: that you have 60 services, and that they're increasingly integrated with each other. The idea is that Hadoop itself is made up of many projects or services, and I think in some amount of time, we won't look at it as a discrete project or product, but something that, integrated together, makes a pipeline, a mix-and-match. I'm curious if you can share with us a vision of how you see Hadoop fitting in with a richer set of Microsoft services, whether it might be SQL Server or streaming analytics, what that looks like, and so the issue of a sort of mix-and-match toolkit fades into a more seamless set of services. >> Yeah, absolutely.
And you're right, and Wei will definitely reiterate this: Hadoop is a platform, and certainly there are multiple different workloads and projects on that platform that do a lot of different things. You have Spark, which can do machine learning, streaming, and SQL-like queries, and you have Hadoop itself, which can do batch, interactive, and streaming as well. So you see a lot of workloads being built on open-source Hadoop. And as you bring it to the Cloud, what we found, in kind of this new Microsoft that is often talked about, is that it's all about choice and flexibility for our customers. Some customers want to be 100% open-source Apache Hadoop, and if they want that, HD Insight is the right offering, and what we can do is surround it with other capabilities that are outside of core Hadoop-type capabilities. Like if you want media services, all the way down to other technologies not related specifically to data and analytics. And so they can combine that with the Hadoop offering and blend it into a combined offering. And there are some customers that will blend open-source Hadoop with some of our Azure data services as well, because they offer something unique or different. But it's really a choice for our customers, whatever they're open to, whatever the strategy is for their organization. >> Is there, just to kind of compare it with other philosophies, do you see that notion that Hadoop now becomes a set of services that might or might not be mixed and matched with native services? Is that different from how Amazon or Google, you know, how you perceive them to be integrating Hadoop into their sort of pipelines and services? >> Yeah, it's different, because I see Amazon and Google, like, for instance, Google, starting to change their philosophy a little bit with the introduction of Dataproc. Before, you could see them as an organization that was really focused on bringing some of the internal learnings of Google into the marketplace with their own, you could say proprietary-type, services, with some of the offerings that they have. But now they're realizing the value that the Apache Hadoop ecosystem brings. And so, with that comes the introduction of their own managed service. And for AWS, IaaS is, so to speak, the root of their Cloud, and they're starting to bring in other systems, very similar, I would say, to Microsoft's strategy. For us, we are all about making things enterprise-ready. So that's the unique differentiator, and kind of what you alluded to. And so for Microsoft, all of our data services are backed by a 99.9% service-level agreement, including our relationship with Hortonworks. So that's kind of one, >> Just say that again, one more time. >> 99.9% uptime, and if, >> SLA. >> Oliver: SLA, and so that's a guarantee to our customers. So if anything we're, >> John: One more time. >> It's a guarantee to our customers. >> No, this is important. SLA, I mean, Google Next didn't talk much about that last week at their Cloud event. They talked about speeds and feeds, >> Exactly >> Not a lot of SLAs. This is a mandate for the enterprise. They care more about SLAs, so, not that they don't care about price, but they'd much rather have solid, bulletproof SLAs than the best price. 'Cause of the total cost of ownership. >> Right.
And that's really the heritage of where Microsoft comes from. We have been serving our on-premises customers for so long, we understand what they want and need and require for a mission-critical, enterprise-ready deployment. And so, our relationship with Hortonworks: absolutely a 99.9% service-level agreement that we will guarantee to our customers, across all of the Hadoop workloads, whether it's Hive, whether it's Spark, whether it's Kafka, any of the workloads that we have on HD Insight. It's enterprise-ready by default: mission-critical, built in, all that stuff that you would expect. >> Yeah, you guys certainly have a great track record with enterprise. No debate about that, 100%. Um, back to you guys, I want to take a step back and look at some things we've been observing, kicking off this week at Strata Hadoop. This is our eighth year covering it; Hadoop World now has evolved into a whole huge thing, with Big Data SV and Big Data NYC that we run as well. The bets that were made. And so, I've been intrigued by HD Insight from day one. >> Yep. >> Especially the relationship with Microsoft. It got our attention right away, because of where we saw the dots connecting, which is kind of where we are now. That's a good bet. We're looking at what bets were made, who's making which bets when, and how they're panning out, so I want to just connect the dots. Bets that you guys have made that are now paying off, and certainly we've talked before on camera about Revolution Analytics. Obviously, now, looking real good, middle of the fairway as they say. Bets you guys have made that, hey, that was a good call. >> Right, and we think that first and foremost, we set out early to support machine learning. We don't call it AI, but we were probably the first to put Spark, right, in Hadoop. I know that Spark has gained a lot of traction, but I remember that in the early days, we were the ones, as a distro, going out there not only just verbally talking about support of Spark, but truly putting it in our distribution as one of the components. We actually now, in the last version, also allow flexibility. You know Spark, how often they change: every six weeks they have a new version. And that kind of runs into a paradox of what enterprise-ready actually is. Within six weeks, they can't even roll out an entire process, right? If they have a workload, they probably can't even get everyone to adopt it yet, within six weeks. So what we did in the last version, and what we will continue to do, is essentially support multiple versions of Spark. The other bet we have made is on Hive. We truly made that an initiative, behind the Stinger project, and Hive now ties in with LLAP. We made the effort to join in with all the other open-source developers to get behind this project, to make sure that SQL is becoming truly available for our customers. Not only just affordable, but also having the most comprehensive coverage for SQL, including SQL:2011. And also now having almost sub-second interactive query. So I think those are the kinds of bets we made. >> Yeah, I guess the compatibility of SQL, then you've got the performance advantage going on, and this database world where it's in memory or it's SSD. That seems to be the action. >> Wei: Yeah. >> Oliver, you guys made some good bets. So, let's go down the list.
I always kind of want to go back to our partnership with Hortonworks. We partnered with Hortonworks really early on, in the early days of Hortonworks' existence. And the reason we made that bet was because of Hortonworks' strategy of being completely open. Right, and so that was a key decision criteria for Microsoft. That we wanted to partner with someone whose entire philosophy was open-source, and committing everything back to the Apache ecosystem. And so that was a very strategic bet that we made. >> John: It was bold at the time, too. >> It was very bold, at the time, yeah. Because Hortonworks at that time was a much smaller company than they are today. But we kind of understood of where the ecosystem was going, and we wanted to partner with people who were committing code back into the ecosystem. So that, I would argue, is definitely one really big bet that was a very successful one and continues to play out even today. Other bets that we've made and like we've talked about prior is our acquisition of Revolution Analytics a couple years ago and that's, >> R just keeps on rolling, it keeps on rolling, rolling, rolling. Awesome. >> Absolutely. Yeah. >> Alright, final words. Why don't we get updated on the data science experiences you guys have. Is there any update there? What's going on, what seems to be, the data science tools are accelerating fast. And, in fact, some are saying that looks like the software tools years and years ago. A lot more work to do. So what's happening with the data science experience? >> Yeah absolutely and just tying back to that original comment around R, Revolution Analytics. That has become Microsoft, our server. And we're offering that, available on-premises and in the Cloud. So on-premises, it's completely integrated with SQL server. So all SQL server customers will now be able to do in-database analytics with R built-in-to-the-core database. And that we see as a major win for us, and a differentiator in the marketplace. But in the Cloud, in conjunction with our partnership with Hortonworks, we're making Microsoft R server, available as part of our integration with Azure HD Insights. So we're kind of just tying back all that integration that we talked about. And so that's built in, and so any customer can take R, and paralyze that across any number of Hadoop and Sparknotes in a managed service within minutes. Clusters will spin up, and they can just run all their data science models and train them across any number of Hadoop and Sparknotes. And so that is, >> John: That takes the heavy lifting away on the cluster management side, so they can focus on their jobs. >> Oliver: Absolutely. >> Awesome. Well guys, thanks for coming on. We really appreciate Wei Wang with Hortonworks, and we have Oliver Chiu from Microsoft. Great to get the update, and tomorrow 10:30, the CloudFirst news hits. CloudFirst, Hortonworks with Azure, great news, congratulations, good Cloud play for Hortonworks. To CUBE, I'm John Furrier with George Gilbert. More coverage live in Silicon Valley after this short break.
Ben Sharma, Tony Fisher, Zaloni - BigData SV 2017 - #BigDataSV - #theCUBE
>> Announcer: Live from San Jose, California, it's The Cube, covering Big Data Silicon Valley 2017. (rhythmic music) >> Hey, welcome back, everyone. We're live in Silicon Valley for Big Data SV, Big Data Silicon Valley, in conjunction with Strata + Hadoop. This is the week where it all happens in Silicon Valley around the emergence of Big Data as it goes to the next level. The Cube is actually on the ground covering it like a blanket. I'm John Furrier. My cohost, George Gilbert, with Wikibon. And our next guest, we have two executives from Zaloni: Ben Sharma, who's the founder and CEO, and Tony Fisher, SVP of strategy. Guys, welcome back to The Cube. Good to see you. >> Thank you for having us back. >> You guys are great guests. You're in New York for Big Data NYC, and a lot is going on, certainly, here, and it's just getting kicked off with Strata + Hadoop; they've got the sessions today, but you guys have already got some news out there. Give us the update. What's the big discussion at the show? >> So yeah, 2016 was a great year for us. A lot of growth. We tripled our customer base, and there's a lot of interest in data lakes, as customers are going from, say, pilots and POCs into production implementations. And in conjunction with that, this week we launched a solution named Data Lake in a Box, appropriately, right? So what that means is we're bringing the full stack together to customers, so that we can get a data lake up and running in an eight-week time frame, with enterprise-grade data ingestion from their source systems hydrated into the data lake and ready for analytics. >> So is it a pretty big box, and is it waterproof? (all laughing) I mean, this is the big discussion now, pun intended. But the data lake is evolving, so I wanted to get your take on it. This has kind of been a theme that's been leading up and is now front and center here on The Cube. Already the data lake has changed; also we've heard, I think Dave Vellante in New York said, data swamp. But using the data is critical on a data lake. So as it goes to a more mature model of leveraging the data, what are the key trends right now? What are you guys seeing? Because this is a hot topic that everyone is talking about. >> Well, that's a good distinction that we like to make: the difference between a data swamp and a data lake. >> And a data lake is much more governed. It has the rigor, it has the automation, it has a lot of the concepts that people are used to from traditional architectures, only we apply them in a scale-out architecture. So we put together a maturity model that really maps out a customer's journey throughout the big data and data lake experience. At each phase of this, we can see what the customer's doing, what their trends are and where they want to go, and we can advise them on the right way to move forward. A lot of the customers we see are in what we call the ignore stage. I'd say most of the people we talk to are just ignoring: they don't have things active, but they're doing a lot of research. They're trying to figure out what's next. And we want to move them from there. The next stage up is called store. And store is basically just the sandbox environment: "I'm going to stick stuff in there." "I'm going to hope something comes out of it." No collaboration. But then, moving forward, there's the managed phase, the automated phase, and the optimized phase. And our goal is to move them up into those phases as quickly as possible.
And Data Lake in a Box is an effort to do that, to leapfrog them into a managed data lake environment. >> So that's kind of where the swamp analogy comes in, because the data lake, the swamp, is kind of dirty, where you can almost think, "Okay, the first step is store it." And then they get busy or they try to figure out how to operationalize it, and then it's kind of like, "Uh ..." So your point, they're trying to get to that. So you guys get 'em to that setup, and then move them quickly to value? Is that kind of the approach? >> Yeah. So, time to value is critical, right? How do you reduce the time to insight, from the time the data is produced by the data producer till the time you can make the data available to the data consumer for analytics and downstream use cases? That's kind of our core focus in bringing these solutions to the market. >> Dave Vellante and I were talking, and George always talks about, the value of data at the right time in the right place being the critical linchpin for the value, whether it's app-driven or whatever. So with the data lake, you never know what data in the data lake will need to be pulled out and put into either real time or an app. So you have to assume at any given moment there's going to be data value. >> Sure >> So that, conceptually, people can get. But how do you make that happen? Because that's a really hard problem. How do you guys tackle that when a customer says, "Hey, I want to do the data lake. "I've got to have the coverage. "I got to know who's accessing stuff. "But at the end of the day, "I got to move the data to where it's valuable." >> Sure. So the approach we have taken is an integrated platform with a common metadata layer. Metadata is the key. Using this common metadata layer, being able to do managed ingestion from various different sources, being able to do data validation and data quality, being able to manage the life cycle of the data, and being able to generate these insights about the data itself, so that you can use it effectively for data science or for downstream applications and use cases, is critical, based on our experience of taking these applications from, say, a POC pilot phase into a production phase. >> And what's the next step, once you guys get to that point with the metadata? Because, like, I get that, it's like everyone's got the metadata focus. Now, I'm the data engineer, the data eng or the geek, the supergeek, and then you've got the data scientists, then the analysts, then there will probably be a new category, a bot or something AI will do something. But you can have a spectrum of applications on the data side. How do they get access to the metadata? Is it through the machine learning? Do you guys have anything unique there that makes that seamless, or is that the end goal? >> Sure, do you want to take that? >> Yes, sure, it's a multi-pronged answer, but I'll start and you can jump in. One of the things we provide as part of our overall platform is a product called Mica. And Mica is really the on-ramp to the data. All those people that you just named, we love them all, but their access to the data is through a self-service data preparation product, and key to that is the metadata repository.
So, all the metadata is out there; we call it a catalog at that point. They can go in, look at the catalog, get a sense for the data, get an understanding of the form and function of the data, see who uses it, see where it's used, and determine if that's the data they want. If it is, they have the ability to refine it further, or they can put it in a shopping cart; if they have access to it, they can get it immediately and refine it, and if they don't have access to it, there's an automatic request so they can get access. And so it's an on-ramp concept: having a card catalog of all the information that's out there, how it's being used, how it's been refined, to allow the end user to make sure that they've got the right data and can be positioned for their ultimate application. >> And just to add to what Tony said: because we are using this common metadata layer, and capturing metadata at every instance, if you will, we are serving it up to the data consumers using a rich catalog, so a lot of our enterprise customers are now starting to create what they consider a data marketplace or a data portal within their organization. They're able to catalog not just the data that's in the data lake, but also data that's in other data stores, and provide one single unified view of these data sets, so that your data scientists can come in and see: is this a data set that I can use for my model building? What are the different attributes of this data set? What is the quality of the data? How fresh is the data? Those kinds of traits, so that they are effective in their analytical journey. >> I think that's the key thing that's interesting to me. You've seen the big data explosion over the past ten years, eight years; we've been covering The Cube since the Hadoop world started. But now, it's the data set world, so it's big data sets in this market. The data sets are the key, because that's what data scientists want to wrangle around with, and sling data sets with whatever tooling they want to use. Is that kind of the same trend that you guys see? >> That's correct. And also what we're seeing in the marketplace is that customers are moving from a single architecture to a distributed architecture, where they may have a hybrid environment with some things being instantiated in the Cloud and some things being on-prem. So how do you now provide a unified interface across these multiple environments, in a governed way, so that the right people have access to the right data, and it's not the data swamp? >> Okay, so let's go back to the maturity model, because I like that framework. So now you've just complicated the heck out of it. Now you've got Cloud, and then on-prem, and then now, how do you put that prism of the maturity model on hybrid? How does that cross-connect there? And a second follow-up to that is, where are the customers on this progress bar? I'm sure they're different by customer, but, so, maturity model to the hybrid, and then trends in the customer base that you're seeing? >> Alright, I'll take the second one, and then you can take the first one, okay? So, the vast majority of the people that we work with, and the prospects, customers, and analysts we've talked to, other industry dignitaries, they put the vast majority of the customers in the ignore stage. Really just doing their research. So a good 50% plus of most organizations are still in that stage.
And then the data swamp environment, where I'm using it to store stuff and hopefully I'll get something good out of it: that's another 25% of the population. And so, most of the customers are there, and we're trying to move them kind of rapidly up and into a managed and automated data lake environment. The other trend along these lines that we're seeing, that's pretty interesting, is the emergence of IT in the big data world. It used to be a business user's world, and business users built these sandboxes, and business users did what they wanted to. But now, we see organizations that are really starting to bring IT into the fold, because they need the governance, they need the automation, they need the type of rigor that they're used to in other data environments, and that has been lacking in the big data environment. >> And you've got folks cracking the code on the IOT side, which has created another dimension of complexity. On the numbers, the 50% that ignore, is that profile more Fortune 1000? >> It's larger companies, it's the Fortune and Global 2000. >> Got it, okay, and in terms of the hybrid maturity model, how's that? And add a third dimension, IOT; we've got a multi-dimensional chess game going here. >> I think the way we think about it is that there are different patterns of data sets coming in. They could be batch, they could be files or database extracts, or they could be streams, right? So as long as you think about a converged architecture that can handle these different patterns, then you can map different use cases, whether they are IOT and streaming use cases, versus, what we are seeing is that a lot of companies are trying to replace their operational analytics platforms with a data lake environment, and they're building their operational analytics on top of the data lake, correct? So you need to think more from an abstraction layer: how do you abstract it out? Because one of the challenges that we see customers facing is that they don't want to get sticky with one Cloud service provider, because they may have multiple Cloud service providers, >> John: It's a multi-Cloud world right now. >> So how do you leverage that, where you have one Cloud service provider in one geo, another Cloud service provider in another geo, and still be able to have an abstraction layer on top of it, so that you're building applications? >> So do you guys provide that data layer across that abstraction? >> That is correct, yes. We leverage the ecosystem, but what we do is add the data management and data governance layer; we provide that abstraction, so that you can be on-prem, you can be in Cloud service provider one, or Cloud service provider two. You still have the same controls and the same governance functions as you build your data lake environment. >> And this is consistent with some of the Cube interviews we had all day today, and other Cube interviews, where when you have the Cloud, you're renting, basically, but you own your data. You get to have a nice ... And that metadata seems to be the key, that's the key, right? For everything. >> That's right. And now what we're seeing is that a lot of our enterprise customers are looking at bringing some of the public cloud infrastructure into their on-prem environment, as it becomes available in appliances and things like that, right?
So how do you then make sure that whatever you're doing in the public cloud environment, you are also able to extend to the enterprise-- >> And the consequence to the enterprise is multiple jobs, if they don't have a consistent data layer ... >> Sure, yeah. >> It's just more redundancy. >> Exactly. >> Not redundancy, duplication actually. >> Yeah, duplication and difficulty of rationalizing it together. >> So let me drill down into a little more detail on the transition between these maturity phases and the movement into production apps. I'm curious to know: we've heard of Tableau, Excel, Power BI, Qlik, I guess, sort of adapting to being front ends to big data. But for their experience to work, they can't really handle big data sets. So you need the MPP SQL database on the data lake. And I guess the question there is, is there value to be gotten, or measurable value to be gotten, just from turning the data lake into, you know, an interactive BI kind of platform? And sort of as the first step along that maturity model. >> One of the patterns we are seeing is that the serving layer is becoming more and more mature in the data lake. Earlier it used to be mainly batch-type workloads. Now, with MPP engines running on the data lake itself, you are able to connect your existing BI applications, whether it's Tableau, Qlik, Power BI, or others, to these engines, so that you are able to get low-latency query response times and slice and dice your data sets in the data lake itself. >> But you're essentially still, you have to sample the data. You can't handle the full data set unless you're working with something like Zoomdata. >> Yeah, so there are physical limitations, obviously. And then there is also this next generation of BI tools which work in a converged manner in the data lake itself. There's Zoomdata, Arcadia, and others that are able to run inside the data lake itself, instead of you having to have an external environment like the other BI tools, so we see that as a pattern. But if you already have a BI platform on board in the enterprise, how do you leverage that with the data lake as part of the next-generation architecture is a key trend that we are seeing. >> So your metadata helps make that transition from swamp to curated data lake. >> That's right. And not only that: as Tony was mentioning, in our Mica product we have a self-service catalog, and then we provide a shopping cart experience where you can actually source data sets into the shopping cart, and we let them provision a sandbox. And when they provision the sandbox, they can actually launch Tableau, or whatever the BI tool of choice is, on that sandbox. And that sandbox could exist in the data lake, or it could exist on a relational data store or an MPP data store that's outside of the data lake. That's part of your modern data architecture.
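A rough Python sketch of the catalog-and-shopping-cart idea just described follows; these are not Zaloni's actual APIs, and the entry fields and access-control shape are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    attributes: list        # columns a data scientist would inspect
    quality_score: float    # e.g. share of rows passing validation
    last_refreshed: str
    location: str           # data lake, relational store, or MPP store

@dataclass
class Cart:
    user: str
    items: list = field(default_factory=list)

    def add(self, entry, acl):
        """Add a data set if permitted, else kick off an access request."""
        if self.user in acl.get(entry.name, []):
            self.items.append(entry.name)
            return "added"
        return "access requested"  # would trigger an approval workflow

catalog = {"telematics_curated": CatalogEntry(
    "telematics_curated", ["truck_id", "engine_temp_c"], 0.98,
    "2017-03-13", "data-lake")}
acl = {"telematics_curated": ["analyst-1"]}

cart = Cart(user="analyst-1")
print(cart.add(catalog["telematics_curated"], acl))  # "added"
```

Provisioning the sandbox would then materialize the cart's items wherever the chosen BI tool runs.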
I mean, obviously, we've been seeing the consolidation. You're starting to see the swim lanes of with Spark and Open Source and you see the cloud and IOT colliding, there's a huge intersection with deep learning, AI is certainly hyped up now beyond all recognition but it's essentially deep learning. Neural networks meets machine learning. That's been around before, but now freely available with Cloud and Compute. And so kind of a interesting dynamic that's rockin' the big data world. Your thoughts on what we're going to see this week and how that relates to the industry? >> I'll take a stab at it and you may feel free to jump in. I think what we'll see is that lot of customers that have been playing with big data for a couple of years are now getting to a point where what worked for one or two use cases now needs to be scaled out and provided at an enterprise scale. So they're looking at a managed and a governance layer to put on top of the platform. So they can enable machine learning and AI and all those use cases, because business is asking for them. Right? Business is asking for how they can bring intenser flow and run on the data lake itself, right? So we see those kind of requirements coming up more and more frequently. >> Awesome. Tony? >> What he said. >> And enterprise readiness certainly has to be table-- there's a lot of table stakes in the enterprise. It's not like, easy to get into, you can see Google kind of just putting their toe in the water with the Google cloud, tenser flow, great highlight they got spanner, so all these other things like latency rearing their heads again. So these are all kind of table stakes. >> Yeah, and the other thing, moving forward with respect to machine learning and some of the advanced algorithms, what we're doing now and some of the research we're doing is actually using machine learning to manage the data lake, which is a new concept, so when we get to the optimized phase of our maturity model, a lot of that has to do with self-correcting and self-automating. >> I need some machine learning and some AI, so does George and we need machine learning to watch the machine learn, and then algorithmists for algorithms. It's a crazy world, exciting time for us. >> Are we going to have a bot next time when we come here? (all laughing) >> We're going to chat off of messenger, we just came from south by southwest. Guys, thanks for coming on The Cube. Great insight and congratulations on the continued momentum. This is The Cube breakin' it down with experts, CEOs, entrepreneurs, all here inside The Cube. Big Data Sv, I'm John for George Gilbert. We'll be back after this short break. Thanks! (upbeat electronic music)
Yuanhao Sun, Transwarp Technology - BigData SV 2017 - #BigDataSV - #theCUBE
>> Announcer: Live from San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017. (upbeat percussion music) >> Okay, welcome back everyone. Live here in Silicon Valley, San Jose, is the Big Data SV, Big Data Silicon Valley in conjunction with Strata Hadoop, this is theCUBE's exclusive coverage. Over the next two days, we've got wall-to-wall interviews with thought leaders, experts breaking down the future of big data, future of analytics, future of the cloud. I'm John Furrier with my co-host George Gilbert with Wikibon. Our next guest is Yuanhao Sun, who's the co-founder and CTO of Transwarp Technologies. Welcome to theCUBE. You were on theCUBE previously, 166 days ago, I noticed. But now you've got some news. So let's get the news out of the way. What are you guys announcing here, this week? >> Yes, so we are announcing 5.0, the latest version of Transwarp Data Hub. So in this version, we would call it probably a revolutionary product, because the first thing is we embedded Kubernetes in our product, so we will allow people to isolate different kinds of workloads using Docker containers, and we also provide a scheduler to better support mixed workloads. And the second is, we are building a set of tools that allow people to build their data warehouse, and then migrate from an existing or traditional data warehouse to Hadoop. And we are also providing people the capability to build a data mart, actually. It allows you to interactively query data. So we built a column store in memory and on SSD, and we totally rewrote the whole SQL engine. That is a very tiny SQL engine that allows people to query data very quickly. And so today that tiny SQL engine is about five to ten times faster than Spark 2.0. And we also allow people to build cubes on top of Hadoop. And then, once the cube is built, the SQL performance, like the TPC-H performance, is about 100 times faster than the existing database, or existing Spark 2.0. So it's super-fast. And actually we found a pilot customer, so they replaced their Teradata software to build a data mart. And we already migrated, say, 100 reports from their Teradata to our product. So the promise is very good. And the third one is we are providing tools for people to build machine learning pipelines, and we are leveraging TensorFlow, MXNet, and also Spark, for people to visualize the pipeline and to build the data mining workflows. So this is kind of like data science tools, it's very easy for people to use. >> John: Okay, so take a minute to explain, 'cus that was great, you got the performance there, that's the news out of the way. Take a minute to explain Transwarp, your value proposition, and when people engage you as a customer. >> Yuanhao: Yeah so, people choose our product, and the major reason is our compatibility to Oracle, DB2, and Teradata SQL syntax, because you know, they have built a lot of applications on those databases, so when they migrate to Hadoop, they don't want to rewrite the whole program, so our compatibility, SQL compatibility, is a big advantage to them, so this is the first one. And we also support full ACID and distributed transactions on Hadoop, so that a lot of applications can be migrated to our product with few modifications or without any changes. So this is our first advantage. The second is because we are providing maybe the best streaming engine, that is actually derived from Spark. So we apply this technology to IoT applications.
You know, in IoT pretty soon they need very low latency, but they also need very complicated models on top of streams. So that's why we are providing full SQL support and machine learning support on top of streaming events. And we are also using event-driven technology to reduce the latency to five to ten milliseconds. So this is the second reason people choose our product. And then today we are announcing 5.0, and I think people will find more reasons to choose our product. >> So you have the SQL compatibility, you have the tooling, and now you have the performance. So kind of the triple threat there. So what's the customer saying, when you go out and talk with your customers, what's the view of the current landscape for customers? What are they solving right now, what are the key challenges and pain points that customers have today? >> We have customers in more than 12 vertical segments, and in different verticals they have different pain points, actually so. Take one example: in financial services, the main pain point for them is to migrate existing legacy applications to Hadoop, you know, they have accumulated a lot of data, and the performance is very bad using legacy databases, so they need high performance Hadoop and Spark to speed up the performance, like reports. But in another vertical, like in logistics and transportation and IoT, the pain point is to find a very low latency streaming engine. At the same time, they need a very complicated programming model to write their applications. Another example, like in the public sector, they actually need a very complicated and large-scale search engine. They need to build analytical capability on top of the search engine. They can search the results and analyze the results at the same time. >> George: Yuanhao, as always, whenever we get to interview you on theCube, you toss out these gems, sort of like, you know, diamonds, like big rocks that under millions of years and incredible pressure have been squeezed down into these incredibly valuable, kind of, you know, sort of minerals with lots of goodness in them, so I need you to unpack that diamond back into something that we can make sense out of, or I should say, that's more accessible. You've done something that none of the Hadoop Distro guys have managed to do, which is to build databases that are not just decision support, but can handle OLTP, can handle operational applications. You've done the streaming, you've done what even Databricks can't do without even trying any of the other stuff, which is getting the streaming down to an event at a time. Let's step back from all these amazing things, and tell us what was the secret sauce that let you build a platform this advanced? >> So actually, we are driven by our customers, and we do see the trends, people are looking for better solutions, you know, there is a lot of pain to set up a Hadoop cluster to use the Hadoop technology. So that's why we found it's very meaningful and also very necessary for us to build a SQL database on top of Hadoop. Quite a lot of customers on the FSI side, they ask us to provide ACID, so that transactions can be put on top of Hadoop, because they have to guarantee the consistency of their data. Otherwise they cannot use the technology. >> At the risk of interrupting, maybe you can tell us: others have built analytic databases on top of Hadoop to give the familiar SQL access, and obviously there's a desire also to have transactions next to it, so you can inform a transaction decision with the analytics.
One of the questions is, how did you combine the two capabilities? I mean it only took Oracle like 40 years. >> Right, so. Actually our transaction capability is only for analytics, you know, so this OLTP capability is not for short transactional applications, it's for data warehouse kinds of workloads. >> George: Okay, so when you're ingesting. >> Yes, when you're ingesting, when you modify your data, in batch, you have to guarantee the consistency. So that's the OLTP capability. But we are also building another distributed storage and distributed database, and that will provide OLTP capability. That means you can do concurrent transactions on that database, but we are still developing that software right now. Today our product provides the distributed transaction capability for people to actually build their data warehouse. You know, quite a lot of people believe a data warehouse does not need transaction capability, but we found a lot of people modify their data in the data warehouse, you know, they are loading their data continuously to the data warehouse, like the CRM tables, customer information, they can be changed over time. So every day people need to update or change the data, that's why we have to provide transaction capability in the data warehouse. >> George: Okay, and then so then well tell us also, 'cus the streaming problem is, you know, we're told that roughly two thirds of Spark deployments use streaming as a workload. And the biggest knock on Spark is that it can't process one event at a time, you got to do a little batch. Tell us some of the use cases that can take advantage of doing one event at a time, and how you solved that problem? >> Yuanhao: Yeah, so the first use case we encountered is the anti-fraud, or fraud detection, application in FSI, so whenever you swipe your credit card, the bank needs to tell you if the transaction is a fraud or not in a few milliseconds. But if you are using Spark streaming, it will usually take 500 milliseconds, so the latency is too high for such kind of application. And that's why we have to provide event-at-a-time, meaning event-driven, processing to detect the fraud, so that we can interrupt the transaction in a few milliseconds, so that's one kind of application. The others come from IoT applications, so we already put our streaming framework in a large manufacturing factory. So they have to detect the malfunction of their equipment in a very short time, otherwise it may explode. So if you... So if you are using Spark streaming, probably when you submit your application, it will take you hundreds of milliseconds, and when you finish your detection, it usually takes a few seconds, so that will be too long for such kind of application. And that's why we need a low latency streaming engine. But you can say it is okay to use Storm or Flink, right? And the problem is, we found, they need a very complicated programming model: they are going to solve equations on the streaming events, they need to do FFT transformations. And they are also asking to run some linear regression or some neural network on top of events, so that's why we have to provide a SQL interface, and we are also embedding CEP capability into our streaming engine, so that you can use patterns to match the events and to send alerts. >> George: So, SQL to get a set of events and maybe join some in the complex event processing, CEP, to say, does this fit a pattern I'm looking for? >> Yuanhao: Yes.
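As an illustration of the event-at-a-time, CEP-style processing Sun is describing, here is a minimal self-contained sketch: each event is scored individually on arrival rather than in a micro-batch, and a simple sliding-window pattern flags a suspicious card. The rule, field names, and thresholds are invented for the example and are not Transwarp's implementation.

```python
# Per-event fraud scoring with a toy CEP-style pattern match:
# too many distinct cities on one card within a short window.
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_DISTINCT_CITIES = 2  # more than 2 cities in a minute looks fraudulent

recent = defaultdict(deque)  # card_id -> deque of (timestamp, city)

def on_event(event):
    """Called once per event, as it arrives, not in a micro-batch."""
    card, ts, city = event["card"], event["ts"], event["city"]
    window = recent[card]
    window.append((ts, city))
    # Evict events that have fallen out of the sliding window.
    while window and ts - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    # CEP-style pattern: distinct cities in the window exceed the limit.
    if len({c for _, c in window}) > MAX_DISTINCT_CITIES:
        return "FRAUD"   # interrupt the transaction within milliseconds
    return "OK"

# Three cities within one minute triggers the alert on the third event.
events = [
    {"card": "c1", "ts": 0,  "city": "Boston"},
    {"card": "c1", "ts": 20, "city": "Shanghai"},
    {"card": "c1", "ts": 40, "city": "Paris"},
]
print([on_event(e) for e in events])  # ['OK', 'OK', 'FRAUD']
```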
>> Okay, and so, and then with the lightweight OLTP, that and any other new projects you're looking at, tell us perhaps the new use cases it would be appropriate for. >> Yuanhao: Yeah, so that's for our future product actually, so we are going to solve the problem of large scale OLTP transactions, like, so you know... You know, in China, there is such a large population, like in the public sector or in banks, they need to build highly scalable transaction systems so that they can support very high concurrent transactions at the same time, so that's why we are building such kind of technology. You know, in the past, people just divided transactions across multiple databases, like multiple Oracle instances or multiple MySQL instances. But the problem is: if the application is simple, you can very easily divide a transaction over the multiple instances of databases. But if the application is very complicated, especially when the ISV already wrote the application based on Oracle or a traditional database, they already depend on the transaction semantics, so that's why we have to build the same kind of transaction system, so that we can support their legacy applications, but they can scale to hundreds of nodes, and they can scale to millions of transactions per second. >> George: On the transactional stuff? >> Yuanhao: Yes. >> Just correct me if I'm wrong, I know we're running out of time, but I thought Oracle only scales out when you're doing decision support work, not when you're doing OLTP, not that it, that it can only, that it can maybe stretch to ten nodes or something like that, am I mistaken? >> Yuanhao: Yes, they can scale to 16 or, at most, 32 nodes. >> George: For transactional work? >> For transactional work, but that's the theoretical limit, but you know, like Google F1 and Google Spanner, they can scale to hundreds of nodes. But you know, the latency is higher than Oracle because you have to use a distributed protocol to communicate with multiple nodes, so the latency is higher. >> On Google? >> Yes. >> On Google. The latency is higher on the Google? >> 'Cus it has to go like all the way to Europe and back. >> Oracle or Google latency, you said? >> Google, because if you are using a two-phase commit protocol you have to talk to multiple nodes, to broadcast your request to multiple nodes and then wait for the feedback, so that means you have a much higher latency, but it's necessary to maintain the consistency. So in distributed OLTP databases, the latency is usually higher, but the concurrency is also much higher, and the scalability is much better. >> George: So that's a problem you've stretched beyond what Oracle's done. >> Yuanhao: Yes, so because customers can tolerate the higher latency, but they need to scale to millions of transactions per second, so that's why we have to build a distributed database. >> George: Okay, for this reason we're going to have to have you back for like maybe five or ten consecutive segments, you know, maybe starting tomorrow. >> We're going to have to get you back for sure. Final question for you: What are you excited about, from a technology, in the landscape, as you look at open source, you're working with Spark, you mentioned Kubernetes, you have microservices, all the cloud. What are you most excited about right now in terms of new technology that's going to help simplify and scale, with low latency, the databases, the software? 'Cus you got IoT, you got autonomous vehicles, you have all this data, what are you excited about?
>> So actually, we already solve these problems with this technology, but I think the most exciting thing is we found... There are two trends. The first trend is: we found it's very exciting to see more computation frameworks coming out, like the AI frameworks, like TensorFlow and MXNet, Torch, and tons of such machine learning frameworks are coming out, so they are solving different kinds of problems, like facial recognition from video and images, like human computer interaction using voice, using audio. So it's very exciting, I think. And also, we found it's very exciting that we are embedding these, we are combining these technologies together, so that's why we are using containers, you know. We didn't use YARN, because it cannot support TensorFlow or other frameworks, but you know, if you are using containers and if you have a good scheduler, you can schedule any kind of computation framework. So we found it's very interesting to have these new frameworks, and we can combine them together to solve different kinds of problems. >> John: Thanks so much for coming onto theCube, it's an operating system world we're living in now, it's a great time to be a technologist. Certainly the opportunities are out there, and we're breaking it down here inside theCube, live in Silicon Valley, with the best tech executives, best thought leaders and experts here inside theCube. I'm John Furrier with George Gilbert. We'll be right back with more after this short break. (upbeat percussive music)
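The latency trade-off Sun describes for Google F1/Spanner-style systems comes from the two-phase commit round trips. Here is a toy sketch of a coordinator, with in-process objects standing in for remote participants: every write pays a PREPARE broadcast and wait, then a COMMIT broadcast, which is where both the extra network latency and the stronger consistency come from. The class and field names are invented for illustration; a real system would add logging, timeouts, and recovery.

```python
# Toy two-phase commit: one coordinator, several participants.
class Participant:
    def __init__(self, name):
        self.name = name
        self.staged = None

    def prepare(self, txn):
        self.staged = txn          # stage the write, vote yes
        return True

    def commit(self):
        print(f"{self.name}: committed {self.staged}")
        self.staged = None

    def abort(self):
        self.staged = None

def two_phase_commit(participants, txn):
    # Phase 1: broadcast PREPARE and wait for every vote (one round trip).
    votes = [p.prepare(txn) for p in participants]
    if not all(votes):
        for p in participants:
            p.abort()
        return False
    # Phase 2: broadcast COMMIT and wait for acks (a second round trip).
    for p in participants:
        p.commit()
    return True

nodes = [Participant(f"node-{i}") for i in range(3)]
two_phase_commit(nodes, {"account": 42, "delta": -100})
```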
Arik Pelkey, Pentaho - BigData SV 2017 - #BigDataSV - #theCUBE
>> Announcer: Live from San Jose, California, it's the Cube covering Big Data Silicon Valley 2017. >> Welcome back, everyone. We're here live in Silicon Valley in San Jose for Big Data SV, in conjunction with Strata Hadoop, part two. Three days of coverage here in Silicon Valley and Big Data. It's our eighth year covering Hadoop and the Hadoop ecosystem. Now expanding beyond just Hadoop into AI, machine learning, IoT, cloud computing, with all this compute that is really making it happen. I'm John Furrier with my co-host George Gilbert. Our next guest is Arik Pelkey, who is the senior director of product marketing at Pentaho, that we've covered many times, and covered their event at PentahoWorld. Thanks for joining us. >> Thank you for having me. >> So, in following you guys, I see Pentaho was once an independent company, bought by Hitachi, but still an independent group within Hitachi. >> That's right, very much so. >> Okay, so you guys got some news. Let's just jump into the news. You guys announced some of the machine learning. >> Exactly, yeah. So, Arik Pelkey, Pentaho. We are a data integration and analytics software company. You mentioned you've been doing this for eight years. We have been at Big Data for the past eight years as well. In fact, we're one of the first vendors to support Hadoop back in the day, so we've been along for the journey ever since then. What we're announcing today is really exciting. It's a set of machine learning orchestration capabilities, which allows data scientists, data engineers, and data analysts to really streamline their data science processes. Everything from ingesting new data sources through data preparation, feature engineering, which is where a lot of data scientists spend their time, through tuning their models, which can still be programmed in R, in Weka, in Python, and any other kind of data science tool of choice. What we do is we help them deploy those models inside of Pentaho as a step inside of Pentaho, and then we help them update those models as time goes on. So, really what this is doing is it's streamlining. It's making them more productive so that they can focus their time on things like model building rather than data preparation and feature engineering. >> You know, it's interesting. The market is really active right now around machine learning, and even just last week at Google Next, which is their cloud event, they had made the acquisition of Kaggle, which is kind of an open data science community. You mentioned the three categories: data engineer, data scientist, data analyst. Almost on a progression, super geek to business facing, and there's different approaches. One of the comments from the CEO of Kaggle on the acquisition, when we wrote it up at SiliconANGLE, was, and I found this fascinating, I want to get your commentary and reaction to it: he says the data science tools are as early as generations ago, meaning that all the advances in open source and tooling and software development are far along, but data science is still at that early stage and is going to get better. So, what's your reaction to that? Because this is really the demand we're seeing: a lot of heavy lifting going on in the data science world, yet there's a lot of runway of more stuff to do. What is that more stuff? >> Right. Yeah, we're seeing the same thing. Last week I was at the Gartner Data and Analytics conference, and that was kind of the take there from one of their lead machine learning analysts: this is still really early days for data science software.
So, there's a lot of Apache projects out there. There's a lot of other open source activity going on, but there are very few vendors that bring to the table an integrated kind of full platform approach to the data science workflow, and that's what we're bringing to market today. Let me be clear, we're not trying to replace R, or Python, or MLlib, because those are the tools of the data scientists. They're not going anywhere. They spent eight years in their PhD programs working with these tools. We're not trying to change that. >> They're fluent with those tools. >> Very much so. They're also spending a lot of time doing feature engineering. Some research reports say between 70 and 80% of their time. What we bring to the table is a visual drag and drop environment to do feature engineering in a much faster, more efficient way than before. So, there's a lot of different kind of disparate, siloed applications out there that all do interesting things on their own, but what we're doing is we're trying to bring all of those together. >> And the trends are: reduce the time it takes to do stuff and take away some of those tasks that you can use machine learning for. What unique capabilities do you guys have? Talk about that for a minute, just what Pentaho is doing that's unique and added value to those guys. >> So, the big thing is I keep going back to the data preparation part. I mean, that's 80% of time, that's still a really big challenge. There's other vendors out there that focus on just the data science kind of workflow, but where we're really unique is around being able to accommodate very complex data environments, and being able to onboard data. >> Give me an example of those environments. >> Geospatial data combined with data from your ERP or your CRM system, and all kinds of different formats. So, there might be 15 different data formats that need to be blended together and standardized before any of that can really happen. That's the complexity in the data. So, Pentaho, very consistent with everything else that we do outside of machine learning, is all about helping our customers solve those very complex data challenges before doing any kind of machine learning. One example is one customer called Caterpillar Marine Asset Intelligence. So, they're doing predictive maintenance onboard container ships and on ferries. So, they're taking data from hundreds and hundreds of sensors onboard these ships, combining that kind of operational sensor data together with geospatial data, and then they're serving up predictive maintenance alerts, if you will, or giving signals when it's time to replace an engine or replace a compressor or something like that. >> Versus waiting for it to break. >> Versus waiting for it to break, exactly. That's one of the real differentiators, that very complex data environment, and then I was starting to move toward the other differentiator, which is our end to end platform, which allows customers to deliver these analytics in an embedded fashion. So, kind of full circle, being able to send that signal, but not to an operational system, which is sometimes a challenge because you might have to rewrite the code. Deploying models is a really big challenge; within Pentaho, because it is this fully integrated application, you can deploy the models and not have to jump out into a mainframe environment or something like that.
So, I'd say the differentiators are very complex data environments, and then this end to end approach where deploying models is much easier than ever before. >> Perhaps, let's talk about alternatives that customers might see. You have a tool suite, and others might have to put together a suite of tools. Maybe tell us some of the pitfalls; the geeky version would be the impedance mismatch. You know, like the chasms you'd find between each tool where you have to glue them together, so what are some of those pitfalls? >> One of the challenges is, you have these data scientists working in silos often times. You have data analysts working in silos, you might have data engineers working in silos. One of the big pitfalls is not really collaborating enough to the point where they can do all of this together. So, that's a really big area that we see pitfalls. >> Is it binary not collaborating, or is it that the round trip takes so long that the quality or number of collaborations is so drastically reduced that the output is of lower quality? >> I think it's probably a little bit of both. I think they want to collaborate, but one person might sit in Dearborn, Michigan and the other person might sit in Silicon Valley, so there's just a location challenge as well. The other challenge is, some of the data analysts might sit in IT and some of the data scientists might sit in an analytics department somewhere, so it kind of cuts across both location and functional area too. >> So let me ask from the point of view of, you know, we've been doing these shows for a number of years, and most people have their first data lakes up and running and their first maybe one or two use cases in production, very sophisticated customers have done more, but what seems to be clear is the highest value coming from those projects isn't to put a BI tool in front of them so much as to do advanced analytics on that data, apply those analytics to inform a decision, whether a person or a machine. >> That's exactly right. >> So, how do you help customers over that hump, and what are some other examples that you can share? >> Yeah, so speaking of transformative. I mean, that's what machine learning is all about. It helps companies transform their businesses. We like to talk about that at Pentaho. One customer kind of industry example that I'll share is a company called IMS. IMS is in the business of providing data and analytics to insurance companies so that the insurance companies can price insurance policies based on usage. So, it's a usage model. So, IMS has a technology platform where they put sensors in a car, and then using your mobile phone, can track your driving behavior. Then, your insurance premium that month reflects the driving behavior that you had during that month. In terms of transformative, this is completely upending the insurance industry, which has always had a very fixed approach to pricing risk. Now, they understand everything about your behavior. You know, are you turning too fast? Are you braking too fast? And they're taking it further than that too. They're able to now do kind of a retroactive look at an accident. So, after an accident, they can go back and kind of decompose what happened in the accident and determine whether or not it was your fault or was in fact the ice on the street. So, transformative? I mean, this is just changing things in a really big way. >> I want to get your thoughts on this. I'm just looking at some of the research. You know, we always have the good data, but there's also other data out there.
In your news, 92% of organizations plan to deploy more predictive analytics, however 50% of organizations have difficulty integrating predictive analytics into their information architecture, which is what the research shows. So my question to you is, there's a huge gap between the technology landscapes of front end BI tools and then complex data integration tools. That seems to be the sweet spot where the value's created. So, you have the demand, and then front end BI's kind of sexy and cool. Wow, I could power my business, but the complexity is really hard in the backend. Who's accessing it? What are the data sources? What's the governance? All these things are complicated, so how do you guys reconcile the front end BI tools and the backend complexity integrations? >> Our story from the beginning has always been this one integrated platform, both for complex data integration challenges together with visualizations, and that's very similar to what this announcement is all about for the data science market. We're very much in line with that. >> So, it's the cart before the horse? Is it like the BI tools are really driven by the data? I mean, it makes sense that the data has to be key. Front end BI could be easy if you have one data set. >> It's funny you say that. I presented at the Gartner conference last week and my topic was, this just in: it's not about analytics. Kind of in jest, but it drew a really big crowd. So, it's about the data, right? It's about solving the data problem before you solve the analytics problem, whether it's a simple visualization or it's a complex fraud machine learning problem. It's about solving the data problem first. To that quote, I think one of the things that they were referencing was the challenging information architectures into which companies are trying to deploy models, and so part of that is, when you build a machine learning model, you use R and Python and all these other ones we're familiar with. In order to deploy that into a mainframe environment, someone has to then recode it in C++ or COBOL or something else. That can take a really long time. With our integrated approach, once you've done the feature engineering and the data preparation using our drag and drop environment, what's really interesting is that you're like 90% of the way there in terms of making that model production ready. So, you don't have to go back and change all that code, it's already there because you used it in Pentaho. >> So obviously for those two technology groups I just mentioned, I think you had a good story there, but it creates problems. You've got product gaps, you've got organizational gaps, you have process gaps between the two. Are you guys going to solve that, or are you currently solving that today? There's a lot of little questions in there, but that seems to be the disconnect. You know, I can do this, I can do that, do I do them together? >> I mean, sticking to my story of one integrated approach to being able to do the entire data science workflow, from beginning to end, and that's where we've really excelled. To the extent that more and more data engineers and data analysts and data scientists can get on this one platform, even if they're using R and Weka and Python. >> You guys want to close those gaps down, that's what you guys are doing, right? >> We want to make the process more collaborative and more efficient. >> So Dave Vellante has a question on CrowdChat for you. Dave Vellante was in the snowstorm in Boston.
Dave, good to see you, hope you're doing well shoveling out the driveway. Thanks for coming in digitally. His question is, HDS has been known for mainframes and storage, but Hitachi is an industrial giant. How is Pentaho leveraging Hitachi's IoT chops? >> Great question, thanks for asking. Hitachi acquired Pentaho about two years ago, this is before my time. I've been with Pentaho for about ten months. One of the reasons that they acquired Pentaho is because of a platform that they've announced, which is called Lumada, which is their IoT platform, so what Pentaho is, is the analytics engine that drives that IoT platform Lumada. So, Lumada is about solving more of the hardware sensor side, bringing data from the edge into being able to do the analytics. So, it's an incredibly great partnership between Lumada and Pentaho. >> Makes an internal customer too. >> It's a 90 billion dollar conglomerate, so yeah, the acquisition's been great, and we're still very much an independent company going to market on our own, but we now have a much larger channel through Hitachi's reps around the world. >> You've got IoT's use case right there in front of you. >> Exactly. >> But you are leveraging it big time, that's what you're saying? >> Oh yeah, absolutely. We're a very big part of their IoT strategy. It's the analytics. Both of the examples that I shared with you are in fact IoT, not by design, but because there's a lot of demand. >> You guys seeing a lot of IoT right now? >> Oh yeah. We're seeing a lot of companies coming to us who have just hired a director or vice president of IoT to go out and figure out the IoT strategy. A lot of these are manufacturing companies or coming from industries that are inefficient. >> Digitizing the business model. >> So, the other point about Hitachi that I'll make, as it relates to data science, is that as a 90 billion dollar manufacturing and otherwise giant, we have a very deep bench of PhD data scientists that we can go to when there are very complex data science problems to solve at a customer site. So, if a customer's struggling with some of the basics, how do I get up and running doing machine learning, we can bring our bench of data scientists at Hitachi to bear in those engagements, and that's a really big differentiator for us. >> Just to be clear, and one last point, you've talked about how you handle the entire life cycle of modeling, from acquiring the data and prepping it all the way through to building a model, deploying it, and updating it, which is a continuous process. I think as we've talked about before, data scientists or just the DevOps community has had trouble operationalizing the end of the model life cycle, where you deploy it and update it. Tell us how Pentaho helps with that. >> Yeah, it's a really big problem, and it's a very simple solution inside of Pentaho. It's basically a step inside of Pentaho. So, in the case of fraud, let's say for example, a prediction might say fraud, not fraud, fraud, not fraud, whatever it is. We can then bring that kind of full lifecycle back into the data workflow at the beginning. It's a simple drag and drop step inside of Pentaho to say which were right and which were wrong, and feed that back into the next prediction. We could also take it one step further, where there has to be a manual part of this too, where it goes to the customer service center, they investigate and they say yes fraud, no fraud, and then that then gets funneled back into the next prediction.
So yeah, it's a big challenge, and it's something that's relatively easy for us to do just as part of the data science workflow inside of Pentaho. >> Well Arik, thanks for coming on The Cube. We really appreciate it, good luck with the rest of the week here. >> Yeah, very exciting. Thank you for having me. >> You're watching The Cube here live in Silicon Valley covering Strata Hadoop, and of course our Big Data SV event. We also have a companion event called Big Data NYC. We program with O'Reilly Strata Hadoop, and of course have been covering Hadoop really since it was founded. This is The Cube, I'm John Furrier. George Gilbert. We'll be back with more live coverage today for the next three days here inside The Cube after this short break.
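A rough sketch of the feedback loop Pelkey describes, confirmed fraud verdicts flowing back into the training data before the model is refit, might look like the following. In Pentaho this is a drag-and-drop step in the workflow; here it is approximated with scikit-learn on synthetic data, and all feature names and numbers are invented for illustration.

```python
# Train, predict, collect investigated verdicts, and refit the model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Initial training data: engineered features (e.g., amount z-score,
# distance from home, merchant risk) with known fraud labels.
X_train = rng.normal(size=(500, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 1.5).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Score new events as they flow through the workflow.
X_new = rng.normal(size=(20, 3))
predictions = model.predict(X_new)

# Feedback step: the investigation team confirms or corrects each
# verdict, and the corrected labels are appended to the training set.
verified_labels = predictions.copy()
verified_labels[0] ^= 1  # pretend one prediction was overturned

X_train = np.vstack([X_train, X_new])
y_train = np.concatenate([y_train, verified_labels])
model = LogisticRegression().fit(X_train, y_train)  # refit with feedback
```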
Yaron Haviv | BigData SV 2017
>> Announcer: Live from San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017. (upbeat synthesizer music) >> Live with the CUBE coverage of Big Data Silicon Valley or Big Data SV, #BigDataSV, in conjunction with Strata + Hadoop. I'm John Furrier with the CUBE and my co-host George Gilbert, analyst at Wikibon. I'm excited to have our next guest, Yaron Haviv, who's the founder and CTO of iguazio, just wrote a post up on SiliconANGLE, check it out. Welcome to the CUBE. >> Thanks, John. >> Great to see you. You're in a guest blog this week on SiliconANGLE, and always great on Twitter, 'cause Dave Vellante always likes to bring you into the contentious conversations. >> Yaron: I like the controversial ones, yes. (laughter) >> And you add a lot of good color on that. So let's just get right into it. So your company's doing some really innovative things. We were just talking before we came on camera here, about some of the amazing performance improvements you guys have on many different levels. But first take a step back, and let's talk about what this continuous analytics platform is, because it's unique, it's different, and it's got impact. Take a minute to explain. >> Sure, so first a few words on iguazio. We're developing a data platform which is unified, so basically it can ingest data through many different APIs, and it's more like a cloud service. It is for on-prem and edge locations and co-location, but it's managed more like a cloud platform, so very similar experience to Amazon. >> John: It's software? >> It's software. We do integrate a lot with hardware in order to achieve our performance, which is really about 10 to 100 times faster than what exists today. We've talked to a lot of customers, and what we really want to focus on with customers is solving business problems, because I think a lot of the Hadoop camp started with more solving IT problems. So IT is going kicking tires, and eventually failing, based on your statistics and Gartner statistics. So what we really wanted to solve is big business problems. We figured out that this notion of pipeline architecture, where you ingest data, and then curate it, and fix it, et cetera, which was very good for the early days of Hadoop, if you think about how Hadoop started, was page ranking from Google. There was no time sensitivity. You could take days to calculate it and recalibrate your search engine. Based on new research, everyone is now looking for real time insights. So there is sensory data from (mumbles), there's stock data from exchanges, there is fraud data from banks, and you need to act very quickly. So this notion of, and I can give you examples from customers, this notion of taking data, creating Parquet files and log files, and storing them in S3, and then taking Redshift and analyzing them, and then maybe a few hours later having an insight, this is not going to work. And what you need to fix is, you have to put some structure into the data. Because if you need to update a single record, you cannot just create a huge 10-gigabyte file and then analyze it. So what we did is, basically, a mechanism where you ingest data. As you ingest the data, you can run multiple different processes on the same thing. And you can also serve the data immediately, okay?
And two examples that we demonstrate here in the show, one is video surveillance, very nice movie-style example, where you, basically, ingest pictures through an S3 API, through an object API, you analyze the picture to detect faces, to detect scenery, to extract geolocation from pictures and all that, all those through different processes. TensorFlow doing one, serverless functions that we have do other, simpler tasks. And at the same time, you can have dashboards that just show everything. And you can have Spark, that basically does queries of, where was this guy last seen? Or who was he with, you know, or think about the Boston Bomber example. You could just do it in real time. Because you don't need this notion of pipeline. And this solves very hard business problems for some of the customers we work with. >> So that's the key innovation, there's no pipelining. And what's the secret sauce? >> So first, our system does about a couple of million transactions per second. And we are a multi-modal database. So, basically, you can ingest data as a stream, and exactly the same data could be read by Spark as a table. So you could, basically, issue a query on the same data, give me everything that has a certain pattern or something, and it could also be served immediately through RESTful APIs to a dashboard running AngularJS or something like that. So that's the secret sauce: by having this integration, and this unique data model, it allows all those things to work together. There are other aspects, like we have transactional semantics. One of the challenges is how do you make sure that a bunch of processes don't collide when they update the same data. So first you need very fine granularity, 'cause each one may update a different field. Like this example that I gave with GeoData, the serverless function that does the GeoData extraction only updates the GeoData fields within the records. And maybe TensorFlow updates information about the image in a different location in the record or, potentially, a different record. So you have to have that, along with transaction safety, along with security. We have very tight security at the field level, identity level. So that's re-thinking the entire architecture. And I think what many of the companies you'll see at the show, they'll say, okay, Hadoop is a given, let's build some sort of convenience tools around it, let's do some scripting, let's do automation. But the underlying thing, I won't use dirty words, is not well-equipped for the new challenges of real time. We basically restructured everything, we took the notions of cloud-native architectures, we took the notions of Flash and the latest Flash technologies, a lot of parallelism on CPUs. We didn't take anything for granted on the underlying architecture. >> So when you founded the company, take a personal story here. What was the itch you were scratching, why did you get into this? Obviously, you have a huge tech advantage, which is, we'll double-down with the research piece and George will have some questions. What got you going with the company? You got a unique approach, people would love to do away with the pipeline, that sounds great. And the performance, you said about 100x. So how did you get here? (laughs) Tell the story. >> So if you know my background, I ran all the data center activities at Mellanox, and you know Mellanox, I know Kevin was here. And my role was to take Mellanox technology, which is 100 gig networking and silicon, and fit it into the different applications.
So I worked with SAP HANA, I worked with Teradata, I worked on Oracle Exadata, I worked with all the cloud service providers on building their own object storage and NoSQL and other solutions. I also owned all the open source activities around Hadoop and Ceph and all those projects, and my role was to fix many of those. If a customer says, I don't need 100 gig, it's too fast for me, how do I? And my role was to convince him that yes, I can open up all the bottlenecks all the way up your stack so you can leverage those new technologies. And for that we basically saw the inefficiencies in those stacks. >> So you had a good purview of the marketplace. >> Yaron: Yes. >> You had open source on one hand, and then all the-- >> All the storage players, >> vendors, network. >> all the database players and all the cloud service providers were my customers. So you're at a very unique point where you see the trajectory of cloud, doing things totally different, and sometimes I see the trajectory of enterprise storage, SAN, NAS, you know, all Flash, all that, legacy technologies, where cloud providers are all about object, key value, NoSQL. And you're trying to convince those guys that maybe they were going the wrong way. But it's pretty hard. >> Are they going the wrong way? >> I think they are going the wrong way. Everyone, for example, is running to do NVMe over Fabric, now that's the new fashion. Okay, I did the first implementation of NVMe over Fabric, in my team at Mellanox. And I really loved it, at that time, but databases cannot run on top of storage area networks. Because there are serialization problems. Okay, if you use a storage area network, that means that every node in the cluster has to go and serialize an operation against the shared media. And that's not how Google and Amazon work. >> There's a lot more databases out there too, and a lot more data sources. You've got the Edge. >> Yeah, but all the new databases, all the modern databases, they basically shard the data across the different nodes so there are no serialization problems. So that's why Oracle doesn't scale, or scales to 10 nodes at best, with a lot of RDMA as a backplane, to allow that. And that's why Amazon can scale to a thousand nodes, or Google-- >> That's the horizontally-scalable piece that's happening. >> Yeah, because, basically, the distribution has to move into the higher layers of the data, and not the lower layers of the data. And that's really the trajectory away from where the traditional legacy storage and system vendors are going, and we sort of followed the way the cloud guys went, just with our knowledge of the infrastructure; we sort of did it better than what the cloud guys did. 'Cause the cloud guys focused more on the higher levels of the implementation, the algorithms, the Paxos, and all that. Their implementation is not that efficient. And we did both sides extremely efficiently. >> How about the Edge? 'Cause Edge is now part of cloud, and the cloud has got the compute, all the benefits, you were saying, and still they have their own consumption opportunities and challenges that everyone else does. But Edge is now exploding. The combination of those things coming together, at the intersection of that is deep learning, machine learning, which is powering the AI hype. So how is the Edge factoring into your plan and overall architectures for the cloud?
>> Yeah, so I wrote a bunch of posts that are not published yet about the Edge. But my analysis, along with your analysis and Peter Levine's analysis, is that cloud has to start distributing more. Because if you're looking at the trends: five gig, 5G Wi-Fi in wireless networking, is going to be gigabit traffic. Gigabit to the homes, you're going to buy it from Google, 70 bucks a month. It's going to push a lot more bandwidth to the Edge. At the same time, cloud providers, in order to lower costs and deal with energy problems, are going to rural areas. The traditional way we solved cloud problems was to put CDNs, so every time you download a picture or video, you go to a CDN. When you go to Netflix, you don't really go to Amazon, you go to a Netflix pop, one of 250 locations. The new workloads are different because they're no longer pictures that need to be cached. First, there is a lot of data going up: sensory data, uploaded files, et cetera. Data is becoming a lot more structured. Sensor data is structured. All this car information will be structured. And you want to (mumbles) digest or summarize the data. So you need technologies like machine learning and AI and all those things. You need something which is like CDNs, just a mini version of cloud that sits somewhere in between the Edge and the cloud. And this is our approach. And now, because we can shrink-wrap the mini cloud, the mini Amazon, in a way more dense approach, this is a play that we're going to take. We have a very good partnership with Equinix, which has 170-something locations, with very good relations. >> So you're, essentially, going to disrupt the CDN. It's something that I've been writing about and tweeting about. CDNs were based on the old Yahoo days. Caching images, you mentioned, give me 1999 back, please. That's old school by today's standards. So it's a whole new architecture because of how things are stored. >> You have to be a lot more distributed. >> What is the architecture? >> In our innovation, we have two layers of innovation. One is on the lower layers of... we, actually, have three main innovations. One is on the lower layers of what we discussed. The other one is the security layer, where we classify everything, Layer 7, at 100 gig traffic rates. And the third one is all this notion of a distributed system. We can, actually, run multiple systems in multiple locations and manage them as one logical entity through high-level semantics, high-level policies. >> Okay, so when we take the CUBE global, we're going to have you guys on every pop. This is a legit question. >> No, it's going to take time for us. We're not going to do everything in one day, and we're starting with the local problems. >> Yeah, but this is digital transmissions. Stay with me for a second. Stay with this scenario. So video like Netflix is, pretty much, one dimension, it's video. They use CDNs now, but when you start thinking in different content types. So, I'm going to have a video with, maybe, just CGI overlayed, or social graph data coming in from tweets at the same time with Instagram pictures. I might be accessing multiple data everywhere to watch a movie or something. That would require beyond-CDN thinking. >> And you have to run continuous analytics, because it cannot afford batch. It cannot afford a pipeline. Because you ingest picture data, you may need to add some subtext with the data and feed it, directly, to the consumer.
So you have to move to those two elements: moving more stuff into the Edge, and running continuous analytics versus a batch pipeline. >> So you think, based on that scenario I just said, that there's going to be an opportunity for somebody to take over the media landscape for sure? >> Yeah. And if you're also looking at the statistics, I saw a nice article, I told George about it, analyzing the Intel chip distribution. What you see is that there is 30% growth in Intel chips going into cloud, which is faster than what most analysts anticipate in terms of cloud growth. That means, actually, that cloud is going to cannibalize Enterprise faster than most think. Enterprise is shrinking about 7%. There is another place which is growing: it's Telcos. It's not growing like cloud, but part of it is because of this move towards the Edge, and the move of Telcos buying white boxes. >> And 5G and access over the top too. >> Yeah, but that's server chips. >> Okay. >> There's going to be more and more computation in the different Telco locations. >> John: Oh, you're talking about compute, okay. >> This is an opportunity that we can capitalize on if we run fast enough. >> It sounds as though, because you've implemented these industry-standard APIs that come, largely, from the open source ecosystem, you can propagate those to areas of the network that the vendors behind those APIs can't, necessarily, reach: into the Telcos, towards the Edge. And, I assume, part of that is because of the density and the simplicity. So, essentially, your footprint's smaller in terms of hardware, and the operational simplicity is greater. Is that a fair assessment? >> Yes. And also, we support a lot of Amazon-compatible APIs, which are RESTful, typically HTTP based, very convenient to work with in a cloud environment. Another thing is, because we're taking all the state on ourselves, the different forms of state, whether it's a message queue or a table or an object, et cetera, that makes the computation layer very simple. So one of the things we are also demonstrating is the integration we have with Kubernetes that, basically, now simplifies Kubernetes, 'cause you don't have to build all those different data services for cloud native infrastructure. You just run Kubernetes. We're the volume driver, we're the database, we're the message queues, we're everything underneath Kubernetes, and then you just run Spark or TensorFlow or a serverless function as a Kubernetes microservice. That allows you, elastically, to increase the number of Spark jobs that you need, or, maybe, you have another tenant, you just spin up a Spark job. YARN has some of those attributes, but YARN is very limited, very confined to the Hadoop ecosystem. TensorFlow is not a Hadoop player, and a bunch of those new tools are not Hadoop players, and everyone is now adopting a new way of doing streaming, and they just call it serverless. Serverless and streaming are very similar technologies. The advantage of serverless is all the pre-packaging and all the automation of CI/CD, the continuous integration and continuous delivery. So, in order to simplify the developer and operations aspects, we're trying to integrate more and more with the cloud native approach around CI/CD and integration with Kubernetes and cloud native technologies.
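To make the pattern Yaron describes concrete: a minimal sketch, using the official kubernetes Python client, of submitting an analytics workload as a plain Kubernetes Job, with the data services assumed to live underneath it. The image name, namespace, and data path are hypothetical, and this shows the generic compute-as-microservice pattern only, not iguazio's actual API.

```python
from kubernetes import client, config

# Assumes a reachable cluster; use load_incluster_config() when running inside one.
config.load_kube_config()

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="spark-tenant-b"),  # hypothetical job name
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="driver",
                        image="example.com/spark-app:latest",     # hypothetical image
                        args=["--input", "/shared-data/events"],  # path served by the data layer
                    )
                ],
            )
        )
    ),
)

# Another tenant's job is just another submission; nothing in the data layer is re-plumbed.
client.BatchV1Api().create_namespaced_job(namespace="analytics", body=job)
```

The point of the pattern is that elasticity becomes a scheduling concern: scaling to another tenant or another Spark job is one more submission, not new plumbing.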
>> Would it be fair to say that, from a developer or admin point of view, you're pushing out from the cloud towards the Edge faster than the existing implementations, say the Apache ecosystem or the AWS ecosystem, where AWS has something on the edge, I forget whether it's Snowball or Greengrass or whatever, where they at least get the lambda function? >> They've fielded those, by the way, and it's interesting to see. One thing is that they allowed lambda functions in their CDN, which is going in the direction I mentioned, just with minimal functionality. Another thing is they have those boxes with a single VM, and they can run lambda functions there as well. But I think their ability to run computation is very limited, and also, their focus is on shipping the boxes through the mail, and we want it to be always connected. >> Our final question for you, just to get your thoughts. Great stuff, by the way. This is very informative. Maybe we should do a follow-up on Skype in our studio for the Silicon Valley Friday show. Google Next was interesting. They're serious about the Enterprise, but you can see that they're not yet there. What is the Enterprise readiness, from your perspective? 'Cause Google has the tech, and they try to flaunt the tech. We're great, we're Google, look at us, therefore you should buy us. It's not that easy in the Enterprise. How would you size up the different players? Because they're not all like Amazon, although Amazon is winning. You've got Amazon, Azure and Google. Your thoughts on the cloud players. >> The way we attack the Enterprise, we don't attack it from an Enterprise perspective or an IT perspective, we take it from a business use case perspective, especially because we're small and we have to run fast. You need to identify a real, critical business problem. We're working with stock exchanges, and they have a lot of issues around monitoring the daily trade activities in real time. If you compare what we do with them, on this continuous analytics notion, to how they work with Excel and Hadoop, it's totally different, and now they can do things which are way different. I think that, to hook those customers, if Google wants to succeed against Amazon, they have to find a way to approach those business owners and say, here's a problem, Mr. Customer, here's a business challenge, here's what I'm going to solve. If they're just going to say, you know what, my VMs are cheaper than Amazon's, it's not going to be a-- >> Also, they're doing the whole lift and shift thing, which is code word for rip and replace in the Enterprise. So that's, essentially, I guess, a good opportunity if you can get people to do that, but not everyone's ripping and replacing and lifting and shifting. >> But Google has a lot of advantages around areas like AI, so they should try to leverage those. If you think about Amazon's approach to AI, they funded the university to build a project and then said it's ours, where Google created TensorFlow and created a lot of other IP, Dataflow and all those solutions, and contributed it to the community. I really love Google's approach of contributing Kubernetes, of contributing TensorFlow. This way, they're planting the seeds, so the new generation that's going to work with Kubernetes and TensorFlow is going to say, "You know what? Why would I mess with this thing on (mumbles), just go and--" >> Regular cloud, do multi-cloud. >> Right to the cloud. But I think a lot of the criticism about Google is that they're too research oriented.
They don't know how to monetize and approach the-- >> Enterprise is just a whole different drumbeat, and I think that's my only complaint with them: they've got to get that knowledge and/or buy companies. A quick final point on Spanner, or any analysis of Spanner, which went pretty quickly from paper to product. >> So before we started iguazio, I studied Spanner quite a bit. All the publications were out there, on Spanner and all the other things like it. Spanner has an underlying layer called Colossus, and our data layer is very similar to how Colossus works, so we're very familiar with it. We took a lot of concepts from Spanner for our platform. >> And you like Spanner, it's legit? >> Yes, again. >> 'Cause you copied it. (laughs) >> Yaron: We haven't copied-- >> You borrowed some best practices. >> I think I cited about 300 research papers before we did the architecture. But we, basically, took the best of each one of them, 'cause there are still a lot of issues. Most of those technologies, by the way, are designed for mechanical disks, and we can talk about that in a different-- >> And you have Flash. Alright, Yaron, we've gone over time here. Great segment. We're here, live in Silicon Valley, breakin' it down, getting under the hood, looking at 10X, 100X performance advantages. Keep an eye on iguazio, they're looking like they've got some great products. Check them out. This is the CUBE. I'm John Furrier with George Gilbert. We'll be back with more after this short break. (upbeat synthesizer music)
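One aside worth grounding from this segment is Yaron's serialization argument. Here is a toy sketch, using nothing but the Python standard library, of the shared-nothing sharding he credits the modern databases with: each key is owned by exactly one node, so no operation has to serialize against shared media. This is the generic idea only, not iguazio's implementation.

```python
# Toy shared-nothing cluster: each key lives on exactly one node,
# so a write touches one node and needs no cluster-wide lock.
NODES = 8
cluster = [dict() for _ in range(NODES)]

def owner(key: str) -> int:
    # Stable only within one process (str hashing is seeded per run);
    # a real system would use a consistent hash of the key bytes.
    return hash(key) % NODES

def put(key: str, value: object) -> None:
    cluster[owner(key)][key] = value  # no coordination with other nodes

def get(key: str):
    return cluster[owner(key)].get(key)

put("user:42", {"name": "John"})
print(get("user:42"), "owned by node", owner("user:42"))
```

Contrast this with a shared-media design, where every node would have to coordinate on the same underlying volume before writing; that coordination is the serialization Yaron says caps shared-storage clusters at around 10 nodes.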
Tendü Yogurtçu | BigData SV 2017
>> Announcer: Live from San Jose, California. It's The Cube, covering Big Data Silicon Valley 2017. (upbeat electronic music) >> Welcome back, everyone. We're in California, Silicon Valley, at the heart of the big data world. This is The Cube's coverage of Big Data Silicon Valley, in conjunction with Strata Hadoop. Of course, we've been here for multiple years, covering Hadoop World for now our eighth year; now it's Strata Hadoop, but we do our own events, Big Data SV here in Silicon Valley and Big Data NYC in New York City. I'm John Furrier, with my cohost George Gilbert, analyst at Wikibon. Our next guest is Tendü Yogurtçu with Syncsort, general manager of big data. Did I get that right? >> Yes, you got it right. It's always a pleasure to be at The Cube. >> (laughs) I love your name. That's so hard for me to get, but I think I was close enough there. Welcome back. >> Thank you. >> Great to see you. You know, one of the things I'm excited about with Syncsort is we've been following you guys, we talk to you guys every year, and it just seems that every year more and more announcements happen. You guys are unstoppable. You're like what Amazon does, just more and more announcements. But the theme seems to be integration. Give us the latest update. You had an update, you bought Trillium, you've got a big deal with Hortonworks, you got integrated with Spark, you've got big news here. What's the news this year? >> Sure. Thank you for having me. Yes, it's very exciting times at Syncsort, and I probably say that every time I appear, because every time it's more exciting than the previous one, which is great. We bought Trillium Software, and Trillium Software has been leading data quality for over a decade in many of the enterprises. It's very complementary to our data integration and data management portfolio, because we are helping our customers access all of their enterprise data, not just the new emerging sources in connected devices and mobile and streaming, but also leveraging reference data, the mainframe legacy systems, and the legacy enterprise data warehouse. While we are doing that, accessing data, the data lake is now actually, in some cases, turning into a data swamp. That was a term Dave Vellante used a couple of years back in one of the crowd chats, and it's becoming real. So, data-- >> Real being the data swamps: data lakes are turning into swamps because they're not being leveraged properly? >> Exactly, exactly. Because it's also about having access to the right data, and data quality is very complementary, because Trillium has delivered trusted, right data to enterprise customers in the traditional environments. So now we are looking forward to bringing that enterprise trust of data quality into the data lake. In terms of data integration, data integration has always been very critical to any organization. It's even more critical now that the data is shifting gravity, with the amount of data organizations have. What we have been delivering in very large enterprise production environments for the last three years, we are now hearing our competitors announce in those areas very recently, which is a validation, because we are already running in very large production environments.
We are offering value by saying, create your applications for integrating your data, whether it's originating in the cloud or originating on the mainframes, whether it's on the legacy data warehouse, and you can deploy the same exact application, without any recompilation, without any changes, on your standalone Windows laptop, or in Hadoop MapReduce, or on Spark in the cloud. This design once, deploy anywhere is becoming more and more critical, with data originating in many different places, and cloud is definitely one of them. Our data warehouse optimization solution with Hortonworks and AtScale is a special package to accelerate this adoption. It's basically helping organizations offload the workload from the existing Teradata or Netezza data warehouse and deploy it in Hadoop. We provide a single button to automatically map the metadata, create the metadata in Hive on Hadoop, and also make the data accessible in the new environment, and AtScale provides fast BI on top of that. >> Wow, that's amazing. I want to ask you a question, because this is a theme. I just did a tweetup while you were talking, saying the theme this year is cleaning up the data lakes, or data swamps, AKA data lakes. The other theme is integration. Can you just lay out your premise on how enterprises should be looking at integration now? Because it's the multi-vendor world, the multi-cloud world, the multi-data-type-and-source-with-metadata world. How do you advise customers that have this plethora of action coming at them? IoT, you've got cloud, you've got big data, I've got Hadoop here, I've got Spark over there. What's the integration formula? >> First thing is, identify your business use cases. What's your business challenge, what are your business goals and challenges, because that should be the real driver. We see some organizations start with the intention "we would like to create a data lake" without a very clear understanding of what it is they're trying to solve with that data lake. Data as a service is really becoming a theme across multiple organizations, whether it's on the enterprise side or at some of the online retail organizations, for example. As part of that data as a service, organizations really need to adopt tools that are going to enable them to take advantage of the technology stack. The technology stack is evolving very rapidly. The skill sets are rare, and skill sets are rare because you need to keep making adjustments. Am I hiring Ph.D. students who can program Scala in the most optimized way, or should I hire Java developers, or should I hire Python developers? The names of the tools in the stack, the Spark 1 versus Spark 2 APIs, change. It's really evolving very rapidly. >> It's hard to find Scala developers, I mean, once you go outside Silicon Valley. >> Exactly. So as an organization, our advice is that you really need to find tools that are going to fit those business use cases and provide a single software environment. That data integration might be happening on premise now, with some of the legacy enterprise data warehouse; it might happen in a hybrid, on-premise-and-cloud environment in the near future, and perhaps completely in the cloud. >> So standard tools, tools that have some standard software behind them, so you don't get stuck in the personnel hiring problem, some unique domain expertise that's hard to hire.
>> Yes, skill set is one problem. The second problem is the fact that the applications need to be recompiled, because the stack is evolving and the APIs are not compatible with the previous version. So there's the maintenance cost of keeping up with things, of being able to catch up with the new versions of the stack. That's another area where the tools really help, because you want to be able to develop the application once and deploy it anywhere, on any compute platform. >> So Tendü, if I hear you properly, what you're saying is integration sounds great on paper, it's important, but there are some hidden costs there, and that is the skill set, and then there's the stack recompiling. Just making sure. Okay, that's awesome. >> The tools help with that. >> Take a step back and zoom out, and talk about Syncsort's positioning, because you guys have been changing with the stacks as well. I mean, you guys have been doing very well with the announcements, you've been just coming onto the market all the time. What is the current value proposition for Syncsort today? >> The current value proposition is really that we help organizations create the next generation modern data architecture, by accessing and liberating all enterprise data and delivering that data at the right time and with the right quality. It's liberate, integrate, with integrity. That's our value proposition. How do we do that? We provide that single software environment. You can have batch, legacy data, and streaming data sources integrated in the same exact environment, and it enables you to adapt to Spark 2 or Flink or whichever compute framework is going to help you. That has been our value proposition, and it is proven in many production deployments. >> What's interesting too is the way you guys have approached the market. You've locked down the legacy. We talk about the mainframe, and well beyond that now: you guys know and understand the legacy, so you kind of lock that down, protect it, make it secure, security-wise, while making sure it works, because there's still data there, and legacy systems are really critical in the hybrid. >> Mainframe expertise, and the heritage that we have there, is a critical part of our offering. We will continue to focus on innovation on the mainframe side as well as on the distributed side. One of the announcements we made since our last conversation was a partnership with Compuware: we now bring in more data types about application failures, Abend-AID data, to Splunk for operational intelligence. We will continue to support more delivery types as well. We have batch delivery, we have streaming delivery, and now, because replication into Hadoop has been a challenge, our focus is on replication from DB2 on the mainframe and VSAM on the mainframe to Hadoop environments. That's what we will continue to focus on: mainframe, because we have heritage there, and it's also part of the big enterprise data lake. You cannot make sense of the customer data that you are getting from mobile if you don't reference the critical data sets that are on the mainframe. With the Trillium acquisition it's very exciting, because now we are at a kind of pivotal point in the market: we can bring the superior data validation, cleansing, and matching capabilities we have to the big data environments. One of the things-- >> So when you get into low latency, you guys do the whole low latency thing too? You bring it in fast?
>> Yes, we bring it in fast. That's our current value proposition, and as we are accessing this data and integrating it as part of the data lake, we now have capabilities with Trillium where we can profile that data, get statistics, and start using machine learning to automate the data steward's job. Data stewards are still spending 75% of their time trying to clean the data. So if we can-- >> A lot of manual labor there, and modeling too, by the way; the modeling and the cleaning, cleaning and modeling, kind of go hand in hand. >> Exactly. If we can automate any of these steps, derive the business rules automatically, and provide the right data in the data lake, that would be very valuable. This is what we are hearing from our customers as well. >> We've heard for probably five years about the data lake as the center of gravity of big data, but we're hearing at least a bifurcation, maybe more, where now we want to take that data and apply it, operationalize it in making decisions with machine learning, predictive analytics. But at the same time, we're trying to square this strange circle of data, the data lake where you didn't say up front what you wanted it to look like, but now we want ever richer metadata to make sense of it, a layer that you're putting on it, the data prep layer, and others are trying to put different metadata on top of it. What do you see that metadata layer looking like over the next three to five years? >> Governance is a very key topic, and for organizations who are ahead of the game in big data, who have already established that data lake, data governance and even analytics governance become important. What we are delivering here with Trillium, which we will have generally available by the end of Q1, is basically bringing the business rules to the data. Instead of bringing data to the business rules, we are taking the business rules and deploying them where the data exists. That will be key because of the data gravity you mentioned: the data might be in the Hadoop environment, it might be in, like I said, an enterprise data warehouse, and it might be originating in the cloud, and you don't want to move the data to the business rules. You want to move the business rules to where the data exists. Cloud is an area where we see more and more of our customers moving. The two main use cases around our integration are, one, that the data is originating in the cloud, and, second, archiving data to the cloud. And we actually announced tighter integration with Cloudera Director earlier this week, for this event. We have been in cloud deployments, and we have had an offering on Elastic MapReduce and on EC2 for a couple of years now, and also on Google Cloud Storage, but this announcement is primarily about making deployments even easier by leveraging Cloudera Director's elasticity for growing and shrinking the deployment. Now our customers' integration jobs will also take advantage of that elasticity. >> Tendü, it's great to have you on The Cube, because you have an engineering mind, but you're also now general manager of the business, and your business is changing.
You're in the center of the action, so I want to get your expertise and insight into the enterprise readiness concept. We saw last week at Google Cloud Next 2017, you know, Google going down the path of being enterprise ready, or taking steps. I don't think they're fully ready, but they're certainly serious about the cloud in the enterprise, and that's clear from Diane Greene, who knows the enterprise. It sparked the conversation last week around what enterprise readiness means for cloud players, because there are so many details between the lines, if you will: what the products are, the integration, certification, SLAs. What's your take on the notion of cloud readiness, vis-à-vis Google and others that are bringing cloud compute, a lot of resources, with an IoT market that's now booming, big data evolving very, very fast, lots of real time, lots of analytics, lots of innovation happening? What does the enterprise picture look like from a readiness standpoint? How do these guys get ready? >> From a big picture, for the enterprise there are a couple of things that cannot be an afterthought. Security, metadata lineage as part of data governance, and being able to have flexibility in the architecture, so that they will not be recreating the jobs they might have already deployed in on-premise environments, right? Being able to have the same application running from on premise to cloud will be critical, because it gives flexibility for adoption in the enterprise. An enterprise may have some MapReduce jobs running on premise and Spark jobs in the cloud, because they're really doing some predictive analytics, graph analytics on those. They want to have that flexible architecture; this is where we hear the concept of a hybrid environment. You don't want to be deploying a completely different product in the cloud and redoing your jobs. That flexibility of architecture, flexibility-- >> So having different code bases in the cloud versus on prem requires two jobs to do the same thing. >> Two jobs for maintaining, two jobs for standardizing, and two different skill sets of people, potentially. So security, governance, and being able to access data easily and have applications move between environments will be very critical. >> So seamless integration between clouds and on prem first, and then potentially multi-cloud. That's table stakes in your mind. >> They are absolutely table stakes. A lot of vendors are trying to focus on that; definitely the Hadoop vendors are also focusing on that. Also, when people talk about governance, the requirements are changing. We have been talking about single view and customer 360 for a while now, right? Do we have it right yet? Enrichment is becoming key. With Trillium, we made a recent announcement around Precise enrichment: it's not just the address that you want to validate and make sure is correct, it's also the email address and the phone number. Is it a mobile number, is it a landline? It's enriched data sets that we have to be dealing with, and there's a lot of opportunity. We are really excited, because data quality, discovery, and integration are coming together, and we have a good-- >> Well, Tendü, thank you for joining us, and congratulations as Syncsort broadens its scope to being a modern data platform solution provider for companies. Congratulations. >> Thank you. >> Thanks for coming. >> Thank you for having me.
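To ground the data quality thread of this conversation, here is a minimal sketch, in Python with pandas, of the profile-then-validate pattern Tendü describes: profile the data, apply a declarative rule where the data sits, and route the failures to a steward. The file names and the email rule are assumptions, and this illustrates the pattern only, not Trillium's product or API.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical extract from the lake

# Profile: the per-column statistics a steward (or a catalog) would surface.
profile = pd.DataFrame({
    "null_pct": df.isna().mean() * 100,  # percent missing per column
    "distinct": df.nunique(),            # cardinality per column
})
print(profile)

# One declarative rule, applied "where the data exists": a crude email
# shape check. In Tendü's description, machine learning would propose
# such rules from the profile, and the steward would only confirm them.
valid = df["email"].fillna("").str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
df[~valid].to_csv("steward_review_queue.csv", index=False)
print((~valid).sum(), "rows routed for review")
```

The interesting part of her description is not the rule itself but who writes it: the profiling step proposes candidate rules, and the human steward spends time confirming rather than cleaning.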
>> This is The Cube, here live in Silicon Valley, in San Jose. I'm John Furrier with George Gilbert. You're watching our coverage of Big Data Silicon Valley, in conjunction with Strata Hadoop. This is SiliconANGLE's The Cube. We'll be right back with more live coverage: we've got two days of wall-to-wall coverage with experts and pros talking about big data and the transformations happening, here inside The Cube. We'll be right back. (upbeat electronic music)
Amit Walia | BigData SV 2017
>> Announcer: Live from San Jose, California, it's the Cube, covering Big Data Silicon Valley 2017. (upbeat music) >> Hello and welcome to the Cube's special coverage of Big Data SV, Big Data in Silicon Valley, in conjunction with Strata + Hadoop. I'm John Furrier with George Gilbert, and with Mickey Bonn and Peter Burris as well. We'll be doing interviews all day today and tomorrow, here in Silicon Valley, in San Jose. Our next guest is Amit Walia, who's the Executive Vice President and Chief Product Officer of Informatica, kicking off day one of our coverage. Great to see you. Thanks for joining us on our kickoff. >> Good to be here with you, John. >> So obviously big data. This is like the eighth year of us covering what was once Hadoop World, now Strata + Hadoop and Big Data SV. We also do Big Data NYC with the Cube, and it's been an interesting transformation over the past eight years. This year has been really, really hot: you're starting to see big data get a clear line of sight on where it's going. So I want to get your thoughts, Amit, on where the marketplace is from your standpoint. Obviously Informatica's got a big place in the enterprise, and the real trends are in how enterprises are taking on analytics, specifically with the cloud. You've got AI looming, everything all buzzed up on AI; that's really taken hold, and people had to get their arms around it. And you see IoT: Intel announced an acquisition, $15 billion, for autonomous vehicles, which is essentially data. What's your view? >> Amit: Well, I think it's a great question. 10 years have happened since Hadoop started, right? I think what has happened, as we see it, is that today enterprises are trying to encapsulate what they call digital transformation. What does it mean? I mean, think about it: digital transformation for enterprises means three unique things. They're transforming their business models to serve their customers better; they're transforming their operational models for their own execution internally, if I'm a manufacturing or an execution-oriented company; and the third one is basically making sure that their offerings are also tailored to their customers. And in that context, if you think about it, it's all a data-driven world, because it's data that helps customers be more insightful, be more actionable, and be a lot more prepared for the future. And that covers the things that you said. Look, that's where Hadoop came into play with big data. But today, what organizations care about around big data is this: there's just a lot of data, right, so how do I bring actionable insights out of it? In that context, ML and AI are going to play a meaningful role, because, as you talk about IoT, IoT is the big game changer of big data becoming big, or huge data if I may, for a minute. So machine learning and AI, self-service analytics as a part of that, and the third one would be big data and Hadoop going to the cloud. That's going to happen very fast. >> John: And so the enterprises now are also transforming. This digital transformation, as you point out, is absolutely real, it's happening. And you start to see a lot more focus on the business models of companies, where it's not just analytics as an IT function. It's been talked about for a while, but now it's really more relevant, because you're starting to see impactful applications. >> Exactly. >> So with cloud and (chuckles) the new IoT stuff, you start to say, okay, apps matter. And so the data becomes super important.
How is that changing the enterprises' readiness in terms of how they're consuming cloud and data and whatnot? What's your view on that? Because you guys are deep in this. >> Amit: Yep. >> What's the enterprises' orientation these days? >> So, a slight nuance to that as an answer. I think what organizations have realized is that today, two things happened that never happened in the last 20 years: massive fragmentation of the persistence layer, where you see Hadoop itself fragmented the whole database layer, and massive fragmentation of the app layer. There are some 3,000 enterprise SaaS apps today. So just think about it, you're not restricted to one app. What customers and enterprises are realizing is that the data layer is where you need to organize yourself. You need to own the data layer; you cannot just be in the app layer and the database layer, because you've got to understand your data, and it could be anywhere and everywhere. The best example I give in the world of cloud is: you don't own anything, you rent it. So what do you own? You own the darn data. So in that context, enterprise readiness, as you came to it, becomes very important. Understanding and owning your data is the critical secret sauce. And that's where companies are getting disrupted: the new guys are leveraging data, which, by the way, the legacy companies had, but they couldn't figure it out. >> What is that? This is important. I want to just double-click on that, because you mentioned the data layer. What's the playbook? Because that's like the number one question that I get. >> Mm-hmm. >> On Cube interviews or off camera, it's: okay, I want to have a data strategy. Now that's empty as a statement, but what is the playbook? I mean, is it architecture? Because the data is the strategic advantage. >> Amit: Yes. >> What are they doing? What's the architecture? What are some of the things that enterprises do? Now, obviously they care about service level agreements, and having potentially multicloud, for instance, as a key thing. But what is that playbook for this data layer? >> That's a very good question, sir. Enterprise readiness has a couple of dimensions. One, as you said, is that there will be hybrid, and hybrid doesn't just mean ground and cloud, or multicloud. I mean, you're going to be in multiple SaaS apps, multiple platform apps, multiple databases in the cloud. So there is a hybrid world over there. Second is that organizations need to figure out a data platform of their own, because ultimately what they care about is: do I have a full view of my customer? Do I have a full view of the products that I'm selling and how they are servicing my customers? That can only happen if you have what I call a metadata-driven data platform. The third one is, boy oh boy, you talked about self-service analytics: you need to know the answers today. That means having analytics be more self-serve for the business user, not necessarily the IT user, and then leveraging AI to make all these things a lot more powerful. Otherwise, you're going to be spending, what, hours and hours doing statistical analysis, and you won't be able to get to it, given the scale and size of data models. And SLAs will play a big role in the world of cloud. >> Just to follow up on that, it sounds like you've got the self-service analytics to help essentially explore and visualize. >> Amit: Mm-hmm. >> You've got the data governance and cataloging and lineage to make sure it's high quality and navigable, and then you want to operationalize it once you've built the models.
But there's this tension between, on one hand, what made the data lake great, which was just dump it all in there so we have one central place, and on the other hand all the governance stuff on top of that, which is sort of, well, we've got to organize it anyway. >> Yeah. >> How do you resolve that tension? >> That is a very good question, and that's where enterprises kind of woke up. A good example I'll give you: everybody wanted to make a data lake. I mean, if you remember, two years ago, 80% of data lakes fell apart, and the reason was the fact that you just said: people made the data lake a data swamp, if I may. Just dump a lot of data into my Hadoop cluster, and life will be great. But the thing is, and what customers at large enterprises realized, is that they became system integrators of their own: I've got to bring the data, catalog it, prepare it, surface it. So the belief of customers now is: I need a place to go where, basically, I can easily bring in all the data, with a metadata-driven catalog, so I can use AI and ML to surface that data, so it's very easy at the preparation layer for my analysts to go and play with data, and then I can visualize anything. But it's all integrated out of the box. When each layer, each component has to be self-integrated, it falls apart very quickly the moment you want to, to your question, operationalize it at an enterprise level. Large enterprises care about two things: is it operationalizable, and is it scalable? That's where this could fall apart, and that's where our belief is. And that's where governance happens behind the scenes. You're not doing anything: security of your data, governance of your data, is driven through the catalog. You don't even feel it. It's there. >> I never liked the data lakes term. Dave Vellante knows I've always been kind of against it, even from day one, 'cause data's more fluid; I call it a data ocean. But to your point, I want to get on that point, because I think data lakes is one dimension, right? >> Yeah. >> And we talked about this at Informatica World, last year I think. And this year it's May 15th. >> Yes. >> I think your event is coming up, but you guys introduced metadata intelligence. >> Yep. >> So the old model was: throw it in, centralized, do some data governance, data management, fence it off, make some queries, get some reports. I'm oversimplifying, but it was like a side function. What you're getting at now is making that data valuable. >> Amit: Yep. >> So if it's in a lake or it's stored, you never know when the data's going to be relevant, so you have to have it addressable. Could you just talk about where this metadata intelligence is going? Because you mentioned machine learning and AI, and this seems to be what everyone is talking about. In real time, how do I make the data really valuable when I need it? And what's the secret sauce that you guys have, specifically, to make that happen? >> So, to contextualize that question, think about it. What you don't want to do is keep making everything manual. Our belief is that the intelligence around data has to be at the metadata level, right? Across the enterprise, which is why, when we invested in the catalog, I used the phrase, "It's the Google of data for the enterprise." There was no place in an enterprise where you could go search for all your data. And given the fast, rapidly changing sources of data, think about IoT, as you talked about, John, or think about your customer data, which for you and me may come from a new source tomorrow.
Do you want the analyst to figure out where the data is coming from? Or machine learning and AI to contextualize it and tell you: you know what, I just discovered a great new source for where John is going to go shop; do you want to put that into the analytics to give him an offer? That's where the organizing principle for data sits. The catalog and all the metadata are where ML and AI will converge to give the analyst self-discovery of data sets: recommendations like in an Amazon environment, recommendations like Facebook, finding other people or other common data, like a Facebook or a LinkedIn would. That is where everything is going, and that's why we are putting all our efforts into AI. >> So you're saying you want to abstract away the complexity of where the data sits, so that the analyst or app can interface with that? >> That's exactly right. Because, to me, those are the areas that are changing so rapidly; let that be. You can pick whatever data sets you want, you can pick whichever app you want to use, wherever you want to go, or wherever your business wants to go. You can pick whichever analytical tool you like, but you want to be able to take all of those tools and still be able to figure out what data is there, and that can change all the time. >> I'm trying to ask you a lot while you're here. What's going to be the theme this year at Informatica World? How do you take it to the next level? Can you just give us a teaser of what we might expect this year? 'Cause this seems to be the hottest trend. >> So first, at Informatica World this year, we will be unveiling our whole new strategy, branding, and messaging; there's a whole amount of push on that one. But two things will be focused on a lot. One is around that intelligent data platform, which is basically what I'm talking about: the organizing principle of every enterprise for the next decade. And within that, where AI is going to play a meaningful role for people to spring forward, discover things, self-serve, and be able to make sense of the mountains of data that are going to sit around us, which otherwise we won't even know what to do with. >> All right, so what do you guys have in the product? I just want to drill into this dynamic you just mentioned, which is new data sources. With IoT, this is going to get even more complex. You never know what data's going to be coming off the cars, the wearables, the smart cities. You have all these new killer use cases that are going to be transformational. How do you guys handle that, and what's the secret sauce? 'Cause that seems to be the big challenge: okay, I'm used to dealing with data and its structure, whether it's schemas; now we've got unstructured. So okay, now I've got new data coming in very fast, I don't even know when or where it's going to come in, so I have to be ready for this new data. What is the Informatica solution there? >> So, in terms of taking data from any source, that's never been a challenge for us, because for Informatica, one of the bread-and-butter things is that we connect and bring data from any potential source on the planet. That's what we do. >> John: And you automate that? >> We automate that process. So any potential new source of data, whether it's IoT, unstructured, semi-structured, logs, we connect to it. What I think is key, and where we are heavily invested, is once you've brought all that in. By the way, you can use Kafka queues for that, you can use Spark Streaming; all of that stuff you could do.
The question is, how do you make sense out of it? I can get all the data, dump it in a Kafka queue, and then take it to do some processing on Spark. But the intelligence is where all the Informatica secret sauce is, right? The metadata, the transformations, that's what we are invested in; but in terms of connecting anything to everything, that we do for a living. We have done that for a quarter of a century, and we keep doing it. >> I mean, I love having a chat with you, Amit. You're a product guy, and we love product guys, 'cause they can give us a little teaser on the roadmap. But I've got to ask you the question: with all this automation, you know, the big buzz out in the world is, "Oh, machine learning and AI are replacing jobs." So where is the shift going to be? Because you can almost connect the dots and say, okay, you're going to put some people out of work, some developer, some automation, maybe the systems management layer or wherever. Where are those jobs shifting to? Because you could almost say, okay, if you're going to abstract away and automate, who loses their job? Who gets shifted, and what are the new opportunities? Because you could almost say that if you automate, that should create a new developer class. So one gets replaced, one gets created, possibly. Your thoughts on this personnel transformation? >> Yeah, I think what we see is that value creation will change. So the jobs will go to the new value, the new areas where value is created. A great example of that is, look at developers today, right? Absolutely, I think they did a terrific job in making sure that the Hadoop ecosystem got legitimized. But in my opinion, when enterprise scalability comes, enterprises don't want lots of different things to be integrated and just plumbed together. They want things to work out of the box, which is why, you know, software works for them. But what happens is that they want that development community to go work on what I call the value-added areas of the stack. So think about it: in connected car, they're working with lots of customers on the connected car issue, right? They don't want developers to work on the plumbing. They want us to give that out of the box, because SLAs, operational scale, and enterprise scalability matter; but in terms of the top-layer analytics, making sure we can make sense out of it, that's where they want innovation. So what you will see is that, I don't think the jobs will go up in vapor, but I do think the jobs will get migrated to a different part of the stack, which today they have not been. But that's, you know, we live in Silicon Valley, that's a natural evolution we see, so I think that will happen. In the larger industry in general, again I'd say, look, driverless cars, I don't think they've driven away jobs. What they've done is created a new class of people who work. So I do think that will be a big change. >> Yeah, there's a fallacy there. I mean, the ATM argument was ATMs are going to replace tellers, yet more branches opened up. >> That's exactly it. >> So therefore creating new jobs. I want to get to a quick question, I know George has a question, but I want to get on the cost of ownership, because one of the things that's been criticized in some of these emerging areas, like Hadoop and OpenStack, for instance, just to pick two random examples: it's great, looks good, you know, all peace and love.
An industry's being created, legitimized, but the cost of ownership has been critical to getting that done. It's been expensive: talent, finding talent, and deploying it was hard. We heard that on the Cube many times. How does the cost of ownership equation change as developers and businesses go after these more value-creating activities in the stack? >> See, look, I always say there is no free lunch. Nothing is free. And customers realize that with open source: to your point, when enterprises want to completely scale out and create an end-to-end operational infrastructure, open source ends up being pretty expensive, for all the reasons, right, because you throw in a lot of developers, and it's not necessarily scalable. So what we're seeing right now is that enterprises, once they've figured out that this works for them, when they want to go scale it out, they want to go back to what I call a software provider, who has the scale, who has the supportability, who also has the ability to react to changes, and who can give them the comfort that it will work. So to me, that's where they find it cheaper. Just building it, experimenting with it, is cheaper over here, but scaling it out is cheaper with a software provider. So we see a lot of our customers start with a little bit of experimenting, developers downloading something, works great, but would I really want to take it across a Nordstrom or a JP Morgan or a Morgan Stanley? I need security, I need scalability, I need somebody to call; at that point, those equations become very important. >> And that's where the out-of-box experience comes in, where you have the automation, that kind of thing. >> Exactly. >> Does that ease up some of the cost of ownership? >> Exactly, and the talent is a big issue, right? See, we live in Silicon Valley, and, by the way, even in Silicon Valley hiring talent is hard. Just think about it: if you go to Kansas City, hiring a Scala developer, that's a rare breed. When I go around the globe and talk to customers, they don't see the talent at all that we here just somehow take for granted. They don't, so it's hard for them to put their energy behind it. >> Let me ask more on the metadata layer. There's an analogy that's come up from the IIoT world, where they're building these digital twins, and it's not just GE; IBM's talking about it, and actually we've seen more and more vendors where the digital twin is a digital representation of some physical object. But you could think of it as metadata, you know, for a physical object, and it gets richer over time. So my question is: metadata in the old data warehouse world was, we want one representation of the customer. But now there's a customer representation for a prospect, and one for an account, and one for, you know, in-warranty, and one for field service. How does that change what you offer? >> That's a very, very good question, because that's where the metadata becomes so much more important: its manifestation is changing. I'll give you a great example. Take Transamerica. Transamerica is a customer of ours leveraging big data at scale, and what they're doing, to your question, is this: they have existing customers who have insurance through them, but they're also looking for white space analysis, who could be potential opportunities. Two distinct ones, and within that, they're looking at relationships.
I know you, John: you have Transamerica; could you be an influencer for me? Or within your family, your extended family: I'm a friend, but what about a family member that you've declared out there on social media? So they are doing all that stuff in the context of a data lake. How are they doing it? In that context, think about the complexity of the job. Pumping data into a lake won't solve it for them, but that's a necessary first step. The second step is where all of that metadata, through ML and AI, starts giving them that relationship graph. To say: you know what, John himself has this white space opportunity for you, but John is related to me in one way, he and I are connected on Facebook; John's related to you a little bit differently, he has a stronger bond with you; and within his family, he has different strong bonds. So that's John's relationship graph. Leverage him, if he has been a good customer of yours. All of that stuff is now at the metadata level, not just monolithic metadata: a relationship graph, his relationship graph, along with what he has bought from you. So you can see that discovery becomes a very important element. Do you want to do that in different places? You want to do that in one place. I may be in a cloud environment, I may be on prem, and that's where, when I say that metadata becomes the organizing principle, that's where it becomes real. >> Just a quick follow-up on that, then. It doesn't seem obvious that every end customer of yours, not the consumer but the buyer of the software, would have enough data to start building that graph. >> To me, what happened was that the term big data, I thought, got massively abused. A lot of Hadoop customers are not necessarily big data customers. I know a lot of banking customers, enterprise banking, whose data volumes will surprise you, but they're using Hadoop. What they want is intelligence. That's why I keep saying that with the metadata part, they are more interested in a deeper understanding of the data. A great example: I had a customer, a big bank, with a high net worth customer whose will listed his daughter. When the daughter went off to school and, by the way, went to the bank branch in that city, she had no idea. She walked up, she basically wanted to open an account, three more friends in the line. The manager comes out, because at that point the teller saw: this is somebody you should take special care of. Boom, she goes into a special cabin; the other friends are standing in line. Think of the customer service perception. You just created a millennial customer, right? That's important. >> Well, this brings up an interesting comment. The whole graph thing we love, but this brings back the neural network trend, which is a concept that's been around for a long, long time, but now it's front and center. I remember talking to Diane Greene, who runs Google Cloud; she was saying that you couldn't hire neural network people, they couldn't get jobs 15 years ago. Now you can't hire enough of them. So that brings up the ML conversation. So I want to take that to a question and ask about the data lake, 'cause you guys have announced a new cloud data lake.
>> Yeah, so, the data lake. If you remember, last year, actually at Strata San Jose, we chatted, and we had announced the data lake, because we realized customers, to your point, John, as you said, were struggling with how to even build a data lake; they were all over the place, and they were failing. We announced the first data lake there, and then at Strata New York, basically, we brought the metadata ML part to the data lake. And now, obviously, we're taking it to the cloud. What we see in the world of data lakes is that customers ask for three things. First, they want the prebuilt, integrated solution: data can come in, but I want the intelligence of metadata, and I want data preparation baked in. I don't want three different tools that I have to go around with; I want it out of the box. But we also saw that as our customers become successful, they want to scale up, scale down, and cloud is just a great place to go. You can basically put a data lake out there, and, by the way, in the context of data, a lot of the new data sources are in the cloud, so it's easy for them to scale in and out in the cloud, experiment there, and all that stuff. Also, you know, Amazon: we supported Amazon Kinesis, and all of these new sources and technologies in the world of cloud are allowing experimentation in the data lake, and that allowed our customers to basically get ahead of the curve very quickly. So in some ways, cloud allowed customers to do things a lot faster, better, and cheaper. That's what we basically put in the hands of our customers. Now that they're feeling comfortable that they can do a secured and governed data lake without giving up self-service, they want to put it in the cloud and be a lot faster and cheaper about it. >> John: And more analytics on it. >> More analytics. And now, because our ML, our AI, the metadata part, connects cloud, ground, everything, they have an organizing principle: whatever they put wherever, they can still get intelligence out of it.
We put ML in that picture, and obviously AI has since moved to the forefront, so that was the bet on the data platform. The second bet was that, within that platform, everything would be AI- and ML-driven meta-data intelligence. And the third one is, we bet big on cloud. Big data we had already bet big on, by the way. >> John: You were already there. >> We knew big data would move to the cloud far more rapidly than the old technology did. We saw that coming. We saw the (mumbles) wave coming. We worked closely with AWS and the Azure team, and now with Google as well. So we saw three things, and that's what we bet on. And you can see the rich offerings we have, the rich partnerships we have, and the rich set of customers that are live on those platforms. >> And the market's right on your doorstep. I mean, AI is hot, ML is hot, and you're seeing all of it converge with IoT. >> So those were, I think, forward-looking bets that paid off for us. (chuckles) But there's so much more to do, and so much more upside for all of us right now. >> A lot more work to do. Amit, thank you for coming on and sharing your insight. Again, you guys got into good pole position in the market, and the market is right on your doorstep, so congratulations. This is theCUBE; I'm John Furrier with George Gilbert, with more coverage in Silicon Valley for Big Data SV and Strata + Hadoop after this short break.
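For readers who want to make the relationship-graph idea from this conversation concrete, here is a minimal sketch in Python. It illustrates the general pattern only, not Informatica's product or API; the class, the people, the relation types, and the strength scores are all hypothetical.

```python
# Hypothetical sketch of a meta-data relationship graph: people are nodes,
# declared relationships are weighted edges, and purchase history is
# attached as per-person meta-data. Not Informatica's implementation.

from collections import defaultdict

class RelationshipGraph:
    def __init__(self):
        # adjacency list: person -> list of (other_person, relation, strength)
        self.edges = defaultdict(list)
        # per-person meta-data, e.g. what they have bought from you
        self.metadata = defaultdict(dict)

    def add_relation(self, a, b, relation, strength):
        # relationships are treated as symmetric in this sketch
        self.edges[a].append((b, relation, strength))
        self.edges[b].append((a, relation, strength))

    def influencers_for(self, person, min_strength=0.5):
        """Return connections strong enough to act as influencers."""
        return [
            (other, relation, strength)
            for other, relation, strength in self.edges[person]
            if strength >= min_strength
        ]

graph = RelationshipGraph()
graph.metadata["john"] = {"products": ["checking", "mortgage"], "tenure_years": 12}
graph.add_relation("john", "amit", "facebook", strength=0.4)    # weaker bond
graph.add_relation("john", "george", "colleague", strength=0.7)
graph.add_relation("john", "jane", "family", strength=0.9)      # strongest bond

# Discovery query: who could influence John on a white-space offer?
for other, relation, strength in graph.influencers_for("john"):
    print(f"{other} ({relation}, strength {strength}) could influence john")
```

The point of the sketch is the discovery step: once relationships and purchase meta-data sit in one graph, finding the strongest influencers for a white-space opportunity is a single traversal rather than a join across monolithic silos.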
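Similarly, the cloud ingestion pattern described above, streaming sources such as Amazon Kinesis landing in a cloud data lake, can be sketched with standard AWS tooling. The boto3 Kinesis and S3 calls below are real, but the stream name, bucket, and key layout are hypothetical, and checkpointing and error handling are omitted; this shows the pattern, not Informatica's product.

```python
# Hedged sketch: read a batch of events from one Kinesis shard and land the
# raw records in S3, the storage layer a cloud data lake typically sits on.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")
s3 = boto3.client("s3", region_name="us-west-2")

STREAM = "customer-events"       # hypothetical stream name
BUCKET = "acme-data-lake-raw"    # hypothetical landing bucket

# Start reading the first shard from the oldest available record.
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in batch["Records"]:
    event = json.loads(record["Data"])  # Kinesis delivers Data as bytes
    # Land each raw event keyed by sequence number, so downstream
    # meta-data/ML layers can catalog and relate it later.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"raw/{STREAM}/{record['SequenceNumber']}.json",
        Body=json.dumps(event).encode("utf-8"),
    )
```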