Wikibon Big Data Market Update Pt. 2 - Spark Summit East 2017 - #SparkSummit - #theCUBE


 

(lively music) >> [Announcer] Live from Boston, Massachusetts, this is theCUBE, covering Spark Summit East 2017. Brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to Spark Summit in Boston, everybody. This is theCUBE, the worldwide leader in live tech coverage. We've been here two days, wall-to-wall coverage of Spark Summit. George Gilbert, my cohost this week, and I are going to review part two of the Wikibon Big Data Forecast. Now, it's very preliminary. We're only going to show you a small subset of what we're doing here. And so, well, let me just set it up. So, these are preliminary estimates, and we're going to look at different ways to triangulate the market. So, at Wikibon, what we try to do is focus on disruptive markets, and try to forecast those over the long term. What we try to do is identify where the traditional market research estimates really, we feel, might be missing some of the big trends. So, we're trying to figure out, what's the impact, for example, of real time? And, what's the impact of this new workload that we've been talking about around continuous streaming? So, we're beginning to put together ways to triangulate that, and we're going to show you, give you a glimpse today of what we're doing. So, if you bring up the first slide, we showed this yesterday in part one. This is our last year's big data forecast. And, what we're going to do today, is we're going to focus in on that line, that S-curve. That really represents the real-time component of the market. Spark would be in there. Streaming analytics would be in there. Add some color to that, George, if you would. >> [George] Okay, for 60 years, since the dawn of computing, we had two ways of interacting with computers. You put your punch cards in, or whatever else, and you come back and you get your answer later. That's batch. Then, starting in the early '60s, we had interactive, where you're at a terminal. And then, the big revolution in the '80s was you had a PC, but you still were either interactive with a terminal or batch, typically for reporting and things like that. What's happening is the rise of a new interaction mode, which is continuous processing. Streaming is one way of looking at it, but it might be more effective to call it continuous processing, because you're not going to get rid of batch or interactive, but your apps are going to have a little of each. So, what we're trying to do, since this is early, early in its life cycle, we're going to try and look at that streaming component from a couple of different angles. >> Okay, as I say, that's represented by this ogive curve, or the S-curve. On the next slide, we're at the beginning when you think about these continuous workloads. We're at the early part of that S-curve, and of course, most of you, or many of you, know how the S-curve works. It's slow, slow, slow. For a lot of effort, you don't get much in return. Then you hit the steep part of that S-curve. And that's really when things start to take off. So, the challenge is, things are complex right now. That's really what this slide shows. And Spark is designed, really, to reduce some of that complexity. We've heard a lot about that, but take us through this. Look at this data flow from ingest, to explore, to process, to serve. We talked a lot about that yesterday, but this underscores the complexity in the marketplace. 
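To make the batch-versus-continuous distinction George draws concrete, here is a minimal, hypothetical PySpark sketch: the same word-count logic written once as a one-shot batch job and once as a continuous streaming job whose results keep updating as data arrives. The input path, host, and port are placeholder assumptions, not anything shown at the event.

```python
# A minimal sketch, assuming Spark 2.x+ with Structured Streaming available.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("batch-vs-continuous").getOrCreate()

# Batch: read a finite input, compute once, write the answer, and stop.
batch_counts = (spark.read.text("/data/logs/day1.txt")      # assumed path
                .select(explode(split("value", " ")).alias("word"))
                .groupBy("word").count())
batch_counts.write.mode("overwrite").parquet("/data/word_counts")

# Continuous: the identical logic over an unbounded socket stream; the
# aggregation keeps updating instead of being computed once.
stream_counts = (spark.readStream.format("socket")
                 .option("host", "localhost")               # assumed source
                 .option("port", 9999)
                 .load()
                 .select(explode(split("value", " ")).alias("word"))
                 .groupBy("word").count())

query = (stream_counts.writeStream
         .outputMode("complete")   # re-emit the full updated result table
         .format("console")
         .start())
query.awaitTermination()
```

The point is the one George makes: the logic barely changes, but the interaction mode does; the answer is maintained continuously rather than delivered once.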
>> [George] Right, and while we're just looking mostly at numbers today, the point of the forecast is to estimate when the barriers, representing complexities, start to fall. And then, when we can put all these pieces together, ingest, explore, process, serve. When that becomes an end-to-end pipeline. When you can start taking the data in on one end, get a scientist to turn it into a model, inject it into an application, and that process becomes automated. That's when it's mature enough for the knee in the curve to start. >> And that's when we think the market's going to explode. But now, how do you bound this? Okay, when we do forecasts, we always try to bound things. Because if they're not bounded, then you get no foundation. So, if you look at the next slide, we're trying to get a sense of real-time analytics. How big can it actually get? That's what this slide is really trying to-- >> [George] So this one was one firm's take on real-time analytics, where by 2027, they see it peaking just under-- >> [Dave] When you say one firm, you mean somebody from the technology district? >> [George] Publicly available data. And since they didn't have a lot of assumptions published, we took it as, okay, one data point. And then, we're going to come at it with some bottoms-up and top-down data points, and compare. >> [Dave] Okay, so the next slide, we want to drill into the DBMS market, and when you think about DBMS, you think about the traditional RDBMSs that we know, the Oracle, SQL Server, IBM DB2s, etc. And then, you have these emergent NewSQL and NoSQL entrants, which are, obviously, we talked today to a number of folks. The number of suppliers is exploding. The revenue's still relatively small. Certainly small relative to the RDBMS marketplace. But, take us through what your expectations are here, and what some of the assumptions are behind this. >> [George] Okay, so the first thing to understand is the DBMS market, overall, is about $40 billion, of which 30 billion goes to online transaction processing supporting real operational apps. 10 billion goes to OLAP, or business intelligence type stuff. The OLAP one is shrinking materially. The online transaction processing one, new sales are shrinking materially, but there's a huge maintenance stream. >> [Dave] Yeah, which companies like Oracle and IBM and Microsoft are living off of, trying to fund new development. >> We modeled that declining gently and beginning to accelerate more going out into the latter years of the ten-year period. >> What's driving that decline? Obviously, you've got the big sucking sound of Hadoop, in part, driving that. But really, increasingly it's people shifting their resources to some of these new emergent applications and workloads, and new types of databases to support them, right? But these are still, those new databases, you can see here, the NewSQL and NoSQL, still relatively small. A lot of it's open source. But then it starts to take off. What's your assumption there? >> So here, what's going on is, if you look at dollars today, it's, actually, interesting. If you take the NoSQL databases, you take DynamoDB, you take Cassandra, Hadoop, HBase, Couchbase, Mongo, Kudu, and you add all those up, it's about, with DynamoDB, it's, probably, about $1.55 billion out of a $40 billion market today. >> [Dave] Okay, but it's starting to get meaningful. We're approaching two billion. >> But where it's meaningful is the unit share. If that were translated into Oracle pricing. 
The market would be much, much bigger. So the point is-- >> Ten X? >> At least, at least. >> Okay, so in terms of work being done. If there's a measure of work being done. >> [George] We're looking at dollars here. >> Operations per second, etcetera, it would be enormous. >> Yes, but that's reflective of the fact that the data volumes are exploding but the prices are dropping precipitously. >> So do you have a metric to demonstrate that? We're, obviously, not going to show it today, but. >> [George] Yes. >> Okay great, so-- >> On the business intelligence side, without naming names, the data warehouse appliance vendors are charging anywhere from $25,000 per terabyte up to, when you include running costs, as high as $100,000 a terabyte, that their customers are estimating. That's not the selling cost, but that's the cost of ownership per terabyte. Whereas, if you look at, let's say, Hadoop, which is comparable for off-loading some of the data warehouse workloads, that's down to the $5K per terabyte range. >> Okay great, so you expect that these platforms will have a bigger and bigger impact? What's your pricing assumption? Are prices going to go up, or is it just volumes going to go through the roof? >> I'm, actually, expecting pricing... it's difficult, because we're going to add more and more functionality. Volumes go up, and if you add sufficient functionality, you can maintain pricing. But as volumes go up, typically, prices go down. So it's a matter of how much do these NoSQL and NewSQL databases add in terms of functionality, and I distinguish between them because NewSQL databases are scaled-out versions of Oracle or Teradata, but they are based on the more open source pricing model. >> Okay, and NoSQL, don't forget, stands for "not only SQL," not "no SQL." >> If you look at the slides, big existing markets never fall off a cliff when they're in decline. They just slowly fade. And, eventually, that accelerates. But what's interesting here is, the data volumes could explode, but the revenue associated with the NoSQL, which is the dark gray, and the NewSQL, which is the blue, those don't explode. You could take, what's the DBMS cost of supporting YouTube? It would be in the many, many, many billions of dollars. It would support half of an Oracle by itself, probably. But it's all open source there, so. >> Right, so that's minimizing the opportunity is what you're saying? >> Right. >> You can see the database market is flat, certainly flattish and even declining, but you do expect some growth in the out years as part of that equation, that volume, presumably-- >> And that's the next slide, which is where we see that growth coming from. >> Okay, so let's talk about that. So the next slide, again, I should have set this up better. The vertical axis here is worldwide dollars, and the horizontal axis is time. And we're talking here about these continuous application workloads. This new workload that you talked about earlier. So take us through the three. >> [George] There's three types of workloads that, in large part, are going to be driving most of this revenue. Now, these aren't completely comparable to the DBMS market, because some of these don't use traditional databases. Or if they do, they're Torry databases, and I'll explain that. >> [Dave] Sure, but if I look at the IoT Edge and the Cloud, and the microservices and streaming, that's a tailwind to the database forecast in the previous slide, is that right? 
>> [George] It's, actually, interesting, but the application and infrastructure telemetry, this is what Splunk pioneered, which is all the torrents of data coming out of your data center and your applications, and you're trying to manage what's going on. That is a database application. And we know Splunk, for 2016, was 400 million in software revenue; Hadoop was 750 million. And the various other management vendors, New Relic, AppDynamics, startups, and 5% of Azure and AWS revenue. If you add all that up, it comes out to $1.7 billion for 2016. And so, we can put a growth rate on that. And we talked to several vendors to say, okay, how much will that workload be compared to IoT Edge Cloud? And the IoT Edge Cloud is the smart devices at the Edge, and the analytics are in the fog, but not counting the database revenue up in the Cloud. So it's everything surrounding the Cloud. And that, actually, if you look out five years, that's, maybe, 20% larger than the app and infrastructure telemetry, but growing much, much faster. Then, the third one is where you were asking whether this is a tailwind to the database. Microservices and streaming are very different ways of building applications from what we do now. Now, people build their logic for the application, and everyone then stores their data in this centralized external database. In microservices, you build a little piece of the app, and whatever data you need, you store within that little piece of the app. And so the database requirements are, rather, primitive. And so that piece will not drive a lot of database revenue. >> So if you could go back to the previous slide, Patrick. What's driving database growth in the out years? Why wouldn't database continue to get eaten away and decline? >> [George] In broad terms, the overall database market is staying flat, because prices collapse but the data volumes go up. >> [Dave] But there's an assumption in here that the NoSQL space, actually, grows in the out years. What's driving that growth? >> [George] Both the NoSQL and the NewSQL. The NoSQL, probably, is best suited to capturing the IoT data, because you don't need lots of fancy query capabilities or concurrency. >> [Dave] So it is a tailwind in a sense in that-- >> [George] IoT, but that's different. >> [Dave] Yeah, sure, but you've got the overall market growing. And that's because the new stuff, NewSQL and NoSQL, is growing faster than the decline of the old stuff. Although, in the 2020 to 2022 time frame, it's not enough to offset that decline. And then you have it start growing again. You're saying that's going to be driven by IoT and other Edge use cases? >> Yes, IoT Edge, and the NewSQL, actually, is where, when they mature, you start to substitute them for the traditional operational apps. For people who want to write database apps, not microservice-based apps. >> Okay, alright, good. Thank you, George, for setting it up for us. Now, we're going to be at Big Data SV in mid-March? Is that right? Middle of March. And George is going to be releasing the actual final forecast there. We do it every year. We use Spark Summit to look at our preliminary numbers, some of the Spark-related forecasts like continuous workloads. And then we harden those forecasts going into Big Data SV. We publish our big data report like we've done for the past five, six, seven years. So check us out at Big Data SV. We do that in conjunction with the Strata events. So we'll be there again this year at the Fairmont Hotel. 
We've got a bunch of stuff going on all week there. Some really good programs going on. So check out siliconangle.tv for all that action. Check out Wikibon.com. Look for new research coming out. You're going to be publishing this quarter, correct? And of course, check out siliconangle.com for all the news. And, really, we appreciate everybody watching. George, been a pleasure co-hosting with you. As always, really enjoyable. >> Alright, thanks Dave. >> Alright, so that's a wrap from Spark Summit. We're going to try to get out of here, hit the snowstorm, and work our way home. Thanks everybody for watching. A great job everyone here. Seth, Ava, Patrick and Alex. And thanks to our audience. This is theCUBE. We're out, see you next time. (lively music)

Published Date : Feb 9 2017

Wikibon Big Data Market Update Pt. 1 - Spark Summit East 2017 - #sparksummit - #theCUBE


 

>> [Announcer] Live from Boston, Massachusetts, this is theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> We're back, welcome to Boston, everybody, this is a special presentation that George Gilbert and I are going to provide to you now. SiliconANGLE Media is the umbrella brand of our company, and we've got three sub-brands. One of them is Wikibon, it's the research organization that George works in, and then of course, we have theCUBE and then SiliconANGLE, which is the tech publication, and then we extensively, as you may know, use CrowdChat and other social data, but we want to drill down now on the Wikibon, Wikibon research side of things. Wikibon was the first research company ever to do a big data forecast. Many, many years ago, our friend Jeff Kelly produced that for several years, we opensourced it, and it really, I think, helped the industry a lot, sort of framing the big data opportunity, and then George last year did the first Spark forecast, really Spark adoption, so what we want to do now is talk about some of the trends in the marketplace. This is going to be done in two parts, today is part one, and we're really going to talk about the overall market trends and the market conditions, and then we're going to go to part two tomorrow, where you're going to release some of the numbers, right? And we'll share some of the numbers today. So, we're going to start on the first slide here, we're going to share with you some slides. The Wikibon forecast review, and George is going to, I'm going to ask you to talk about where we are at with big data apps, everybody's saying it's peaked, big data's now going mainstream, where are we at with big data apps? >> [George] Okay, so, I want to quote, just to provide context, the former CTO of VMware, Steve Herrod. He said, "In the end, it wasn't big data, "it was big analytics." And what's interesting is that when we start thinking about it, there have been three classes of, there have been traditionally two classes of workloads, one batch, and in the context of analytics, that means running reports in the background, doing offline business intelligence, but then there was also the interactive-type work. What's emerging is something that's continuously happening, and it doesn't mean that all apps are going to be always on, it just means that all apps will have a batch component, an interactive component, like with the user, and then a streaming, or continuous component. >> [Dave] So it's a new type of workload. >> Yes. >> Okay. Anything else you want to point out here? >> Yeah, what's worth mentioning, this is, it's not like it's going to burst fully-formed out of the clouds, and become sort of a new standard, there's two things that have to happen, the technology has to mature, so right now you have some pretty tough trade-offs between integration, which provides simplicity, and choice and optimization, which gives you fragmentation, and then skillset, and both of those need to develop. >> [Dave] Alright, we're going to talk about both of those a little bit later in this segment. Let's go to the next slide, which really talks to some of the high-level forecast that we released last year, so these are last year's numbers, correct? >> Yes, yes. >> [Dave] Okay, so, what's changed? 
You've got the ogive curve, which is sort of the streaming penetration, Spark/streaming, that's what, was last year, this is now reflective of continuous, you'll be updating that, how is this changing, what do you want us to know here? >> [George] Okay, so the key takeaways here are, first, we took three application patterns, the first being the data lake, which is sort of the original canonical repository of all your data. That never goes away, but on top of it, you layer what we were calling last year systems of engagement, which is where you've got the interactive machine learning component helping to anticipate and influence a user's decision, and then on top of that, which was the aqua color, was the self-tuning systems, which is probably more IIoT stuff, where you've got a whole ecosystem of devices and intelligence in the cloud and at the edge, and you don't necessarily need a human in the loop. But, these now, when you look at them, you can break them down as having three types of workloads, the batch, the interactive, and the continuous. >> Okay, and that is sort of a new workload here, and this is a real big theme of your research now is, we all remember, no, we don't all remember, I remember punch cards, that's the ultimate batch, and then of course, the terminals were interactive, and you think of that as closer to real time, but now, this notion of continuous, if you go to the next slide, Patrick, we can take a look at how workloads are changing, so George, take us through that dynamic. >> [George] Okay so, to understand where we're going, sometimes it helps to look at where we've come from, and the traditional workloads, if we talk about applications, they were divided into, now, we talked about sort of batch versus interactive, but now, they were also divided into online transaction processing, operational application, systems of record, and then there was the analytic side, which was reporting on it, but this was sort of backward-looking reporting, and we begin to see some convergence between the two with web and mobile apps, where a user was interacting both with the analytics that informed an interaction that they might have. That's looking backwards, and we're going to take a quick look at some of the new technologies that augmented those older application patterns. Then we're going to go look at the emergent workloads and what they look like. >> Okay so, let's have a quick conversation about this before we go on to the next segment. Hadoop obviously was batch. It really was a way, as we've talked about today and many other days on theCUBE, a way to reduce the expense of doing data warehousing and business intelligence, I remember we were interviewing Jeff Hammerbacher, and he said, "When I was at Facebook, "my mission was to break the dependency "on the container, the storage container." So he really wanted to, needed to reduce costs, he saw that infrastructure needed to change, so if you look at the next slide, which is really sort of talking to Hadoop doing batch in traditional BI, take us through that, and then we'll sort of evolve to the future. >> Okay, so this is an example of traditional workloads, batch business intelligence, because Hadoop has not really gotten to the maturity point where you can really do interactive business intelligence. It's going to take a little more work. 
But here, you've basically put in a repository more data than you could possibly ever fit in a data warehouse, and the key is, this environment was very fragmented, there were many different engines involved, and so there was a high developer complexity, and a high operational complexity, and we're getting to the point where we can do somewhat better on the integration, and we're getting to the point where we might be able to do interactive business intelligence and start doing a little bit of advanced analytics like machine learning. >> Okay. Let's talk a little bit about why we're here, we're here 'cause it's Spark Summit, Spark was designed to simplify big data, simplify a lot of the complexity in Hadoop, so on the next slide, you've got this red line of Spark, so what is Spark's role, what does that red line represent? >> Okay, so the key takeaway from this slide is, couple things. One, it's interesting, but when you listen to Matei Zaharia, who is the creator of Spark, he said, "I built this to be a better MapReduce than MapReduce," which was the old crufty heart of Hadoop. And of course, they've stretched it far beyond their original intentions, but it's not the panacea yet, and if you put it in the context of a data lake, it can help you with what a data engineer does with exploring and munging the data, and what a data scientist might do in terms of processing the data and getting it ready for more advanced analytics, but it doesn't give you an end-to-end solution, not even within the data lake. The point of explaining this is important, because we want to explain how, even in the newer workloads, Spark isn't yet mature to handle the end-to-end integration, and by making that point, we'll show where it needs still more work, and where you have to substitute other products. >> Okay, so let's have a quick discussion about those workloads. Workloads really kind of drive everything, a lot of decisions for organizations, where to put things, and how to protect data, where the value is, so in this next slide you've got, you're juxtaposing traditional workloads with emerging workloads, so let's talk about these new continuous apps. >> Okay, so, this tees it up well, 'cause we focused on the traditional workloads. The emerging ones are where data is always coming in. You could take a big flow of data and sort of end it and bucket it, and turn it into a batch process, but now that we have the capability to keep processing it, and you want answers from it very near real time, you don't want to stop it from flowing, so the first one that took off like this was collecting telemetry about the operation and performance of your apps and your infrastructure, and Splunk sort of conquered that workload first. And then the second one, the one that everyone's talking about now is sort of Internet of Things, but more accurately, the Industrial Internet of Things, and that stream of data is, again, something you'll want to analyze and act on with as little delay as possible. The third one is interesting, asynchronous microservices. This is difficult, because this doesn't necessarily require a lot of new technology, so much as a new skillset for developers, and that's going to mean it takes off fairly slowly. Maybe new developers coming out of school will adopt it whole cloth, but this is where you don't rely on a big central database, this is where you break things into little pieces, and each piece manages itself. 
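To illustrate the asynchronous-microservices pattern George just described, here is a small, hypothetical Python sketch: a service that consumes events from a shared log and keeps the only state it needs in its own embedded store, rather than in a centralized external database. The topic name, broker address, and event fields are invented for illustration.

```python
# A minimal sketch, assuming a local Kafka broker and the kafka-python
# client; the 'orders' topic and event shape are invented for illustration.
import json
import sqlite3

from kafka import KafkaConsumer

# The service owns its data: a private, embedded store, not a shared DBMS.
db = sqlite3.connect("order_totals.db")
db.execute("""CREATE TABLE IF NOT EXISTS totals (
                  customer_id TEXT PRIMARY KEY,
                  amount REAL NOT NULL)""")

consumer = KafkaConsumer(
    "orders",                            # assumed topic
    bootstrap_servers="localhost:9092",  # assumed broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# Continuously fold incoming events into this service's local view.
for event in consumer:
    order = event.value  # e.g. {"customer_id": "c42", "amount": 19.99}
    db.execute(
        # Upsert (SQLite 3.24+): insert, or add to the running total.
        """INSERT INTO totals (customer_id, amount) VALUES (?, ?)
           ON CONFLICT(customer_id) DO UPDATE
           SET amount = amount + excluded.amount""",
        (order["customer_id"], order["amount"]),
    )
    db.commit()
```

The database requirements are deliberately primitive, which is George's point: each little piece of the app manages its own state, so this style drives enormous unit volume without much traditional database revenue.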
>> So you say the components of these arrows that you're showing, ingest, explore, process, serve, these are all sort of discrete elements of the data flow that you have to then integrate as a customer? >> [George] Yes, frankly, these are all steps that could be an end-to-end integrated process, but it's not yet mature enough really to do it end-to-end. For example, we don't even have a data store that can go all the way from ingest to serve, and by ingest, I mean taking the millions, potentially millions or more, of events per second coming in from your Internet of Things devices, the explore would be in that same data store, letting you visualize what's there, and process doing the analysis, and serving then is, from that same data store, letting your industrial devices, or your business intelligence workloads get real-time updates. For this to work as one whole, we need a data store, for example, that can go from end-to-end, in addition to the compute and analytic capabilities that go end-to-end. The point of this is, for continuous workloads, we do want to get to this integrated point somehow, sometime, but we're not there yet. >> Okay, let's go deeper, and take a look at the next slide, you've got this data feedback loop, and you've got this prediction on top of this, what does all that mean, let's double-click on that. >> Okay, so now we're unpacking the slide we just looked at, in that we're unpacking it into two different elements, one is what you're doing when you're running the system, and the next one will be what you're doing when you're designing it. And so for this one, what you're doing when you're running the system, I've grayed out the where's the data coming from and where's it going to, just to focus on how we're operating on the data, and again, to repeat the green part, which is storage, we don't have an end-to-end integrated store that could cost-effectively, scalably handle this whole chain of steps, but what we do have is that in the runtime, you're going to ingest the data, you're going to process it and make it ready for prediction, then there's a step that's called devops for data science, we know devops for developers, but devops for data science, as we're going to see, actually unpacks a whole 'nother level of complexity, but this devops for data science, this is where you get the prediction, of, okay, so, if this turbine is vibrating and has a heat spike, it means shut it down because something's going to fail. That's the prediction component, and the serve part then takes that prediction, and makes sure that that device gets it fast. >> So you're putting that capability in the hands of the data science component so they can effect that outcome virtually instantaneously? >> Yes, but in this case, the data scientist will have done that at design time. We're still at run time, so this is, once the data scientist has built that model, here, it's the engineer who's keeping it running. >> Yeah, but it's designed into the process, that's the devops analogy. Okay great, well let's go to that sort of next piece, which is design, so how does this all affect design, what are the implications there? >> So now, before, we had ingest, process, then prediction with devops for data science, and then serving; now when you're at design time, you ingest the data, and there's a whole unpacking of steps, which requires a handful, or two fistfuls, of tools right now to make it operate. 
This is to acquire the data, explore it, prepare it, model it, assess it, distribute it, all those things are today handled by a collection of tools that you have to stitch together, and then you have process at which could be typically done in Spark, where you do the analysis, and then serving it, Spark isn't ready to serve, that's typically a high-speed database, one that either has tons of data for history, or gets very, very fast updates, like a Redis that's almost like a cache. So the point of this is, we can't yet take Spark as gospel from end to end. >> Okay so, there's a lot of complexity here. >> [George] Right, that's the trade-off. >> So let's take a look at the next slide, which talks to where that complexity comes from, let's look at it first from the developer side, and then we'll look at the admin, so, so on the next slide, we're looking at the complexity from the dev perspective, explain the axes here. >> Okay, okay. So, there's two axes. If you look at the x-axis at the bottom, there's ingest, explore, process, serve. Those were the steps at a high level that we said a developer has to master, and it's going to be in separate products, because we don't have the maturity today. Then on the y-axis, we have some, but not all, this is not an exhaustive list of all the different things a developer has to deal with, with each product, so the complexity is multiplying all the steps on the y-axis, data model, addressing, programming model, persistence, all the stuff's on the y-axis, by all the products he needs on the x-axis, it's a mess, which is why it's very, very hard to build these types of systems today. >> Well, and why everybody's pushing on this whole unified integration, that was a major thing that we heard throughout the day today. What about from the admin's side, let's take a look at the next slide, which is our last slide, in terms of the operational complexity, take us through that. >> [George] Okay, so, the admin is when the system's running, and reading out the complexity, or inferring the complexity, follows the same process. On the y-axis, there's a separate set of tasks. These are admin-related. Governance, scheduling and orchestration, a high availability, all the different types of security, resource isolation, each of these is done differently for each product, and the products are on the x-axis, ingest, explore, process, serve, so that when you multiply those out, and again, this isn't exhaustive, you get, again, essentially a mess of complexity. >> Okay, so we got the message, if you're a practitioner of these so-called big data technologies, you're going to be dealing with more complexity, despite the industry's pace of trying to address that, but you're seeing new projects pop up, but nonetheless, it feels like the complexity curve is growing faster than customer's ability to absorb that complexity. Okay, well, is there hope? >> Yes. But here's where we've had this conundrum. The Apache opensource community has been the most amazing source of innovation I think we've ever seen in the industry, but the problem is, going back to the amazing book, The Cathedral and the Bazaar, about opensource innovation versus top-down, the cathedral has this central architecture that makes everything fit together harmoniously, and beautifully, with simplicity. But the bazaar is so much faster, 'cause it's sort of this free market of innovation. 
The Apache ecosystem is the bazaar, and the burden is on the developer and the administrator to make it work together, and it was most appropriate for the big internet companies that had the skills to do that. Now, the companies that are distributing these Apache opensource components are doing a Herculean job of putting them together, but they weren't designed to fit together. On the other hand, you've got the cloud service providers, who are building, to some extent, services that have standard APIs that might've been supported by some of the Apache products, but they have proprietary implementations, so you have lock-in, but they have more of the cathedral-type architecture that-- >> And they're delivering them as services, even though, actually, many of those data services, discrete APIs as you point out, are proprietary. Okay, so, very useful, George, thank you, if you have questions on this presentation, you can hit Wikibon.com and fire off a question to us, we'll make sure it gets to George and gets answered. This is part one, part two tomorrow is where we're going to dig into some of the numbers, right? So if you care about where the trends are, what the numbers look like, what the market size looks like, we'll be sharing that with you tomorrow, all this stuff, of course, will be available on-demand, we'll be doing CrowdChats on this, George, excellent job, thank you very much for taking us through this. Thanks for watching today, it is a wrap of day one, Spark Summit East, we'll be back live tomorrow from Boston, this is theCUBE, so check out siliconangle.com for a review of all the action today, all the news, check out Wikibon.com for all the research, siliconangle.tv is where we house all these videos, check that out, we start again tomorrow at 11 o'clock east coast time, right after the keynotes, this is theCUBE, we're at Spark Summit, #SparkSummit, we're out, see you tomorrow. (electronic music jingle)

Published Date : Feb 8 2017


Dr. Tendü Yoğurtçu, Syncsort | CUBEConversation, November 2019


 

(energetic music) >> Hi, and welcome to another Cube Conversation, where we go in-depth with the thought leaders in the industry that are making significant changes to how we conduct digital business and the likelihood of success with digital business transformations. I'm Peter Burris. Every organization today has some experience with the power of analytics. But, they're also learning that the value of their analytics systems are, in part, constrained and determined by their access to core information. Some of the most important information that any business can start to utilize within their new advanced analytic systems, quite frankly, is that operational business information that the business has been using to run the business on for years. Now, we've looked at that as silos, and maybe it is. Although, partly, that's in response to the need to have good policy, good governance, and good certainty and predictability in how the system behaves and how secure it's going to be. So, the question is, how do we marry the new world of advanced analytics with the older, but, nonetheless, extremely valuable world of operational processing to create new types of value within digital business today? It's a great topic and we've got a great conversation. Tendü Yogurtçu is the CTO of Syncsort. Tendü, welcome back to theCUBE! >> Hi Peter. It's great to be back here in theCUBE. >> Excellent! So, look, let's start with the, let's start with a quick update on Syncsort. How are you doing, what's going on? >> Oh, it's been a really exciting time at Syncsort. We have seen tremendous growth in the last three years. We quadrupled our revenue, and also number of employees, through both organic innovation and growth, as well as through acquisitions. So, we now have 7,000 plus customers in over 100 countries, and we still have 84 of the Fortune 100, serving large enterprises. It's been a really great journey. >> Well, so, let's get into the specific distinction that you guys have. At Wikibon and theCUBE, we've observed, we predicted that 1919, 2019 rather, was going to be the year that the enterprise asserted itself in the cloud. We had seen a lot of developers drive cloud forward. We've seen a lot of analytics drive cloud forward. But, now as enterprises are entering into cloud in a big way, they're generating, or bringing with them, new types of challenges and issues that have to be addressed. So, when you think about where we are in the journey to more advanced analytics, better operational certainty, greater use of information, what do you think the chief challenges that customers face today are? >> Of course, as you mentioned, that everybody, every organization is trying to take advantage of the data. Data is the core. And, take advantage of the digital transformation to enable them for taking, getting more value out of their data. And, in doing so, they are moving into cloud, into hybrid cloud architectures. We have seen early implementations, starting with the data lake. Everybody started creating the centralized data hub, enabling advanced analytics and creating a data marketplace for their internal, or external clients. And, the early data lakes were utilizing Hadoop on-premise architectures. Now, we are also seeing data lakes, sometimes, expanding over hybrid or cloud architectures. The challenges that these organizations also started realizing is around, once I create this data marketplace, the access to the data, critical customer data, critical product data, >> Order data. 
>> Order data, is a bigger challenge than I thought it would be in the pilot project. Because, these critical data sets, and core data sets, often in financial services, banking and insurance, and health care, are in environments, data platforms, that these companies have invested in over multiple decades. And, I'm not referring to that as legacy, because the definition of legacy changes. These environments, these platforms, have been holding these critical data assets for decades successfully. So-- >> We call them high-value traditional applications. >> High-value traditional sounds great. >> Because, they're traditional. We know what they do, and there's a certain operational certainty, and we've built up the organization around them to take care of those assets. >> But, they still are very very high-value. >> Exactly. And, making those applications and data available for next generation, next wave platforms, is becoming a challenge, for a couple of different reasons. One, accessing this data. And, accessing this data, making sure the policies and the security, and the privacy around these data stores are preserved when the data is available for advanced analytics. Whether it's in the cloud or on-premise deployments. >> So, before we go to the second one, I want to make sure I'm understanding that, because it seems very very important. >> Yes. >> That, what you're saying is, if I may, the data is not just the ones and the zeroes in the file. The data really needs to start being thought of as the policies, the governance, the security, and all the other attributes and elements, the metadata, if you will, has to be preserved as the data's getting used. >> Absolutely. And, there are challenges around that, because now you have to have skill sets to understand the data in those different types of stores. Relational data warehouses. Mainframe, IBM i, SQL Server, Oracle. Many different data owners, and different teams in the organization. And, then, you have to make sense of it and preserve the policies around each of these data assets, while bringing it to the new analytics environments. And, make sure that everybody's aligned with the access to privacy, and the policies, and the governance around that data. And also, mapping the metadata to the target systems, right? That's a big challenge, because somebody who understands these data sets in a mainframe environment is not necessarily understanding the cloud data stores or the new data formats. So, how do you, kind of, bridge that gap, and map into the target-- >> And, vice-versa, right? >> Yes. >> So. >> Likewise, yes. >> So, this is where Syncsort starts getting really interesting. Because, as you noted, a lot of the folks in the mainframe world may not have the familiarity with how the cloud works, and a lot of the folks, at least from a data standpoint. >> Yes. >> And, a lot of the folks in the cloud that have been doing things with object stores and whatnot, may not, and Hadoop, may not have the knowledge of how the mainframe works. And, so, those two sides are seeing silos, but, the reality is, both sides have set up policies and governance models, and security regimes, and everything else, because it works for the workloads that are in place on each side. So, Syncsort's an interesting company, because, you guys have experience of crossing that divide. >> Absolutely. And, we see both the next phase, and the existing data platforms, as a moving, evolving target. Because, these challenges existed 20 years ago, 10 years ago. 
It's just the platforms were different. The volume, the variety, complexity was different. However, Hadoop, five, ten years ago, was the next wave. Now, it's the cloud. Blockchain will be the next platform that we have to, still, kind of, adopt and make sure that we are advancing our data and creating value out of data. So, that's, accessing and preserving those policies is one challenge. And, then, the second challenge is that as you are making these data sets available for analytics, or machine learning, data science applications, deduplicating, standardizing, cleansing, making sure that you can deliver trusted data becomes a big challenge. Because, if you train the models with the bad data, if you create the models with the bad data, you have a bad model, and then bad insights. So, machine learning and artificial intelligence depends on the data, and the quality of the data. So, it's not just bringing all enterprise data for analytics. It's also making sure that the data is delivered in a trusted way. That's the big challenge. >> Yeah. Let me build on that, if I may, Tendü. Because, a lot of these tools involve black box belief in what the tool's performing. >> Correct. >> So, you really don't have a lot of visibility into the inner workings of how the algorithm is doing things. It's, you know, that's the way it is. So, in many respects, your only real visibility into the quality of the outcome of these tools is visibility into the quality of the data that's going into the building of these models. >> Correct. >> Have I got that right? >> Correct. And, in machine learning, the effect of bad data is, really, it multiplies. Because of the training of the model, as well as insights. And, with Blockchain, in the future, it will also become very critical because, once you load the data into a Blockchain platform, it's immutable. So, data quality comes at a higher price, in some sense. That's another big challenge. >> Which is to say, that if you load bad data into a Blockchain, it's bad forever. >> Yes. That's very true. So, that's, obviously, another area that Syncsort, as we are accessing all of the enterprise data, delivering high-quality data, discovering and understanding the data, and delivering the deduplicated, standardized, enriched data to the machine learning and AI pipeline, and analytics pipeline, is an area that we are focused on with our products. And, a third challenge is that, as you are doing it, the speed starts mattering. Because, okay, I created the data lake or the data hub. The next big use case we started seeing is that, "Oh yeah, but I have 20 terabyte data, "only 10% is changing on a nightly basis. "So, how do I keep my data lake in sync? "Not only that, I want to keep my data lake in sync, "I also would like to feed that change data "and keep my downstream applications in sync. "I want to feed the change data to the microservices "in the cloud." That speed of delivery started really becoming a very critical requirement for the business. >> Speed, and the targeting of the delivery. >> Speed of the targeting, exactly. Because, I think the bottom line is, you really want to create an architecture where you can be agnostic. And, also be able to deliver at the speed the business is going to require at different times. Sometimes it's near real-time, and batch; sometimes it's real-time, and you have to feed the changes as quickly as possible to the consumer applications and the microservices in the cloud. 
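As a concrete picture of the keep-the-lake-in-sync pattern Tendü describes, here is a small, hypothetical Python sketch: instead of reloading a multi-terabyte source nightly, a sync job pulls only the rows changed since a stored high-water mark and ships that delta downstream. The table, columns, and the SQLite stand-in for the operational system are all assumptions; production change-data-capture tools typically read the database log rather than polling a timestamp column.

```python
# A minimal sketch of watermark-based incremental sync; assumes an
# operational database with an 'orders' table carrying an updated_at column.
import json
import os
import sqlite3  # stand-in for the real operational source

WATERMARK_FILE = "last_sync.txt"  # remembers where the previous run stopped

def load_watermark():
    """Timestamp of the last row already shipped (epoch start on first run)."""
    if os.path.exists(WATERMARK_FILE):
        with open(WATERMARK_FILE) as f:
            return f.read().strip()
    return "1970-01-01T00:00:00"

def save_watermark(ts):
    with open(WATERMARK_FILE, "w") as f:
        f.write(ts)

conn = sqlite3.connect("operational.db")
since = load_watermark()

# Only the changed rows, the ~10% delta, ever leave the source system.
rows = conn.execute(
    "SELECT id, payload, updated_at FROM orders "
    "WHERE updated_at > ? ORDER BY updated_at",
    (since,),
).fetchall()

# Feed the delta downstream: appended to lake files here, but the same
# batch could just as well be published to a queue for cloud microservices.
with open("delta.jsonl", "a") as out:
    for row_id, payload, updated_at in rows:
        out.write(json.dumps({"id": row_id, "payload": payload,
                              "ts": updated_at}) + "\n")

if rows:
    save_watermark(rows[-1][2])  # advance the high-water mark
```

Run nightly, this moves roughly a tenth of the data instead of all of it, which is what makes near-real-time delivery to downstream applications feasible.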
>> Well, we've got a lot of CIOs who are starting to ask us questions about, especially, since they start thinking about Kubernetes, and Istio, and other types of platforms that are intended to facilitate the orchestration, and ultimately, the management of how these container-based applications work. And, we're starting to talk more about the idea of data assurance. Make sure the data's good. Make sure it's been high-quality. Make sure it's being taken care of. But, also make sure that it's targeted where it needs to be. Because, you don't want a situation where you spin up a new cluster, which you could do very quickly with Kubernetes. But, you haven't made the data available to that Kubernetes-based application, so that it can, actually, run. And, a lot of CIOs, and a lot of application development leaders, and a lot of business people, are now starting to think about that. "How do I make sure the data is where it needs to be, "so that the applications run when they need to run?" >> That's a great point. And, going back to your, kind of, comment around cloud, and taking advantage of cloud architectures. One of the things we have observed is organizations, for sure, looking at cloud, in terms of scalability, elasticity, and reducing costs. They did lift and shift of applications. And, not all applications can take advantage of cloud elasticity when you do that. Most of these applications are created for the existing on-premise fixed architectures. So, they are not designed to take advantage of that. And, we are seeing a shift now. And, the shift is around, instead of, trying to, kind of, lift and shift existing applications. One, for new applications, let me try and adopt the technology assets, like you mentioned Kubernetes, so that I can stay vendor-agnostic across cloud vendors. But, more importantly, let me try to have some best practices in the organization. The new applications can be created to take advantage of the elasticity. Even though, they may not be running in the cloud yet. So, some organizations refer to this as cloud native, cloud first, some different terms. And, make the data. Because, the core asset here, is always the data. Make the data available, instead of going after the applications. Make the data from these existing on-premise and different platforms available for cloud. We are definitely seeing that shift. >> Yeah, and make sure that it, and assure, that that data is high-quality, carries the policies, carries the governance, doesn't break the security models, all those other things. >> That is a big difference between how, actually, organizations ran into their Hadoop data lake implementations, versus the cloud architectures now. Because, when initial Hadoop data lake implementations happened, it was dump all the data. And, then, "Oh, I have to deal with the data quality now." >> It was also, "Oh, those mainframe people just would, "they're so difficult to work with." Meanwhile, you're still closing the books on a monthly basis, on a quarterly basis. You're not losing orders. Your customers aren't calling you on the phone angry. And, that, at the end of the day, is what a business has to do. You have to be able to extend what you can currently do, with a digital business approach. And, if you can replace certain elements of it, okay. But, you can't end up with less functionality as you move forward in the cloud. >> Absolutely. And, it's not just mainframe. It's IBM i, it's the Oracle, it's the Teradata, it's the TDZa. 
It's growing rapidly, in terms of the complexity of that data infrastructure. And, for cloud, we are seeing now, a lot of pilots are happening with the cloud data warehouses. And, trying to see if the cloud data warehouses can accommodate some of these hybrid deployments. And, also, we are seeing, there's more focus, not after the fact, but, more focus on data quality from day one. "How am I going to ensure that "I'm delivering trusted data, and populating "the cloud data stores, or delivering trusted data "to microservices in the cloud?" There's greater focus on both governance and quality. >> So, high-quality data movement, that leads to high-quality data delivery, in ways that the business can be certain that whatever derivative work is done remains high-quality. >> Absolutely. >> Tendü Yogurtçu, thank you very much for being, once again, on theCUBE. It's always great to have you here. >> Thank you Peter. It's wonderful to be here! >> Tendü Yogurtçu's the CTO of Syncsort, and once again, I want to thank you very much, for participating in this cloud, or this Cube conversation. Cloud on the mind, this Cube conversation. Until next time. (upbeat electronic music)

Published Date : Nov 20 2019


Tendü Yogurtçu, Syncsort


 

(upbeat music) >> Hi, and welcome to another Cube Conversation, where we go in depth with the thought leaders in the industry who are making significant changes to how we conduct digital business, and to the likelihood of success with digital business transformations. I'm Peter Burris. Every organization today has some experience with the power of analytics, but they're also learning that the value of their analytic systems is, in part, constrained and determined by their access to core information. Some of the most important information that any business can start to utilize within their new advanced analytic systems, quite frankly, is the operational business information that the business has been using to run itself for years. Now, we've looked at that as silos, and maybe it is, although partly that's in response to the need to have good policy, good governance, and good certainty and predictability in how the system behaves, and how secure it's going to be. So, the question is, how do we marry the new world of advanced analytics with the older, but nonetheless extremely valuable, world of operational processing, to create new types of value within digital business today? It's a great topic, and we've got a great conversation. Tendü Yogurtçu is the CTO of Syncsort. Tendü, welcome back to theCube. >> Hi Peter, it's great to be back in theCube. >> Excellent. So, look, let's start with a quick update on Syncsort. How are you doing? What's going on? >> Oh, it's been a really exciting time at Syncsort. We have seen tremendous growth in the last three years. We quadrupled our revenue and our number of employees, through organic innovation and growth as well as through acquisitions. So, we now have 7,000 plus customers in over 100 countries, and we still have 84 of the Fortune 100, serving large enterprises. It's been a really great journey. >> Well, so let's get into the specific distinction that you guys have. At Wikibon theCube, we predicted that 2019 was going to be the year that the enterprise asserted itself in the cloud. We had seen a lot of developers drive cloud forward, we've seen a lot of analytics drive cloud forward, but now, as enterprises are entering into cloud in a big way, they're generating or bringing with them new types of challenges and issues that have to be addressed. So, when you think about where we are in the journey to more advanced analytics, better operational certainty, and greater use of information, what do you think the chief challenges that customers face today are? >> Of course. As you mentioned, every organization is trying to take advantage of its data; data is the core, and the point of the digital transformation is to enable them to get more value out of that data. And, in doing so, they are moving into cloud, into hybrid cloud architectures. We have seen early implementations starting with the data lake. Everybody started creating this centralized data hub, enabling advanced analytics and creating a data marketplace for their internal or external clients. The early data lakes were utilizing Hadoop on on-premise architectures; now we are also seeing data lakes expanding over hybrid or cloud architectures. The challenge that these organizations also started realizing is that, once I create this data marketplace, the access to the data, critical customer data, critical product data-- >> Order data.
>> Order data, is a bigger challenge than it seemed it would be in the pilot project, because these critical data sets, core data sets, often in financial services, banking, insurance, and healthcare, live in environments, data platforms, that these companies have invested in over multiple decades. And I'm not referring to that as legacy, because the definition of legacy changes; these environments and platforms have been holding these critical data assets successfully for decades. >> We call them high-value traditional applications, because with the traditional ones we know what they do, there's a certain operational certainty, and we've built up, you know, the organization around them to take care of those assets, but they still are very, very high-value. >> Exactly, and making those applications and data available for next generation, next wave platforms is becoming a challenge for a couple of different reasons. One, accessing this data, and, in accessing this data, making sure the policies and the security and the privacy around these data stores are preserved when the data is made available for advanced analytics, whether in cloud or on-premise deployments. >> So, before you go to the second one, I want to make sure I understand that, because it seems very, very important. What you're saying is, if I may, the data is not just the ones and the zeros in the file; the data really needs to start being thought of as the policies, the governance, the security, and all the other attributes and elements. The metadata, if you will, has to be preserved as the data is getting used. >> Absolutely, and there are challenges around that, because now you have to have the skillsets to understand the data in those different types of stores, relational data warehouses, Mainframe, IBM i, SQL, Oracle, many different data owners and different teams in the organization, and then you have to make sense of it and preserve the policies around each of these data assets while bringing them to the new analytics environments. And make sure that everybody is aligned on the access, the privacy, and the policies and governance around that data. And also mapping the metadata to the target systems, right? That's a big challenge, because somebody who understands these data sets in a Mainframe environment does not necessarily understand the cloud data stores or the new data formats, so how do you bridge that gap and map into the target environment? >> And vice versa, right? >> Likewise, yes. >> This is where Syncsort starts getting really interesting, because, as you noted, a lot of the folks in the Mainframe world may not be familiar with how the cloud works, and a lot of the folks in the cloud, who have been doing things with object stores and Hadoop and whatnot, may not have the knowledge of how the Mainframe works. And so those two sides are seeing silos, but the reality is both sides have set up policies and governance models and security regimes and everything else, because it works for the workloads that are in place on each side. >> Absolutely. >> So Syncsort's an interesting company, because you guys have experience of crossing that divide.
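
To make the "more than ones and zeros" point concrete, here is a minimal sketch of a manifest that travels with a dataset so its policies survive the move from a source platform to a cloud target. It illustrates the idea under discussion, not Syncsort's implementation; all field names and the policy vocabulary are invented.

```python
# Sketch: governance metadata that moves with the data. Field names and
# policy vocabulary are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DatasetManifest:
    name: str
    source_platform: str              # e.g. "mainframe", "ibm-i", "oracle"
    target_store: str                 # e.g. "s3://lake/curated/orders/"
    classification: str               # e.g. "pii", "internal", "public"
    retention_days: int
    allowed_roles: list[str] = field(default_factory=list)

    def validate_for_target(self) -> None:
        # The policy must be preserved across the move: refuse to publish
        # a PII dataset that arrives without its access-control roles.
        if self.classification == "pii" and not self.allowed_roles:
            raise ValueError(f"{self.name}: PII dataset lost its "
                             "allowed_roles in transit; refusing to publish.")

orders = DatasetManifest(
    name="order-history",
    source_platform="mainframe",
    target_store="s3://lake/curated/orders/",
    classification="pii",
    retention_days=2555,
    allowed_roles=["analytics-reader"],
)
orders.validate_for_target()  # raises if the governance metadata was dropped
```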
>> Absolutely, and we see both the next wave and the existing data platforms as a moving, evolving target, because these challenges existed twenty years ago and ten years ago; it's just that the platforms were different, and the volume, the variety, the complexity were different. Hadoop, five to ten years ago, was the next wave; now it's the cloud; blockchain will be the next platform, and we still have to adapt and make sure that we are advancing our data and creating value out of it. So accessing the data, and preserving those policies, is one challenge. And then the second challenge is that, as you are making these data sets available for analytics or machine learning and data science applications, you are de-duplicating, standardizing, and cleansing, making sure that you can deliver trusted data. That becomes a big challenge, because if you train the models with bad data, if you create the models with bad data, you have a bad model, and then bad insights. So, machine learning and artificial intelligence depend on the data and the quality of the data; it's not just bringing all enterprise data for analytics, it's also making sure that the data is delivered in a trusted way. That's a big challenge. >> Yeah, let me build on that if I may, Tendü, because a lot of these tools involve black-box belief in what the tool is performing. >> Correct. >> So you really don't have a lot of visibility into the inner workings of how the algorithm is doing things. That's just the way it is. So, in many respects, your only real visibility into the quality of the outcome of these tools is visibility into the quality of the data that's going into the building of these models. Have I got that right? >> Correct. And in machine learning, the effect of bad data really multiplies, because of the training of the model as well as the insights. And with blockchain, in the future, it will also become very critical, because once you load the data into a blockchain platform, it's immutable. So data quality comes at a higher price, in some sense. That's another big challenge. >> Which is to say that if you load bad data into a blockchain, it's bad forever. >> Yes, that's very true. So that's obviously another area that Syncsort is focused on with our products: accessing all of the enterprise data, delivering high-quality data, discovering and understanding the data, and delivering the de-duplicated, standardized, enriched data to the machine learning and AI pipeline and the analytics pipeline. And the third challenge is that, as you are doing this, speed starts mattering, because, okay, I created the data lake or the data hub, and the next big use case we started seeing is, "Oh yeah, but I have twenty terabytes of data, and only ten percent is changing on a nightly basis, so how do I keep my data lake in sync? Not only that, I also would like to feed that changed data and keep my downstream applications in sync. I want to feed the changed data to the microservices in the cloud." That speed of delivery started becoming a very critical requirement for the businesses. >> Speed, and the targeting of the delivery. >> Speed and the targeted delivery, exactly. Because, I think, the bottom line is you really want to create an architecture where you can be agnostic and also be able to deliver at the speed the business is going to require at different times.
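
The "twenty terabytes, ten percent changing nightly" requirement is the classic change-data-capture pattern: move only the rows past a watermark, then fan them out. A minimal, hypothetical sketch of that pattern follows; it assumes a source table with an `updated_at` column, and the table, column, and helper names are invented.

```python
# Sketch: watermark-driven incremental sync, so only changed rows reach
# the lake and the downstream consumers. Table, column, and helper names
# are hypothetical; sqlite3 stands in for the real source connection.
import sqlite3

def sync_changes(source: sqlite3.Connection, last_watermark: str) -> str:
    rows = source.execute(
        "SELECT id, payload, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,),
    ).fetchall()
    for row_id, payload, updated_at in rows:
        upsert_into_lake(row_id, payload)      # keep the data lake in sync
        publish_to_consumers(row_id, payload)  # feed downstream microservices
    # Advance the watermark only after the whole batch has landed.
    return max((r[2] for r in rows), default=last_watermark)

def upsert_into_lake(row_id, payload):
    pass  # placeholder: write to the lake / cloud data store

def publish_to_consumers(row_id, payload):
    pass  # placeholder: emit a change event to a queue or microservice
```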
Sometimes it's near real time, in a batch; sometimes it's real time, and you have to feed the changes as quickly as possible to the consumer applications and the microservices in the cloud. >> Well, we've got a lot of CIOs who are starting to ask us questions about that, especially as they start thinking about Kubernetes and Istio and other types of platforms that are intended to facilitate the orchestration, and ultimately the management, of how these container-based applications work. And we're starting to talk more about the idea of data assurance. Make sure the data is good, make sure it's high-quality, make sure it's being taken care of, but also make sure that it's targeted where it needs to be, because you don't want a situation where you spin up a new cluster, which you can do very quickly with Kubernetes, but you haven't made the data available to that Kubernetes-based application so that it can actually run. And a lot of CIOs and a lot of application development leaders and a lot of business people are now starting to think about that. How do I make sure the data is where it needs to be, so that the applications run when they need to run? >> That's a great point, and, going back to your comment around the cloud and taking advantage of cloud architectures: one of the things we have observed is organizations looking at cloud in terms of scalability, elasticity, and reducing costs. They did lift-and-shift of applications, and not all applications can take advantage of cloud elasticity when you do that. Most of these applications were created for the existing on-premise, fixed architectures, so they are not designed to take advantage of it. And we are seeing a shift now, and the shift is, instead of trying to lift and shift the existing applications: one, for new applications, let me adopt technology assets, like you mentioned Kubernetes, so that I can stay vendor-agnostic across cloud vendors; but, more importantly, let me have some best practices in the organization so that new applications can be created to take advantage of the elasticity, even though they may not be running in the cloud yet. Some organizations refer to this as cloud native, or cloud first, some different terms. And make the data, because the core asset here is always the data, make the data available; instead of going after the applications, make the data from these existing on-premise and different platforms available for cloud. We are definitely seeing that shift. >> Yeah, and make sure, and assure, that that data is high-quality, carries the policies, carries the governance, doesn't break the security models, all those other things. >> That is a big difference between how organizations actually went into their Hadoop data lake implementations versus the cloud architectures now, because, when the initial Hadoop data lake implementations happened, it was "dump all the data," and then, "Oh, I have to deal with the data quality now." >> No, it was also, "Oh, those Mainframe people, they're so difficult to work with." Meanwhile, you're still closing the books on a monthly basis, on a quarterly basis, you're not losing orders, your customers aren't calling you on the phone angry, and that, at the end of the day, is what a business has to do. You have to be able to extend what you can currently do with a digital business approach, and if you can replace certain elements of it, okay. But you can't end up with less functionality as you move forward into the cloud.
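
One common reading of "designed for elasticity even if not in the cloud yet" is simply that nothing environment-specific is baked into the application. A small, hypothetical sketch of that practice; the variable names are illustrative only.

```python
# Sketch: externalize environment-specific settings so the same code
# runs on a fixed on-premise footprint or an elastic cloud one.
# Variable names are hypothetical.
import os

STORAGE_ENDPOINT = os.environ.get("STORAGE_ENDPOINT", "file:///data/warehouse")
WORKER_COUNT = int(os.environ.get("WORKER_COUNT", "4"))  # dial up in the cloud

def describe_runtime() -> str:
    # Capacity and storage location arrive from the environment, not the
    # code, so the application is agnostic to where it is deployed.
    return f"{WORKER_COUNT} workers against {STORAGE_ENDPOINT}"

print(describe_runtime())
```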
>> Absolutely, and it's not just Mainframe; it's IBM i, it's Oracle, it's Teradata, it's the DTSA; it's growing rapidly in terms of the complexity of that data infrastructure. And for cloud, we are seeing now a lot of pilots happening with the cloud data warehouses, trying to see if the cloud data warehouses can accommodate some of these hybrid deployments, and we are also seeing more focus, not after the fact, but from day one, on data quality. How am I going to ensure that I'm delivering trusted data and populating the cloud data stores, or delivering trusted data to microservices in the cloud? There is a greater focus on both governance and quality. >> So, high-quality data movement that leads to high-quality data delivery, in ways that the business can be certain that whatever derivative work is done remains high-quality. >> Absolutely. >> Tendü Yogurtçu, thank you very much for being once again on theCube, it's always great to have you here. >> Thank you, Peter, it's wonderful to be here. >> Tendü Yogurtçu is the CTO of Syncsort, and, once again, I want to thank you very much for participating in this cloud, or rather this Cube, Conversation. Cloud on the mind. Until next time. (upbeat music)
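
"Bad data in, bad model out" argues for a quality gate ahead of the training pipeline rather than after it. The sketch below is a hypothetical illustration, not a Syncsort product; the thresholds and field names are made up.

```python
# Sketch: quarantine a training batch that fails basic trust checks
# before it can poison a model. Thresholds and field names are
# hypothetical.
def passes_quality_gate(records: list[dict]) -> bool:
    if not records:
        return False
    required = {"customer_id", "amount"}
    complete = sum(
        1 for r in records
        if required <= r.keys() and all(r[f] is not None for f in required)
    )
    ids = [r.get("customer_id") for r in records]
    duplicate_ratio = (len(ids) - len(set(ids))) / len(records)
    # Demand at least 99% complete records and under 1% duplicates.
    return complete / len(records) >= 0.99 and duplicate_ratio < 0.01

batch = [{"customer_id": 1, "amount": 10.0}, {"customer_id": 2, "amount": 12.5}]
if passes_quality_gate(batch):
    print("Batch trusted; handing off to model training.")
else:
    print("Batch quarantined for cleansing before training.")
```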

Published Date : Nov 5 2019
