Yaron Haviv | BigData SV 2017

>> Announcer: Live from San Jose, California, it's the CUBE, covering Big Data Silicon Valley 2017. (upbeat synthesizer music) >> Live with the CUBE coverage of Big Data Silicon Valley or Big Data SV, #BigDataSV in conjunction with Strata + Hadoop. I'm John Furrier with the CUBE and my co-host George Gilbert, analyst at Wikibon. I'm excited to have our next guest, Yaron Haviv, who's the founder and CTO of iguazio, just wrote a post up on SiliconANGLE, check it out. Welcome to the CUBE. >> Thanks, John. >> Great to see you. You're in a guest blog this week on SiliconANGLE, and always great on Twitter, 'cause Dave Vellante always likes to bring you into the contentious conversations. >> Yaron: I like the controversial ones, yes. (laughter) >> And you add a lot of good color on that. So let's just get right into it. So your company's doing some really innovative things. We were just talking before we came on camera here, about some of the amazing performance improvements you guys have on many different levels. But first take a step back, and let's talk about what this continuous analytics platform is, because it's unique, it's different, and it's got impact. Take a minute to explain. >> Sure, so first a few words on iguazio. We're developing a data platform which is unified, so basically it can ingest data through many different APIs, and it's more like a cloud service. It is for on-prem and edge locations and co-location, but it's managed more like a cloud platform, so a very similar experience to Amazon. >> John: It's software? >> It's software. We do integrate a lot with hardware in order to achieve our performance, which is really about 10 to 100 times faster than what exists today. We've talked to a lot of customers, and what we really want to focus on with customers is solving business problems, because I think a lot of the Hadoop camp started more with solving IT problems. 
So IT is kicking tires, and eventually failing, based on your statistics and Gartner's statistics. So what we really wanted to solve is big business problems. We figured out that this notion of pipeline architecture, where you ingest data, and then curate it, and fix it, et cetera, which was very good for the early days of Hadoop, if you think about how Hadoop started, it was page ranking from Google. There was no time sensitivity. You could take days to calculate it and recalibrate your search engine. Based on new research, everyone is now looking for real-time insights. So there is sensory data from (mumbles), there's stock data from exchanges, there is fraud data from banks, and you need to act very quickly. So this notion of, and I can give you examples from customers, this notion of taking data, creating Parquet files and log files, and storing them in S3 and then taking Redshift and analyzing them, and then maybe a few hours later having an insight, this is not going to work. And what you need to fix is, you have to put some structure into the data. Because if you need to update a single record, you cannot just create a huge file of 10 gigabytes and then analyze it. So what we did is, basically, a mechanism where you ingest data. As you ingest the data, you can run multiple different processes on the same thing. And you can also serve the data immediately, okay? And two examples that we demonstrate here in the show, one is video surveillance, a very nice movie-style example, where you, basically, ingest pictures through the S3 API, through the object API, you analyze the picture to detect faces, to detect scenery, to extract geolocation from pictures and all that, all those through different processes. TensorFlow doing one, serverless functions that we have doing other, simpler tasks. And at the same time, you can have dashboards that just show everything. And you can have Spark, that basically does queries of where was this guy last seen? 
Or who was he with, you know, or think about the Boston Bomber example. You could just do it in real time. Because you don't need this notion of pipeline. And this solves very hard business problems for some of the customers we work with. >> So that's the key innovation, there's no pipelining. And what's the secret sauce? >> So first, our system does about a couple of million transactions per second. And we are a multi-model database. So, basically, you can ingest data as a stream, and exactly the same data could be read by Spark as a table. So you could, basically, issue a query on the same data. Give me everything that has a certain pattern or something, and it could also be served immediately through RESTful APIs to a dashboard running AngularJS or something like that. So that's the secret sauce: by having this integration, and this unique data model, it allows all those things to work together. There are other aspects, like we have transactional semantics. One of the challenges is how do you make sure that a bunch of processes don't collide when they update the same data. So first you need very low granularity, 'cause each one may update a different field. Like this example that I gave with GeoData, the serverless function that does the GeoData extraction only updates the GeoData fields within the records. And maybe TensorFlow updates information about the image in a different location in the record or, potentially, a different record. So you have to have that, along with transaction safety, along with security. We have very tight security at the field level, identity level. So that's re-thinking the entire architecture. And I think what many of the companies you'll see at the show, they'll say, okay, Hadoop is a given, let's build some sort of convenience tools around it, let's do some scripting, let's do automation. But sort of the underlying thing, I won't use dirty words, is not well-equipped for the new challenges of real time. 
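(Editor's illustration, not part of the spoken interview: the field-level update idea described above can be sketched roughly as follows. This is a minimal in-memory toy, not iguazio's implementation or API; `RecordStore`, `update_fields`, the record key, and the field names are all hypothetical. Two workers update different fields of the same record, and neither overwrites the other's fields.)

```python
import threading


class RecordStore:
    """Toy in-memory store with field-level updates (hypothetical, for illustration)."""

    def __init__(self):
        self._records = {}
        self._lock = threading.Lock()

    def update_fields(self, key, **fields):
        # Merge only the named fields; other fields of the record stay untouched.
        with self._lock:
            self._records.setdefault(key, {}).update(fields)

    def get(self, key):
        # Return a copy so callers cannot mutate the stored record directly.
        with self._lock:
            return dict(self._records.get(key, {}))


store = RecordStore()


def geo_worker():
    # Stands in for the function that extracts geolocation from a picture.
    store.update_fields("img-1", lat=42.36, lon=-71.06)


def vision_worker():
    # Stands in for the image-analysis process (e.g. a TensorFlow job).
    store.update_fields("img-1", faces=2, scene="street")


threads = [threading.Thread(target=worker) for worker in (geo_worker, vision_worker)]
for t in threads:
    t.start()
for t in threads:
    t.join()

record = store.get("img-1")
```

(A real system would use finer-grained concurrency control than one global lock; the lock here only keeps the toy example safe.)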
We basically restructured everything, we took the notions of cloud-native architectures, we took the notions of Flash and latest Flash technologies, a lot of parallelism on CPUs. We didn't take anything for granted on the underlying architecture. >> So when you founded the company, take a personal story here. What was the itch you were scratching, why did you get into this? Obviously, you have a huge tech advantage, which we'll double down on with the research piece, and George will have some questions. What got you going with the company? You got a unique approach, people would love to do away with the pipeline, that sounds great. And the performance, you said about 100x. So how did you get here? (laughs) Tell the story. >> So if you know my background, I ran all the data center activities in Mellanox, and you know Mellanox, I know Kevin was here. And my role was to take Mellanox technology, which is 100 gig networking and silicon, and fit it into the different applications. So I worked with SAP HANA, I worked with Teradata, I worked on Oracle Exadata, I worked with all the cloud service providers on building their own object storage and NoSQL and other solutions. I also owned all the open source activities around Hadoop and Ceph and all those projects, and my role was to fix many of those. If a customer says I don't need 100 gig, it's too fast for me, how do I? And my role was to convince him that yes, I can open up all the bottlenecks all the way up your stack so you can leverage those new technologies. And for that we basically solved inefficiencies in those stacks. >> So you had a good purview of the marketplace. >> Yaron: Yes. >> You had open source on one hand, and then all the-- >> All the storage players, >> vendors, network. >> all the database players and all the cloud service providers were my customers. So you're at a very unique point where you see the trajectory of cloud. 
Doing things totally different, and sometimes I see the trajectory of enterprise storage, SAN, NAS, you know, all Flash, all that, legacy technologies where cloud providers are all about object, key value, NoSQL. And you're trying to convince those guys that maybe they were going the wrong way. But it's pretty hard. >> Are they going the wrong way? >> I think they are going the wrong way. Everyone, for example, is running to do NVMe over Fabric now, that's the new fashion. Okay, I did the first implementation of NVMe over Fabric, in my team at Mellanox. And I really loved it, at that time, but databases cannot run on top of storage area networks. Because there are serialization problems. Okay, if you use a storage area network, that means that every node in the cluster has to go and serialize an operation against the shared media. And that's not how Google and Amazon work. >> There's a lot more databases out there too, and a lot more data sources. You've got the Edge. >> Yeah, but all the new databases, all the modern databases, they basically shard the data across the different nodes so there are no serialization problems. So that's why Oracle doesn't scale, or scales to 10 nodes at best, with a lot of RDMA as a backplane, to allow that. And that's why Amazon can scale to a thousand nodes, or Google-- >> That's the horizontally-scalable piece that's happening. >> Yeah, because, basically, the distribution has to move into the higher layers of the data, and not the lower layers of the data. And that's really the trajectory where the traditional legacy storage and system vendors are going, and we sort of followed the way the cloud guys went, just with our knowledge of the infrastructure, we sort of did it better than what the cloud guys did. 'Cause the cloud guys focused more on the higher levels of the implementation, the algorithms, the Paxos, and all that. Their implementation is not that efficient. And we did both sides extremely efficiently. 
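(Editor's illustration, not part of the spoken interview: the sharding idea described above, where each key is owned by one node so that nodes do not serialize every operation against shared media, can be sketched with simple hash partitioning. This is a generic toy under stated assumptions, not the design of any vendor named in the conversation; `owner_node` and `ShardedStore` are hypothetical names, and each "node" is just a local dictionary.)

```python
import hashlib


def owner_node(key: str, num_nodes: int) -> int:
    """Map a key to the index of the node that owns it, using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class ShardedStore:
    """Toy key-value store whose data is partitioned across independent nodes."""

    def __init__(self, num_nodes: int):
        # Each dict stands in for the private storage of one node.
        self.nodes = [{} for _ in range(num_nodes)]

    def put(self, key: str, value) -> None:
        # Only the owning node is touched; other nodes are not involved.
        self.nodes[owner_node(key, len(self.nodes))][key] = value

    def get(self, key: str):
        return self.nodes[owner_node(key, len(self.nodes))].get(key)


cluster = ShardedStore(num_nodes=4)
for i in range(100):
    cluster.put(f"sensor-{i}", i)
reading = cluster.get("sensor-42")
```

(Plain modulo placement reshuffles most keys when the node count changes; production systems typically use consistent hashing or range partitioning instead.)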
>> How about the Edge? 'Cause Edge is now part of cloud, and you got cloud has got the compute, all the benefits, you were saying, and still they have their own consumption opportunities and challenges that everyone else does. But Edge is now exploding. The combination of those things coming together, at the intersection of that is deep learning, machine learning, which is powering the AI hype. So how is the Edge factoring into your plan and overall architectures for the cloud? >> Yeah, so I wrote a bunch of posts that are not published yet about the Edge, but my analysis, along with your analysis and Peter Levine's analysis, is that cloud has to start distributing more. Because if you're looking at the trends, 5G and Wi-Fi in wireless networking are going to be gigabit traffic. Gigabit to the home, by Google, 70 bucks a month. It's going to push a lot more bandwidth to the Edge. At the same time, cloud providers, in order to lower costs and deal with energy problems, are going to rural areas. The traditional way we solved cloud problems was to put in CDNs, so every time you download a picture or video, you go to a CDN. When you go to Netflix, you don't really go to Amazon, you go to a Netflix PoP, one of 250 locations. The new workloads are different because they're no longer pictures that need to be cached. First, there is a lot of data going up. Sensory data, uploaded files, et cetera. Data is becoming a lot more structured. Sensor data is structured. All this car information will be structured. And you want to (mumbles) digest or summarize the data. So you need technologies like machine learning and AI and all those things. You need something which is like CDNs. Just a mini version of cloud that sits somewhere in between the Edge and the cloud. And this is our approach. And now because we can shrink the mini cloud, the mini Amazon, into a way more dense approach, this is a play that we're going to take. 
We have a very good partnership with Equinix. Which has 170-something locations with very good relations. >> So you're, essentially, going to disrupt the CDN. It's something that I've been writing about and tweeting about. CDNs were based on the old Yahoo days. Caching images, you mentioned, give me 1999 back, please. That's old school, by today's standards. So it's a whole new architecture because of how things are stored. >> You have to be a lot more distributed. >> What is the architecture? >> In our innovation, we have two layers of innovation. One is on the lower layers of, we, actually, have three main innovations. One is on the lower layers of what we discussed. The other one is the security layer, where we classify everything. Layer seven at 100 gig traffic rates. And the third one is all this notion of distributed system. We can, actually, run multiple systems in multiple locations and manage them as one logical entity through high level semantics, high level policies. >> Okay, so when we take the CUBE global, we're going to have you guys on every PoP. This is a legit question. >> No, it's going to take time for us. We're not going to do everything in one day and we're starting with the local problems. >> Yeah, but this is digital transformation. Stay with me for a second. Stay with this scenario. So video like Netflix is, pretty much, one dimension, it's video. They use CDNs now, but when you start thinking in different content types. So, I'm going to have a video with, maybe, just CGI overlaid or social graph data coming in from tweets at the same time with Instagram pictures. I might be accessing multiple data everywhere to watch a movie or something. That would require beyond-CDN thinking. >> And you have to run continuous analytics because it cannot afford batch. It cannot afford a pipeline. Because you ingest picture data, you may need to add some subtext with the data and feed it, directly, to the consumer. 
So you have to move to those two elements of moving more stuff into the Edge and running continuous analytics versus batch and pipeline. >> So you think, based on that scenario I just said, that there's going to be an opportunity for somebody to take over the media landscape for sure? >> Yeah, I think if you're also looking at the statistics. I saw a nice article. I told George about it. Analyzing the Intel chip distribution. What you see is that there is a 30% growth in Intel's chips in cloud, which is faster than what most analysts anticipate in terms of cloud growth. That means, actually, that cloud is going to cannibalize Enterprise faster than what most think. Enterprise is shrinking about 7%. There is another place which is growing. It's Telcos. It's not growing like cloud, but part of it is because of this move towards the Edge and the move of Telcos buying white boxes. >> And 5G and access over the top too. >> Yeah, but that's server chips. >> Okay. >> There's going to be more and more computation in the different Telco locations. >> John: Oh you're talking about compute, okay. >> This is an opportunity that we can capitalize on if we run fast enough. >> It sounds as though because you've implemented these industry standard APIs that come from, largely, the open source ecosystem, that you can propagate those to areas on the network that the vendors who are behind those APIs can't, necessarily, reach. Into the Telcos, towards the Edge. And, I assume, part of that is 'cause of the density and the simplicity. So, essentially, your footprint's smaller in terms of hardware and the operational simplicity is greater. Is that a fair assessment? >> Yes, and also, we support a lot of Amazon-compatible APIs which are RESTful, typically, HTTP based. Very convenient to work with in a cloud environment. 
Another thing is, because we're taking all the state on ourselves, the different forms of state, whether it's a message queue or a table or an object, et cetera, that makes the computation layer very simple. So one of the things that we are, also, demonstrating is the integration we have with Kubernetes that, basically, now simplifies Kubernetes. 'Cause you don't have to build all those different data services for cloud native infrastructure. You just run Kubernetes. We're the volume driver, we're the database, we're the message queues, we're everything underneath Kubernetes, and then you just run Spark or TensorFlow or a serverless function as a Kubernetes microservice. That allows you now, elastically, to increase the number of Spark jobs that you need or, maybe, you have another tenant. You just spin up a Spark job. YARN has some of those attributes, but YARN is very limited, very confined to the Hadoop ecosystem. TensorFlow is not a Hadoop player and a bunch of those new tools are not Hadoop players, and everyone is now adopting a new way of doing streaming and they just call it serverless. Serverless and streaming are very similar technologies. The advantage of serverless is all this pre-packaging and all this automation of the CI/CD. The continuous integration, the continuous development. So we're thinking, in order to simplify the developer and operations aspects, we're trying to integrate more and more with the cloud native approach around CI/CD and integration with Kubernetes and cloud native technologies. >> Would it be fair to say that from a developer or admin point of view, you're pushing out from the cloud towards the Edge faster than the existing implementations, say, the Apache ecosystem or the AWS ecosystem, where AWS has something on the edge. I forgot whether it's Snowball or Greengrass or whatever. Where they at least get the Lambda function. >> They're field by the way and it's interesting to see. 
One of the things, they allowed Lambda functions in their CDNs, which is going in the direction I mentioned, just for a minimal functionality. Another thing is they have those boxes where they have a single VM and they can run Lambda functions as well. But I think their ability to run computation is very limited and also, their focus is on shipping the boxes through mail and we want it to be always connected. >> Our final question for you, just to get your thoughts. Great write-up, by the way. This is very informative. Maybe we should do a follow up on Skype in our studio for the Silicon Friday show. Google Next was interesting. They're serious about the Enterprise but you can see that they're not yet there. What is the Enterprise readiness from your perspective? 'Cause Google has the tech and they try to flaunt the tech. We're great, we're Google, look at us, therefore, you should buy us. It's not that easy in the Enterprise. How would you size up the different players? Because they're all not like Amazon, although Amazon is winning. You got Amazon, Azure and Google. Your thoughts on the cloud players. >> The way we attack Enterprise, we don't attack it from an Enterprise perspective or IT perspective, we take it from a business use case perspective. Especially because we're small and we have to run fast. You need to identify a real critical business problem. We're working with stock exchanges and they have a lot of issues around monitoring the daily trade activities in real time. If you compare what we do with them on this continuous analytics notion to how they work with Excels and Hadoops, it's totally different and now, they could do things which are way different. I think that one of the things that Hadoop's customers, if Google wants to succeed against Amazon, they have to find the way of how to approach those business owners and say here's a problem, Mr. Customer, here's a business challenge, here's what I'm going to solve. If they're just going to say, you know what? 
My VMs are cheaper than Amazon, it's not going to be a-- >> Also, they're doing the whole, what they're calling lift and shift, which is code word for rip and replace in the Enterprise. So that's, essentially, I guess, a good opportunity if you can get people to do that, but not everyone's ripping and replacing and lifting and shifting. >> But a lot of Google's advantages are around areas of AI and things like that. So they should try and leverage that. If you think about Amazon's approach to AI, they fund a university to build a project and then say it's ours, where Google created TensorFlow and created a lot of other IPs and Dataflow and all those solutions and contributed them to the community. I really love Google's approach of contributing Kubernetes, of contributing TensorFlow. And this way, they're planting the seeds so the new generation that is going to work with Kubernetes and TensorFlow is going to say, "You know what? Why would I mess with this thing on (mumbles), just go and--" >> Regular cloud, do multi-cloud. >> Right to the cloud. But I think a lot of the criticism about Google is that they're too research oriented. They don't know how to monetize and approach the-- >> Enterprise is just a whole different drum beat and I think that's the only thing on my complaint with them, they got to get that knowledge and/or buy companies. Have a quick final point on Spanner, or any analysis of Spanner, that went, pretty quickly, from paper to product. >> So before we started iguazio, I studied Spanner quite a bit. All the publications were there, and all the other things like Spanner. Spanner has the underlying layer called Colossus. And our data layer is very similar to how Colossus works. So we're very familiar. We took a lot of concepts from Spanner on our platform. >> And you like Spanner, it's legit? >> Yes, again. >> 'Cause you copied it. (laughs) >> Yaron: We haven't copied-- >> You borrowed some best practices. 
>> I think I cited about 300 research papers before we did the architecture. But we, basically, took the best of each one of them. 'Cause there's still a lot of issues. Most of those technologies, by the way, are designed for mechanical disks and we can talk about it in a different-- >> And you have Flash. Alright, Yaron, we have gone over here. Great segment. We're here, live in Silicon Valley, breaking it down, getting under the hood. Looking at 10X, 100X performance advantages. Keep an eye on iguazio, they're looking like they got some great products. Check them out. This is the CUBE. I'm John Furrier with George Gilbert. We'll be back with more after this short break. (upbeat synthesizer music)

Published Date : Mar 14 2017


