
John Thomas, IBM | IBM CDO Summit Spring 2018


 

>> Narrator: Live from downtown San Francisco, it's theCUBE, covering IBM Chief Data Officer Strategy Summit 2018, brought to you by IBM.
>> We're back in San Francisco, we're here at the Parc 55 at the IBM Chief Data Officer Strategy Summit. You're watching theCUBE, the leader in live tech coverage. My name is Dave Vellante. IBM holds its Chief Data Officer Strategy Summits on both coasts, one in Boston and one in San Francisco, a couple of times each year, with about 150 chief data officers coming in to learn how to apply their craft, learn what IBM is doing, and share ideas. Great peer networking, really senior audience. John Thomas is here, he's a distinguished engineer and director at IBM. Good to see you again, John.
>> Same to you.
>> Thanks for coming back in theCUBE. So let's start with your role, distinguished engineer. We've had this conversation before, but it just doesn't happen overnight, you've got to be accomplished, so congratulations on achieving that milestone. But what is your role?
>> The road to distinguished engineer is long, but these days I spend a lot of my time working on data science, and in fact am part of what is called a data science elite team. We work with clients on data science engagements. This is not consulting, this is not services; this is where a team of data scientists works collaboratively with a client on a specific use case and we build it out together. We bring data science expertise, machine learning, deep learning expertise. We work with the business and build out a set of tangible assets that are relevant to that particular client.
>> So this is not a for-pay service. This is, hey, you're a great customer, a great client of ours, we're going to bring together some resources, you'll learn, we'll learn, we'll grow together, right?
>> This is an investment IBM is making. It's a major investment for our top clients, working with them on their use cases.
>> This is a global initiative?
>> This is global, yes.
>> We're talking about, what, hundreds of clients, thousands of clients?
>> Well, eventually thousands, but we're starting small. We are trying to scale now, so obviously once you get into these engagements, you find out that it's not just about building some models. There are a lot of challenges that you've got to deal with in an enterprise setting.
>> Dave: What are some of the challenges?
>> Well, in any data science engagement, the first thing is to have clarity on the use case that you're engaging in. You don't want to build models for models' sake. Just because TensorFlow or scikit-learn is great at building models, that doesn't serve a purpose. That's the first thing: do you have clarity on the business use case itself? Then comes data. Now, I cannot stress this enough, Dave: there is no data science without data. And you might think this is the most obvious thing, of course there has to be data, but when I say data, I'm talking about access to the right data. Do we have governance over the data? Do we know who touched the data? Do we have lineage on that data? Because garbage in, garbage out, you know this. Do we have access to the right data, in the right control setting, for the machine learning models we build? These are challenges, and then there's another challenge around, okay, I built my models, but how do I operationalize them? How do I weave those models into the fabric of my business? So these are all challenges that we have to deal with.
>> That's interesting, what you're saying about the data. It does sound obvious, but having the right data model matters as well. I think about when I interact with Netflix: I don't talk to their customer service department or their marketing department or their sales department or their billing department, it's one experience.
>> You just have an experience, exactly.
>> This notion of incumbent disruptors, is that a logical starting point for these guys, to get to that point where they have a data model that is a single data model?
>> Single data model. (laughs)
>> Dave: What does that mean, right? At least from an experience standpoint.
>> Once we know this is the kind of experience we want to target, what are the relevant data sets and data pieces that are necessary to make that experience happen or come together? Sometimes there's core enterprise data that you have, and in many cases it has been augmented with external data. Do you have a strategy around handling your internal and external data, your structured transactional data, your semi-structured data, your newsfeeds? All of these need to come together in a consistent fashion for that experience to be true. It is not just about "I've got my credit card transaction data"; what else is augmenting that data? You need a model, you need a strategy around that.
>> I talk to a lot of organizations and they say, we have a good back-end reporting system, we have Cognos, we can build cubes and all kinds of financial data that we have, but then it doesn't get down to the front line. We talk about instrumenting the front line, IoT, and that portends change there, but there's a lot of data that either isn't persisted, isn't stored, or doesn't even exist. So is that one of the challenges that you see enterprises dealing with?
>> It is a challenge. Do I have access to the right data, whether that is data at rest or in motion? Am I persisting it in a way I can consume it later? Or am I just moving big volumes of data around because analytics is there, or machine learning is there, and I have to move data out of my core systems into that area? That is just a waste of time, complexity, cost, hidden costs often, 'cause people don't usually think about the hidden costs of moving large volumes of data around. But instead of that, can I bring analytics and machine learning and data science itself to where my data is, not necessarily move it around all the time? Whether you're dealing with streaming data or large volumes of data in your Hadoop environment or mainframes or whatever, can I do ML in place and get the most value out of the data that is there?
>> What's happening with all that Hadoop? Nobody talks about Hadoop anymore. Hadoop largely became a way to store data for less, but there's all this data now and a data lake. How are customers dealing with that?
>> This is such an interesting thing. People used to talk about big data, you're right, and we jumped from there to the cognitive era. But it's not like that, right? Without the data there is no cognition, there is no AI, there is no ML. In terms of existing investments in Hadoop, for example, you have to absolutely be able to tap in and leverage those investments. For example, many large clients have investments in large Cloudera or Hortonworks environments, or Hadoop environments, so if you're doing data science, how do you push down, how do you leverage that for scale, for example? How do you access the data using the same access control mechanisms that are already in place? Maybe you have Kerberos as your mechanism, how do you work with that? How do you avoid moving data off of that environment? How do you push down data prep into the Spark cluster? How do you do model training in that Spark cluster? All of these become important in terms of leveraging your existing investments. It is not just about accessing data where it is, it's also about leveraging the scale that the company has already invested in. You have hundred- or 500-node Hadoop clusters; well, make the most of them in terms of scaling your data science operations. So push down and access data as much as possible in those environments.
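A minimal PySpark sketch of the "push down, don't move the data" pattern described here: the prep and the training both run inside the existing cluster, against data that never leaves HDFS. The paths and column names are hypothetical, and Kerberos authentication is assumed to be handled outside the job (for example, a keytab-based kinit before spark-submit), so the code inherits the cluster's existing access controls rather than re-implementing them.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("prep-and-train-in-place").getOrCreate()

# Read directly from the governed HDFS location -- no copy to a laptop or a
# separate analytics environment.
txns = spark.read.parquet("hdfs:///data/curated/transactions")  # hypothetical path

# Data prep pushed down into the Spark cluster.
prepped = (txns
           .filter(F.col("amount") > 0)
           .withColumn("log_amount", F.log1p("amount")))

# Model training in the same cluster, using the scale already invested in.
assembled = VectorAssembler(
    inputCols=["log_amount", "merchant_risk_score"],  # hypothetical columns
    outputCol="features").transform(prepped)

model = LogisticRegression(labelCol="is_fraud").fit(assembled)
model.write().overwrite().save("hdfs:///models/fraud/v1")  # stays in the cluster

spark.stop()
```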
>> So Beth talked today, Beth Smith, about Watson's law, and she made a little joke about that, but to me it's poignant, because we are entering a new era. For decades this industry marched to the cadence of Moore's law, then of course Metcalfe's law in the internet era. I want to make an observation and see if it resonates. It seems like innovation is no longer going to come from doubling microprocessor speed, and the network is there, it's built out, the internet is built. It seems like innovation comes from applying AI to data to get insights, and then being able to scale, so it's cloud economics: marginal costs go to zero, massive network effects, and scale, the ability to track innovation. That seems to be the innovation equation, but how do you operationalize that?
>> To your point, Dave, when we say cloud scale, we want the flexibility to do that in an off-prem public cloud, or in a private cloud, or in between, in a hybrid cloud environment. When you talk about operationalizing, there are a couple of different things. People think, say I've got a super Python programmer and he's great with TensorFlow or scikit-learn or whatever, and he builds these models, great. But what happens next? How do you actually operationalize those models? You need to be able to deploy those models easily. You need to be able to consume those models easily. For example, you have a chatbot; a chatbot is dumb until it actually calls these machine learning models in real time to make decisions on which way the conversation should go. So how do you make that chatbot intelligent? It's when it consumes the ML models that have been built. So deploying models, consuming models: you create a model, you deploy it, you've got to push it through the development, test, staging, production phases, just the same rigor that you would have for any application that is deployed. Then another thing is, a model is great on day one. Let's say I built a fraud detection model; it works great on day one. A week later, a month later, it's useless, because the data that it trained on is not what the fraudsters are using now. So patterns have changed and the model needs to be retrained. How do I ensure the performance of the model stays good over time? How do I do monitoring? How do I retrain the models? How do I do the life cycle management of the models, and then scale? Which is, okay, I deployed this model out and it's great, every application is calling it, maybe I have partners calling these models. How do I automatically scale, whatever I am using behind the scenes, or if I am going to use external clusters for scale? Technologies like Spectrum Conductor from our HPC background are very interesting counterparts to this. How do I scale? How do I burst? How do I go from an on-prem to an off-prem environment? How do I build something behind the firewall but deploy it into the cloud? We have a chatbot or some other cloud-native application; all of these things become interesting in the operationalizing.
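A rough sketch of the monitor-and-retrain loop described above, using scikit-learn since the conversation already names it. The metric, the 0.80 threshold, and the retraining policy are illustrative assumptions; in practice this logic would sit inside the same dev/test/staging/production pipeline John mentions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

ALERT_THRESHOLD = 0.80  # hypothetical minimum acceptable AUC


def model_has_decayed(model, X_recent, y_recent):
    """Score the deployed model on freshly labeled data to detect drift."""
    auc = roc_auc_score(y_recent, model.predict_proba(X_recent)[:, 1])
    return auc < ALERT_THRESHOLD


def monitor_and_retrain(model, X_recent, y_recent, X_fresh, y_fresh):
    """Keep the day-one model only while it still performs; otherwise retrain."""
    if not model_has_decayed(model, X_recent, y_recent):
        return model  # still healthy, keep serving it
    # Patterns have changed (e.g. new fraud tactics): retrain on current data.
    replacement = RandomForestClassifier(n_estimators=200, random_state=0)
    replacement.fit(X_fresh, y_fresh)
    return replacement
```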
>> So how do all these conversations that you're having with these global elite clients, and the challenges that you're unpacking, how do they get back into innovation for IBM? What's that process like?
>> It's an interesting place to be in, because I am hearing and experiencing firsthand real enterprise challenges, and where we see our product doesn't handle a particular thing, that is an immediate circling back with offering management and development: hey guys, we need this particular function, because I'm seeing this happening again and again in customer engagements. So that helps us shape our products, shape our data science offerings. Rather than just running with the flow of what everyone else is doing, we look at: what do our clients want? Where are they headed? And shape the products that way.
>> Excellent. Well, John, thanks very much for coming back in theCUBE, it's a pleasure to see you again. I appreciate your time.
>> Thank you, Dave.
>> All right, good to see you. Keep it right there, everybody, we'll be back with our next guest. We're live from the IBM CDO Strategy Summit in San Francisco. You're watching theCUBE.

Published Date: May 1, 2018


SENTIMENT ANALYSIS:

ENTITIES

Entity | Category | Confidence
Dave Vellante | PERSON | 0.99+
Dave | PERSON | 0.99+
IBM | ORGANIZATION | 0.99+
John | PERSON | 0.99+
John Thomas | PERSON | 0.99+
Boston | LOCATION | 0.99+
Beth Smith | PERSON | 0.99+
San Francisco | LOCATION | 0.99+
Beth | PERSON | 0.99+
Netflix | ORGANIZATION | 0.99+
one | QUANTITY | 0.99+
A week later | DATE | 0.99+
a month later | DATE | 0.99+
thousands | QUANTITY | 0.99+
Hadoop | TITLE | 0.99+
Watson | PERSON | 0.99+
one experience | QUANTITY | 0.99+
Moore | PERSON | 0.98+
today | DATE | 0.98+
Python | TITLE | 0.98+
Metcalfe | PERSON | 0.98+
Parc 55 | LOCATION | 0.97+
both coasts | QUANTITY | 0.97+
zero | QUANTITY | 0.96+
Single | QUANTITY | 0.96+
about 150 chief data officers | QUANTITY | 0.96+
day one | QUANTITY | 0.94+
Cognos | ORGANIZATION | 0.94+
each year | QUANTITY | 0.93+
hundreds of clients | QUANTITY | 0.92+
Hortonworks | ORGANIZATION | 0.91+
first thing | QUANTITY | 0.9+
Tensorflow | TITLE | 0.9+
IBM CDO Summit | EVENT | 0.87+
Strategy Summit | EVENT | 0.86+
hundred, 500 node Hadoop clusters | QUANTITY | 0.85+
thousands of clients | QUANTITY | 0.84+
single data model | QUANTITY | 0.81+
Strategy Summit 2018 | EVENT | 0.81+
Chief Data Officer | EVENT | 0.79+
IBM CDO strategy summit | EVENT | 0.79+
Chief Data Officer Strategy Summit | EVENT | 0.79+
couple times | QUANTITY | 0.77+
Cloudera | ORGANIZATION | 0.75+
decades | QUANTITY | 0.74+
Spring 2018 | DATE | 0.72+
Data Officer | EVENT | 0.67+
Carbros | ORGANIZATION | 0.63+
Tensorflow | ORGANIZATION | 0.61+
scikit | ORGANIZATION | 0.58+
theCUBE | ORGANIZATION | 0.58+

Yaron Haviv, iguazio | BigData NYC 2017


 

>> Announcer: Live from midtown Manhattan, it's theCUBE, covering BigData New York City 2017, brought to you by SiliconANGLE Media and its ecosystem sponsors.
>> Okay, welcome back everyone, we're live in New York City. This is theCUBE's coverage of BigData NYC, our own event that we've been running for five years now. We've been at Hadoop World since 2010, it's our eighth year covering the Hadoop World, which has evolved into Strata Conference, Strata Hadoop, now called Strata Data, and of course it's bigger than just Strata, it's about big data in NYC. A lot of big players here inside theCUBE, thought leaders, entrepreneurs, and great guests. I'm John Furrier, the cohost this week with Jim Kobielus, who's the lead analyst on our BigData and our Wikibon team. Our next guest is Yaron Haviv, who's with iguazio, he's the founder and CTO, hot startup here at the show, making a lot of waves on their new platform. Welcome to theCUBE, good to see you again, congratulations.
>> Yes, thanks, thanks very much. We're happy to be here again.
>> You're known in theCUBE community as the guy on Twitter who's always pinging me and Dave and team, saying, "Hey, you know, you guys got to get that right." You really are one of the smartest guys on the network in our community, you're super-smart, your team has got great tech chops, and in the middle of all that is the hottest market, which is cloud native: cloud native as it relates to the integration of how apps are being built, and essentially new ways of engineering around these solutions, not just repackaging old stuff. It's really about putting things in a true cloud environment, with application development, with data at the center of it, and you've got a whole complex platform you've introduced. So really, really want to dig into this. Before we get into some of my pointed questions, and I know Jim's got a ton of questions, give us an update on what's going on. You guys got some news here at the show, let's get to that first.
>> So since the last time we spoke, we had tons of news. We're making revenues, we have customers, we've just recently GA'ed, and we recently got significant investment from major investors: we raised about $33 million from companies like Verizon Ventures, Bosch, you know, for IoT, Chicago Mercantile Exchange, which is Dow Jones and other properties, Dell EMC. So pretty broad.
>> John: So customers, pretty much.
>> Yeah, so that's the interesting thing. Usually, you know, investors are sort of strategic investors or partners or potential buyers, but here it's essentially our customers, because it's so strategic to their business.
>> Let's go with the GA of the projects. Just get into what's shipping, what's available, what's the general availability, what are you now offering?
>> So iguazio is trying to, you know, you alluded to cloud native and all that. Usually when you go to events like Strata and BigData, it's nothing to do with cloud native: a lot of hard labor, not really continuous development and integration, it's continuous hard work. And essentially what we did, we created a data platform which is extremely fast and integrated, you know, has all the different forms of state, streaming and events and documents and tables and all that, in a very unique architecture; won't dive into that today. And on top of it we've integrated cloud services like Kubernetes and serverless functionality and others, so we can essentially create a hybrid cloud.
So some of our customers even deploy some portions as OpEx-based services in the cloud, and some portions at the edge or in the enterprise, deploying the software or even a prepackaged appliance. So we're the only ones that provide a full hybrid experience.
>> John: Is this a SaaS product?
>> So it's a software stack, and it could be delivered in three different options. One, if you don't want to mess with the hardware, you can just rent it, and it's deployed in an Equinix facility; we have very strong partnerships with them globally. If you want to have something on-prem, you can get a software reference architecture, and you go and deploy it. If you're a telco or an IoT player that wants a manufacturing facility, we have a very small 2U box, four servers, four GPUs, all the analytics tech you could think of. You just put it in the factory instead of, like, two racks of Hadoop.
>> So you're not general purpose, you're just whatever the customer wants to deploy the stack, their flexibility is on them.
>> Yeah.
>> You have a hosting solution?
>> Now, it is an appliance even when you deploy it on-prem. It's a bunch of Docker containers inside, and you don't even touch them, you don't SSH to the machine. You have APIs and you have UIs, and just like the cloud experience when you go to Amazon, you don't open the kimono, you know, you just use it. So that's our experience, that's what we're telling customers: no root access problems, no security problems, it's a hardened system. Give us servers, we'll deploy it, and you go through consoles and UIs.
>> You don't host anything for anyone?
>> We host for some customers, including
>> So you do whatever the customer was interested in doing?
>> Yes. (laughs)
>> So you're flexible, okay.
>> We just want to make money.
>> You're pretty good, sticking to the product. So on the GA, so here essentially in the big data world you mentioned that there are data layers, like a data piece. So I've got to ask you the question, so pretend I'm an idiot for a second, right.
>> Yaron: Okay.
>> Okay, yeah.
>> No, you're a smart guy.
>> What problem are you solving? So we'll just go to the simple. I love what you're doing, I assume you guys are super-smart, which I can say you are, but what's the problem you're solving, what's in it for me?
>> Okay, so there are two problems. One is the challenge: everyone wants to transform, you know, there is this digital transformation mantra. And it means essentially two things. One is, I want to automate my operations environment so I can cut costs and be more competitive. The other one is, I want to improve my customer engagement: I want to do mobile apps which are smarter, get more direct content to the user, get more targeted functionality, et cetera. These are the two key challenges for every business, any industry, okay? So they go and they deploy Hadoop and Hive and all that stuff, and it takes them two years to productize it. And then they get to the data science bit, and by the time they finish, they understand that this Hadoop thing can only do one thing: queries, and reporting and BI, and data warehousing. How do you do actionable insights from that stuff, okay? 'Cause actionable insights means I get information from the mobile app, and then I translate it into some action. I have to enrich the vectors, the machine learning, all those details, and then I need to respond. Hadoop doesn't know how to do it.
>> So the first generation is people that pulled a lot of stuff into the data lake and started querying it and generating reports. And the boss said
>> Low-cost data lake, basically, is what you're saying.
>> Yes, and the boss said, "Okay, what are we going to do with this report? Is it generating any revenue for the business?" No. The only revenue generation is if you take this data
>> You're fired, exactly.
>> No, not all fired, but now
>> John: Look at the budget
>> Now they're starting to buy our stuff. So now the point is, okay, how can I take all this data and at the same time generate actions, and also deal with the production aspects of: I want to develop in a beta phase, I want to promote it into production. That's cloud-native architectures, okay? Hadoop is not cloud. How do I take a Spark or Zeppelin notebook, you know, and turn it into production? There's no way to do that.
>> By the way, depending on which cloud you go to, they have a different mechanism and elements for each cloud.
>> Yeah, so the cloud providers do address that, because they are selling the package,
>> Spans all the clouds, yeah.
>> Yeah, so cloud providers are starting to have their own offerings, which are all proprietary, around: forget about HDFS, we'll have S3, and we'll have Redshift for you, and we'll have Athena, and again you're starting to consume that as a service. It still doesn't address the continuous analytics challenge that people have. And if you're looking at what we've done with Grab, which is amazing, they started with using Amazon services, S3, Redshift, you know, Kinesis, all that stuff, and it took them about two hours to generate the insights. Now the problem is they want to do driver incentives in real time. They want to incent the driver to go and make more rides or other things, so they have to analyze the event of the location of the driver, the event of the location of the customers, and just throw messages back based on analytics. So that's real-time analytics, and that's not something that you can do
>> They've got to build that from scratch right away. I mean, they can't do that with the existing.
>> No, and Uber invested tons of energy around that and they don't get the same functionality. Another unique feature that we talk about in our PR
>> This is for the use case you're talking about, this is the Grab, which is the car
>> Grab is the number one ride-sharing company in Asia, which is bigger than Uber in Asia, and they're using our platform. By the way, even Uber doesn't really use Hadoop, they use MemSQL for that stuff, so it's not really using open source and all that. But the point is, for example, with Uber, when they monetize the rides, they do it just based on demand, okay. And with Grab, now, because of the capability that we can intersect tons of data in real time, they can also look at the weather, or whether there was a terror attack or something like that. They don't want to raise the price
>> A lot of other data points, could be traffic
>> They don't want to raise the price if there was a problem, you know, and all the customers get aggravated. This is actually intersecting data in real time, and no one today can do that in real time beyond what we can do.
>> A lot of people have semantic problems with real time, they don't even know what they mean by real time.
>> Yaron: Yes.
>> The data could be a week old, but they can get it to them in real time.
>> But every decision, if you generalize around the problem, okay, and we have slides on that that I explain to customers: every time I run analytics, I need to look at four types of data. The first is the context, the event: what happened, okay. The second type of data is the previous state. Like, I have a car, was it up or down, what's the previous state of that element? The third element is the time aggregation: what happened in the last hour, the average temperature, the average, you know, ticker price for the stock, et cetera, okay? And the fourth thing is enriched data: I have a car ID, but what's the make, what's the model, who's driving it right now. That's secondary data. So every time I run a machine learning task or any decision, I have to collect all those four types of data into one vector, it's called a feature vector, and take a decision on that. You take Kafka, it's only the event part, okay; you take MemSQL, it's only the state part; you take Hadoop, it's only like historical stuff. How do you assemble and stitch a feature vector?
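A small sketch of stitching those four kinds of data into one feature vector. The in-memory dicts stand in for whatever actually holds each piece in a real system (a stream for events, a key-value store for state, a time-series aggregate, a reference table for enrichment); every name and value here is illustrative.

```python
from statistics import mean

state_store = {"car-17": {"status": "up"}}                          # previous state
history_store = {"car-17": [71.2, 73.0, 74.5]}                      # last hour's readings
reference_data = {"car-17": {"make": "Acme", "model": "Roadster"}}  # enrichment

def build_feature_vector(event):
    """Assemble event context + previous state + time aggregation + enrichment."""
    car_id = event["car_id"]
    prev_up = state_store.get(car_id, {}).get("status") == "up"
    avg_temp = mean(history_store.get(car_id, [event["temperature"]]))
    model_name = reference_data.get(car_id, {}).get("model", "")
    return [
        event["temperature"],            # 1. the event itself (context)
        1.0 if prev_up else 0.0,         # 2. previous state
        avg_temp,                        # 3. time aggregation over the last hour
        float(hash(model_name) % 100),   # 4. enriched attribute, crudely encoded
    ]

# One incoming event yields one vector for the model to take a decision on.
vector = build_feature_vector({"car_id": "car-17", "temperature": 76.1})
print(vector)
```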
>> Well, you talked about a complex machine learning pipeline, so clearly you're talking about a hybrid
>> It's a prediction. And actions based on just dumb things, like the car broke and I need to send it to a garage, I don't need machine learning for that.
>> So within your environment then, do you enable the machine learning models to execute across the different data platforms of which this hybrid environment is composed, and then do you aggregate the results of those models' runs into some larger model that drives the real-time decision?
>> In our solution, everything is a document, so even a picture is a document, a lot of things. So you can essentially throw in a picture, run TensorFlow, embed more features into the document, and then query those features on another platform. So that's really what makes this continuous analytics extremely flexible, and that's what we give customers. The first thing is simplicity. They can now build applications; you know, we have a tier-one automotive customer now, the CIO coming, meeting us. Normally with a project like that, it's one year, hiring dozens of people, it's hugely complex. We said: tell us the use case, and we'll build a prototype.
>> John: All right, well I'm going to
>> One week. We gave them a prototype, and he was amazed how in one week we created an application that analyzed all the streams of data from the cars, did enrichment, did machine learning, and provided predictions.
>> Well, we're going to have to come in and test you on this, because I'm skeptical, but here's why.
>> Everyone is.
>> We'll get to that. I mean, I'm probably not skeptical, but I kind of am, because the history is pretty clear. If you look at some of the big ideas out there, like OpenStack, I mean, that thing just morphed into a beast. Hadoop was a cost-of-ownership nightmare, as you mentioned early on. So people have been conceptually correct on what they were trying to do, but trying to get it done was always hard, and then it took a long time to kind of figure out the operational model. So how are you different, if I'm going to play the skeptic here? You know, I've heard this before. How are you different than, say, OpenStack or Hadoop clusters, 'cause that was a nightmare: cost of ownership, I couldn't get the type of value I needed, lost my budget. Why aren't you the same?
>> Okay, that's interesting. I don't know if you know, but I ran a lot of development for OpenStack when I was at Mellanox, and for Hadoop, so I patched a lot of those
>> So do you agree with what I said? That that was a problem?
>> They are extremely complex, yes. And I think one of the things is that OpenStack first tried to bite off too much, and it's sort of a huge tent, everyone tries to push his agenda. OpenStack is still an infrastructure layer, okay. And Hadoop is sort of something in between an infrastructure and an application layer, but it was designed 10 years ago, where the problem that Hadoop tried to solve is how do you do web ranking, okay, on tons of batch data. And then the ecosystem evolved into real time, and streaming, and machine learning.
>> A data warehousing alternative or whatever.
>> So it doesn't fit the original model of batch processing, 'cause if an event comes from the car or an IoT device, and you have to do something with it, you need a table with an index. You can't just go and build a huge Parquet file.
>> You know, you're talking about complexity
>> John: That's why he's different.
>> Go ahead.
>> So what we've done with our team, after knowing OpenStack and all those
>> John: All the scar tissue.
>> And all the scar tissues, and my role was also working with all the cloud service providers, so I know their internal architecture, and I worked on SAP HANA and Exadata and all those things, so we learned from the bad experiences. We said, let's forget about the lower layers, which is what OpenStack is trying to provide, providing you infrastructure as a service. Let's focus on the application, and build from the application all the way down to the flash, and the CPU instruction set, and the adapters and the networking, okay. That's what's different. So what we provide is an application and service experience. We don't provide infrastructure. If you go buy VMware and Nutanix, all those offerings, you get infrastructure. Now you go and build, with dozens of DevOps guys, all the stack above. You go to Amazon, you get services. It's just that they're not the most optimized in terms of the implementation, because they also have dozens of independent projects where each one takes a VM and starts writing some
>> But they're still a good service, but you've got to put it together.
>> Yeah, right. But also the way they implement: because in order for them to scale, they have a common layer founded on VMs, and then they're starting to build up applications, so it's inefficient. And also a lot of it is built on a 10-year-old baseline architecture. We've designed for a very modern architecture: it's all parallel CPUs with 30 cores, you know, flash and NVMe. And so we've avoided a lot of the hardware challenges, and serialization, and just provide an abstraction layer pretty much like a cloud on top.
>> Now, in terms of abstraction layers in the cloud, they're efficient, and provide a simplification experience for developers. Serverless computing is up and coming, it's an important approach; of course we have the public clouds from AWS and Google and IBM and Microsoft. There is a growing range of serverless computing frameworks for prem-based deployment. I believe you are behind one. Can you talk about what you're doing at iguazio on serverless frameworks for on-prem or public?
>> Yes. It's the first time I'm very active in the CNCF, the Cloud Native Computing Foundation.
I'm one of the authors of the serverless white paper, which tries to normalize the definitions of all the vendors and come up with a proposal for an interoperable standard. So I spent a lot of energy on that, 'cause we don't want to lock customers to an API. What's unique, by the way, about our solution: we don't have a single proprietary API. We just emulate all the other guys' stuff. We have all the Amazon APIs for data services, like Kinesis, Dynamo, S3, et cetera. We have the open source APIs, like Kafka. So also on the serverless side, my agenda is to promote the idea that if I'm writing to Azure or AWS or iguazio, I don't need to change my app, and I can use any developer tools. So that's my effort there. And recently, a few weeks ago, we launched our open source project, which is a sort of second generation of something we had before, called Nuclio. It's designed for real time
>> John: How do you spell that?
>> N-U-C-L-I-O. I even have the logo
>> He's got a nice slick here.
>> It's really fast because it's
>> John: Nuclio, so that's open source that you guys just sponsor and it's all code out in the open?
>> All the code is in the open, pretty cool, has a lot of innovative ideas on how to do stream processing best, 'cause the original serverless functionality was designed around web hooks and HTTP, and even many of the open source projects are really designed around HTTP serving.
>> I have a question. I'm doing research for Wikibon on the area of serverless; in fact, we've recently published a report on serverless. And in terms of hybrid cloud environments, I'm not seeing yet any hybrid serverless clouds that involve public serverless, you know, like AWS Lambda, and private on-prem deployment of serverless. Do you have any customers who are doing that, or interested in hybridizing serverless across public and private?
>> Of course, and we have some patents I don't want to go into, but the general idea is, what we've done in Nuclio is also the decoupling of the data from the computation, which means that things can sort of be disjoined. You can run a function on a Raspberry Pi, and the data will be in a different place, and those things can sort of move, okay.
>> So the persistence has to happen outside the serverless environment, like in the application itself?
>> Outside of the function; the function accesses the persistence layer through APIs, okay. And how this data persistence is materialized, that's a separate thing. So you can actually write the same function that will run against Kafka or Kinesis or private MQ or HTTP without modifying the function, and ad hoc, through what we call function bindings, you define what's going to be the thing driving the data, or storing the data. So you can actually write the same function that does an ETL job from table one to table two. You don't need to put the table information in the function, which is not the thing that Lambda does. And it's about a hundred times faster than Lambda; we do 400,000 events per second in Nuclio. So if you write your serverless code in Nuclio, it's faster than writing it yourself, because of all those low-level optimizations.
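A minimal sketch of what a Nuclio-style Python handler looks like, illustrating the decoupling Yaron describes: the function body only sees a generic event and context, while the trigger (Kafka, Kinesis, HTTP, and so on) and any data bindings are declared outside the code in the function's configuration, so the same logic can be re-wired without modification. The record fields and the transformation are invented for illustration.

```python
import json

def handler(context, event):
    # The event arrives the same way regardless of which trigger fed it --
    # the binding to Kafka/Kinesis/HTTP lives in configuration, not here.
    record = json.loads(event.body)

    # Illustrative transformation, e.g. one step of the "table one to
    # table two" ETL mentioned above; no source or destination is hardcoded.
    record["normalized_amount"] = float(record.get("amount", 0)) / 100.0

    context.logger.info(f"processed record {record.get('id')}")
    return context.Response(body=json.dumps(record),
                            content_type="application/json",
                            status_code=200)
```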
>> Yaron, thanks for coming on theCUBE. We want to do a deeper dive, love to have you out in Palo Alto next time you're in town. Let us know when you're in Silicon Valley for sure, and we'll make sure we get you on camera for multiple sessions.
>> And more information at re:Invent.
>> Go to re:Invent, we're looking forward to seeing you there. Love the continuous analytics message. I think continuous integration is going through a massive renaissance right now, you're starting to see new approaches, and I think the things you're doing are exactly along the lines of what the world wants, which is alternatives and innovation. Thanks for sharing on theCUBE.
>> Great.
>> That's very great.
>> This is theCUBE's coverage of the hot startups here at BigData NYC, live coverage from New York. I'm John Furrier with Jim Kobielus; we'll be right back after this short break.

Published Date: Sep 27, 2017


SENTIMENT ANALYSIS:

ENTITIES

Entity | Category | Confidence
Jim Kobielus | PERSON | 0.99+
Microsoft | ORGANIZATION | 0.99+
IBM | ORGANIZATION | 0.99+
Bosch | ORGANIZATION | 0.99+
Uber | ORGANIZATION | 0.99+
John | PERSON | 0.99+
John Furrier | PERSON | 0.99+
Verizon Ventures | ORGANIZATION | 0.99+
Yaron Haviv | PERSON | 0.99+
Asia | LOCATION | 0.99+
NYC | LOCATION | 0.99+
Google | ORGANIZATION | 0.99+
New York City | LOCATION | 0.99+
Jim | PERSON | 0.99+
Palo Alto | LOCATION | 0.99+
30 cores | QUANTITY | 0.99+
New York | LOCATION | 0.99+
AWS | ORGANIZATION | 0.99+
two years | QUANTITY | 0.99+
BigData | ORGANIZATION | 0.99+
Silicon Valley | LOCATION | 0.99+
Amazon | ORGANIZATION | 0.99+
five years | QUANTITY | 0.99+
two problems | QUANTITY | 0.99+
Dell EMC | ORGANIZATION | 0.99+
Yaron | PERSON | 0.99+
One | QUANTITY | 0.99+
Dave | PERSON | 0.99+
Kafka | TITLE | 0.99+
third element | QUANTITY | 0.99+
SiliconANGLE Media | ORGANIZATION | 0.99+
Dow Jones | ORGANIZATION | 0.99+
two things | QUANTITY | 0.99+
two racks | QUANTITY | 0.99+
today | DATE | 0.99+
Grab | ORGANIZATION | 0.99+
Nuclio | TITLE | 0.99+
two key challenges | QUANTITY | 0.99+
Cloud Native Foundation | ORGANIZATION | 0.99+
about $33 million | QUANTITY | 0.99+
eighth year | QUANTITY | 0.99+
Hadoop | TITLE | 0.98+
second type | QUANTITY | 0.98+
Lambda | TITLE | 0.98+
10 years ago | DATE | 0.98+
each cloud | QUANTITY | 0.98+
Strata Conference | EVENT | 0.98+
Equanix | LOCATION | 0.98+
10-year-old | QUANTITY | 0.98+
first thing | QUANTITY | 0.98+
first generation | QUANTITY | 0.98+
one | QUANTITY | 0.98+
second generation | QUANTITY | 0.98+
Hadoop World | EVENT | 0.98+
first time | QUANTITY | 0.98+
theCUBE | ORGANIZATION | 0.97+
Nutanix | ORGANIZATION | 0.97+
MemSQL | TITLE | 0.97+
each one | QUANTITY | 0.97+
2010 | DATE | 0.97+
Kinesis | TITLE | 0.97+
SAS | ORGANIZATION | 0.96+
Wikibon | ORGANIZATION | 0.96+
Chicago Mercantile Exchange | ORGANIZATION | 0.96+
about two hours | QUANTITY | 0.96+
this week | DATE | 0.96+
one thing | QUANTITY | 0.95+
dozen | QUANTITY | 0.95+

Nenshad Bardoliwalla, Paxata - #BigDataNYC 2016 - #theCUBE


 

>> Voiceover: Live from New York, it's The Cube, covering Big Data New York City 2016. Brought to you by headline sponsors Cisco, IBM, Nvidia, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and George Gilbert.
>> Welcome back to New York City, everybody. Nenshad Bardoliwalla is here, he's the co-founder and chief product officer at Paxata, a company that, three years ago, I want to say three years ago, came out of stealth on The Cube.
>> October 27, 2013.
>> Right, and we were at the Warwick Hotel across the street from the Hilton. Yeah, Prakash came on The Cube and came out of stealth. Welcome back.
>> Thank you very much.
>> Great to see you guys. Taking the world by storm.
>> Great to be here, and of course, Prakash sends his apologies. He couldn't be here, so he sent his stunt double. (Dave and George laugh)
>> Great, so give us the update. What's the latest?
>> So there are a lot of great things going on in our space. The thing that we announced here at the show is what we're calling Paxata Connect, OK? Just in the same way that we created the self-service data preparation category, and now there are 50 companies that claim they do self-service data prep, we are moving the industry to the next phase of what we are calling our business information platform. Paxata Connect is one of the first major milestones in getting to that vision of the business information platform. What Paxata Connect allows our customers to do is, number one, to have visual, completely declarative, point-and-click browsing access to a variety of different data sources in the enterprise. For example, we are the only company that we know of that supports connecting to multiple, simultaneous, different Hadoop distributions in one system. So a Paxata customer can connect to MapR, they can connect to Hortonworks, they can connect to Cloudera, and they can federate across all of them, which is a very powerful aspect of the system.
>> And part of this involves, when you say declarative, it means you don't have to write a program to retrieve the data.
>> Exactly right. Exactly right.
>> Is this going into HDFS, into Hive, or?
>> Yes, it is. In fact, this multi-source Hadoop capability is one part of Paxata Connect. The second is, as we've moved into this information platform world, our customers are telling us they want read-write access to more than just Hadoop. Hadoop is obviously a very important part, but we're actually supporting NoSQL data sources like Cloudant and MongoDB, read and write, and we're supporting, for the first time, writes to relational databases: we already supported read, but now we actually support write to relational databases. So Paxata is really becoming kind of this fabric, a business-centric information fabric, that allows people to move data from anywhere to any destination, and transform it, profile it, explore it along the way.
>> Excellent. Let's get into some of the use cases.
>> Yeah, tell us where the banks are. The sense at the conference is that everyone sort of got their data lakes to some extent up and running. Now where are they pushing to go next?
>> Sure, that's an excellent question. So we have really focused on the enterprise segment, as you know. Among the customers that are working with Paxata from an industry perspective, banking is, of course, a very important one; we were really proud to share the stage yesterday with both Citi and Standard Chartered Bank, two of our flagship banking customers.
But Paxata is also heavily used in the United States government, in the intelligence community, I won't say any more about that. It's used heavily in retail and consumer products, it's used heavily in the high-tech space, and it's used heavily by data service providers, that is, companies whose entire business is based on data. But to answer your question specifically, what's happening in the data lake world is that a lot of folks, the early adopters, have jumped onto the data lake bandwagon. So they're pouring terabytes and petabytes of data into the data lake. And then the next question the business asks is, OK, now what? Where's the data, right? One of the simplest use cases, but actually one that's very pervasive for our customers, is they say, "Look, our business people don't even know what's in Hadoop right now." And by the way, I will also say that the data lake is not just Hadoop: Amazon S3 is also serving as a data lake, and the capabilities inside Microsoft's cloud are also serving as a data lake. Even the notion of a data lake is becoming this sort of polymorphic distributed thing. So what they want to be able to get is what we like to call first eyes on data. With Paxata, especially with the release of Connect, we let people just point and click their way and actually explore the data in all of the native systems before they even bring it in to something like Paxata. So they can actually sneak-preview thousands of database tables, or thousands of compressed data sets inside of Amazon S3, or thousands of data sets inside of Hadoop, and now the business people, for the first time, can point and click and actually see what is in the data lake in the first place. So step number one is, there have been a lot of IT-driven use cases that have motivated people to go to the data lake approach, but now all of our companies want to show business value. So tools and platforms like Paxata, which sit on top of the data lake and can federate across multiple data lakes and provide business-centric access to that information: that is the first significant use case pattern we're seeing.
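A hedged sketch of that "first eyes on data" idea against an S3-based lake, using boto3: enumerate what a bucket actually holds, then peek at only the first few kilobytes of one object rather than copying the data set anywhere. The bucket, prefix, and key are hypothetical, and credentials are assumed to come from the environment.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "corp-data-lake"       # hypothetical bucket
PREFIX = "raw/transactions/"    # hypothetical prefix

# Page through potentially thousands of objects to see what the lake holds.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])

# Preview a single data set: fetch only its first 4 KB, not the whole file.
head = s3.get_object(Bucket=BUCKET,
                     Key=PREFIX + "2016/09/part-00000.csv",  # hypothetical key
                     Range="bytes=0-4095")
print(head["Body"].read().decode("utf-8", errors="replace"))
```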
>> Just a clarification: could there be two roles, where one is a slightly more technical business user who exposes summarized views, so that the ultimate end user doesn't have to see the thousands of tables?
>> Absolutely, that's a great question. So when you look at self-service, if somebody wants to roll out a self-service strategy, there are multiple roles in an organization that actually need to intersect with self-service. There is a pattern in organizations where people say, "We want our people to get access to all the data." Of course it's governed, they have to have the right passwords and SSO and all that, but those are the companies who say, yes, the users really need to be able to see all of the data across these different tables. But there's a different role, who also uses Paxata extensively: the curators, right? These are the people who say, look, I'm going to provision the raw data, provide the views, provide even some normalization or transformation, and then land that data back into another layer, in what people call the data relay: they go from layer zero to layer one to layer two, they're different directory structures, but the point is, there's a natural processing frame that they're going through with their data. And then from the curated data that's created by the data stewards, the analysts can go pick it up.
>> One of the other big challenges that our research is showing, that chief data officers express, is that they get this data in the data lake. So they've got the data sources, you're providing access to it; the other piece is they want to trust that data. There's obviously a governance piece, but then there's a data quality piece. Maybe you could talk about that?
>> Absolutely. So use case number one is about access. The second reason that people are not so -- So, why are people doing data prep in the first place? They are trying to make information-driven decisions that actually help move their business forward. If you look at researchers from firms like Forrester, they'll say there are two reasons that slow down the latency of going from raw data to decision. Number one is access to data; that's the use case we just talked about. Number two is the trustworthiness of data. Our approach is very different on that. Once people actually can find the data that they're looking for, the big paradigm shift in the self-service world is that, instead of trying to process data based on transforming the metadata attributes (I'm going to draw a workflow diagram, bring in this table, aggregate with this operator, then split it this way, filter it, which is the classic ETL paradigm), the, I don't want to say profound, but maybe the very obvious thing we did was to say, "What if people could actually look at the data in the first place --"
>> And sort of program it by example?
>> We can tell, that's right. Because our eyes can tell us, our brains help us to say, we can immediately look at a data set, right? You look at an age column, let's say, and there are values in the age column of 150 years. Maybe 20 years from now there may be someone on Earth who lives to 150 years, but pretty much --
>> Highly unlikely.
>> The customers at the banks you work with are not 150 years old, right? So just being able to look at the data. And to get to the point that you're asking, quality is about data being fit for a specific purpose. In order for data to be fit for a specific purpose, the person who needs the data needs to make the decision about what is quality data. Both of you may have access to the same transactional data, raw data, that the IT team has landed in the Hadoop cluster. But now you pull it up for one use case, you pull it up for another use case, and because your needs are different, what constitutes quality to you, and where you want to make the investment, is going to be very different. So by putting the power of that capability into the hands of the person who actually knows what they want, that is how we are actually able to change the paradigm and really compress the latency from "Here's my raw data" to "Here's the decision I want to make on that data."
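A small pandas sketch of that "look at the data first" point: profile the same raw column an analyst would scan visually, then apply a fitness-for-purpose rule that the consumer of the data chooses. The tiny data set and the 120-year cutoff are illustrative assumptions, not anyone's production rule.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, 150, 58, -1],  # 150 and -1 are the kind of values eyes catch
})

# Profile the column the way an analyst would eyeball it.
print(customers["age"].describe())

# One user's definition of "fit for purpose"; another user's rule may differ.
MAX_PLAUSIBLE_AGE = 120
suspect = customers[(customers["age"] < 0) | (customers["age"] > MAX_PLAUSIBLE_AGE)]
print(f"{len(suspect)} of {len(customers)} rows fail this use case's age rule")
```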
Now, what happens in terms of governance, or more importantly, just trust, when the pipeline, you know, has to go beyond where you're working on it, to some of the analytics or some of the basic ingest? To say, "I know this data came from here "and it's going there." >> That's right, how do we verify the fidelity of these data sources? It's a fantastic question. So, in my career, having worked in BI for a couple of decades, I know I look much younger but it actually has been a couple of decades. Remember, the camera adds about 15 pounds, for those of you watching at home. (Dave and George laugh) >> George: But you've lost already. >> Thank you very much. >> So you've lost net 30. (Nenshad laughs) >> Or maybe I'm back to where I'm supposed to be. What I've seen as the two models of governance in the enterprise when it comes to analytics and information management, right? There's model one, which is, we're going to build an enterprise data warehouse, we're going to know all the possible questions people are going to ask in advance, we're going to preprogram the ETL routines, we're going to put something like a MicroStrategy or BusinessObjects, an enterprise-reporting factory tool. Then you spend 10 million dollars on that project, the users come in and for the first time they use the system, and they say, "Oh, I kind of want to change this, this way. "I want to add this calculation." It takes them about five minutes to determine that they can't do it for whatever reason, and what is the first feature they look for in the product in order to move forward? Download to Excel, right? So you invested 15 million dollars to build a download to Excel capability which they already had before. So if you lock things down too much, the point is, the end users will go around you. They've been doing it for 30 years and they'll keep doing it. Then we have model two. Model two is, Excel spreadsheet. Excel Hell, or spreadmarts. There are lots of words for these things. You have a version of the data, you have a version of the data, I have a version of the data. We all started from the same transactional data, yet you're the head of sales, so suddenly your forecast looks really rosy. You're the head of finance, you really don't like what the forecast looks like. And I'm the product guy, so why am I even looking at the forecast in the first place, but somehow I got access to the data, right? These are the two polarities of the enterprise that we've worked with for the last 30 years. We wanted to find sort of a middle path, which is to say, let's give people the freedom and flexibility to be able to do the transformations they need to. If they want to add a column, let them add a column. If they want to change a calculation, let them add a a calculation. But, every single step in the process must be recorded. It must be versioned, it must be auditable. It must be governed in that way. So why the large banks and the intelligence community and the large enterprise customers are attracted to Paxata is because they have the ability to have perfect retraceability for every decision that they make. I can actually sit next to you and say, "This is why the data looks like this. "This is how this value, which started at one million, "became 1.5 million." That covers the Paxata part. But then the answer to the question you asked is, how do you even extend that to a broader ecosystem? 
I think that's really about some of the metadata interchange initiatives that a lot of the vendors in the Hadoop space, but also in the traditional enterprise space, have had for the last many years. If you look at something like Apache Atlas or Cloudera Navigator, they are systems designed to collect, aggregate, and connect these different metadata steps, so you can see an end-to-end flow: this is the raw data that got ingested into Hadoop, these are the transformations that the end user did in Paxata in order to make it ready for analytics, this is how it's getting consumed in something like Zoomdata, and you actually have the entire life cycle of the data now manifested as a software asset.
>> So, in other words, those are not just managing within the perimeter of Hadoop. They are managers of managers.
>> That's right, that's right. Because the data is coming from anywhere, and it's going to anywhere. And then you can add another dimension of complexity, which is: it's not just one Hadoop cluster, it's 10 Hadoop clusters. And of those 10 Hadoop clusters, three of them are in Amazon, four of them are in Microsoft, three of them are in the Google Cloud platform. How do you know what people are doing with data then?
>> How is this all presented to the user? What does the user see?
>> Great question. The trick to all of this, to self-service, is that first you have to know very clearly: who is the person you are trying to serve? What are their technical skills and capabilities, and how can you get them productive as fast as possible? When we created this category, our key notion was that we were going to go after analysts. Now, that is a very generic term, right? Because we are all, in some sense, analysts in our day-to-day lives. But in Paxata, a business analyst, in an enterprise organizational context, is somebody who has the ability to use Microsoft Excel; they have to have that skill or they won't be successful with today's Paxata. They have to know what a VLOOKUP is, because a VLOOKUP is a way to pull data from a second data source into one. We would all know that as a join or a lookup. And the third thing is, they have to know what a pivot table is and how a pivot table works. Because the key insight we had is that, of the hundreds of millions of analysts, people who use Excel on a day-to-day basis, a lot of their work is data prep. But Excel, being an amazing generic tool, is actually quite bad for doing data prep. So when I go to a customer and they say, "Are we a good candidate to use Paxata?" and we're talking to the actual person who's going to use the software, I say, "Do you know what a VLOOKUP is, yes or no? Do you know what a pivot table is, yes or no?" If they have that skill, when they come into Paxata, we designed Paxata to be very attractive to those people. So it's completely point-and-click, completely visual, completely interactive. There's no scripting inside that whole process, because do you think the average Microsoft Excel analyst wants to script, or wants to use a proprietary wrangling language? I'm sorry, but analysts don't want to wrangle. Data scientists, the 1% of the 1%, maybe they like to wrangle, but you don't have that with the broader analyst community, and that is a much larger market opportunity that we have targeted.
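For readers who know VLOOKUP and pivot tables but not their database names, a short pandas sketch of the two operations the interview equates them with: a VLOOKUP is a join or lookup against a second data source, and a pivot table is a grouped aggregation. The tiny tables are invented for illustration.

```python
import pandas as pd

sales = pd.DataFrame({"region_id": [1, 2, 1], "amount": [100, 250, 175]})
regions = pd.DataFrame({"region_id": [1, 2], "region": ["East", "West"]})

# VLOOKUP equivalent: pull the region name in from the second table (a join).
joined = sales.merge(regions, on="region_id", how="left")

# Pivot-table equivalent: aggregate amount by region.
pivot = joined.pivot_table(values="amount", index="region", aggfunc="sum")
print(pivot)
```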
>> Well, very large. I mean, a lot of people are familiar with those concepts in Excel, and if they're not, they're relatively easy to learn.
>> Nenshad: That's right. Excellent.
>> All right, Nenshad, we have to leave it there. Thanks very much for coming on The Cube, appreciate it.
>> Thank you very much for having me.
>> Congratulations for all the success.
>> Thank you.
>> All right, keep it right there, everybody. We'll be back with our next guest. This is The Cube, we're live from New York City at Big Data NYC. We'll be right back. (electronic music)

Published Date: Sep 30, 2016


SENTIMENT ANALYSIS:

ENTITIES

Entity | Category | Confidence
Citi | ORGANIZATION | 0.99+
October 27, 2013 | DATE | 0.99+
George | PERSON | 0.99+
George Gilbert | PERSON | 0.99+
Nenshad | PERSON | 0.99+
IBM | ORGANIZATION | 0.99+
Dave Vellante | PERSON | 0.99+
Prakash | PERSON | 0.99+
Dave | PERSON | 0.99+
New York City | LOCATION | 0.99+
Nvidia | ORGANIZATION | 0.99+
Cisco | ORGANIZATION | 0.99+
Earth | LOCATION | 0.99+
15 million dollars | QUANTITY | 0.99+
two | QUANTITY | 0.99+
30 years | QUANTITY | 0.99+
Forrester | ORGANIZATION | 0.99+
Excel | TITLE | 0.99+
thousands | QUANTITY | 0.99+
50 companies | QUANTITY | 0.99+
10 million dollars | QUANTITY | 0.99+
Standard Chartered Bank | ORGANIZATION | 0.99+
New York City | LOCATION | 0.99+
Nenshad Bardoliwalla | PERSON | 0.99+
two reasons | QUANTITY | 0.99+
one million | QUANTITY | 0.99+
Microsoft | ORGANIZATION | 0.99+
Amazon | ORGANIZATION | 0.99+
first | QUANTITY | 0.99+
two roles | QUANTITY | 0.99+
two polarities | QUANTITY | 0.99+
1.5 million | QUANTITY | 0.99+
Hortonworks | ORGANIZATION | 0.99+
150 years | QUANTITY | 0.99+
Hadoop | TITLE | 0.99+
Paxata | ORGANIZATION | 0.99+
second reason | QUANTITY | 0.99+
One | QUANTITY | 0.99+
two models | QUANTITY | 0.99+
second | QUANTITY | 0.99+
one | QUANTITY | 0.99+
yesterday | DATE | 0.99+
Both | QUANTITY | 0.99+
three years ago | DATE | 0.99+
first time | QUANTITY | 0.98+
first time | QUANTITY | 0.98+
New York | LOCATION | 0.98+
both | QUANTITY | 0.98+
1% | QUANTITY | 0.97+
third thing | QUANTITY | 0.97+
one system | QUANTITY | 0.97+
about five minutes | QUANTITY | 0.97+
Paxata | PERSON | 0.97+
first feature | QUANTITY | 0.97+
Data | LOCATION | 0.96+
one part | QUANTITY | 0.96+
United States government | ORGANIZATION | 0.95+
thousands of tables | QUANTITY | 0.94+
20 years | QUANTITY | 0.94+
Model two | QUANTITY | 0.94+
10 Hadoop clusters | QUANTITY | 0.94+
terabytes | QUANTITY | 0.93+