
Search Results for Jupyter:

Anurag Gupta, Shoreline.io | AWS re:Invent 2022 - Global Startup Program


 

(gentle music)
>> Now welcome back to theCUBE, everyone. I'm John Walls, and once again, we're glad to have you here for AWS re:Invent 22. Our coverage continues here on Thursday, day three, of what has been a jam-packed week of tech, and AWS, of course, has been the great host for this. It's now a pleasure to welcome in Anurag Gupta, who is the founder and CEO of Shoreline, joining us here as part of the AWS Global Showcase Startup Program, and Anurag, good to see you, sir. Thanks for joining us.
>> Thank you so much.
>> Tell us about Shoreline, about what you're up to.
>> So we're a DevOps company. We're really focused on repairing issues. If you think about it, there are a ton of DevOps companies, and we all went to the cloud in order to gain faster innovation, and, by and large, check. Then all of the things involved in getting things into production, artifact generation, testing, configuration management, deployment, also by and large automated. Now pity the poor SRE who's getting the deluge of stuff on them, every week, every two days, sometimes multiple times a day, and it's complicated, right? Kubernetes, VMs, lots of services, multiple clouds sometimes, and you know, they need to know a little bit about everything. And you know what, there are a ton of companies that actually help you with what we call Day-2 Ops. It's just that most of them help you with observability, telling you what's gone wrong, or incident management, routing something to someone. But you know, back when I was at AWS, I never got really that excited about one more dashboard to look at or one more, like, better ticket routing. What used to really excite me was having some issue extinguished forever. And if you think about it, like the first five minutes of an incident are detecting and routing. The next hour, two hours, is some human being going in and fixing it, so that feels like the big opportunity to reduce, so hopefully we can talk a little bit about different ways that one can do that.
>> What about Day-2 Ops? Just tell me about how you define that.
>> So I basically define it as once the software goes into production, just making sure things stay up and are healthy, and you're resilient, and you don't get errors, and all of those sorts of things, because everything breaks sooner or later, you know, to a greater or lesser degree.
>> Especially that SRE you're talking about, right?
>> Yeah.
>> So let's go back to that scenario. Yeah, you pity the poor soul because they do have to be a little expert in everything.
>> Exactly.
>> And that's really challenging and we all know that, that's really hard. So how do you go about trying to lighten that burden, then?
>> So when you look at the numbers, somewhere between 40% and even 95% of the alarms that fire, the alerts that fire, are false positives, and that's crazy. Why is someone waking up just to deal with that?
>> It's a lot of wasted time, isn't it?
>> A lot of wasted time. And you know, you're also training someone into what I call ClickOps, just to go in and click the button and resolve it, and you don't actually know if it was a false positive or the rare real positive, and so that's a challenge, right? And so the first thing to do is to figure out where the false positives are. Like, let's say Datadog tells you that CPU is high and alarms. Is that a good thing or a bad thing? It's hard for them to tell, right? But you have to then introspect it into something precise like, oh, CPU is high, but response times are standard and the request rate is high.
Okay, that's a good thing. I'm going to ignore this. Or CPU is high, but it kind of resolves itself, so I'm going to not wake anybody up. Or CPU is high and, oh, it's the darn JVM starting to garbage collect again, so let me go and take a heap dump and give that to my dev team and then bounce the JVM, you know, without waking anybody up. Or CPU is high, I have no idea what's going on. Now it's time to wake somebody up. You know, what you want to use humans for is the ability to think about novel stuff, not to do repetitive stuff, so that's the first step. The second step is, about 40% of what remains is repetitive and straightforward. So like, a disk is full, I'd better clean up the garbage on the disk or maybe grow the disk. People shouldn't wake up just to grow a disk. And so for that, what you want to do is just have those sorts of things get automated away. One of the nice things about Shoreline is that we take the experience in what we build for one company and, if they're willing, provide it to everybody else. Our belief, a central tenet, is, if someone somewhere fixes something, everyone everywhere should gain the benefit, because we all sit on the same three clouds, we all sit on the same set of database infrastructure, et cetera. We should all get the same benefits. Why do we have to scar our own backs rather than benefiting from somebody else's scar tissue? So that's the second thing. The third thing is, okay, let's say it's not straightforward, not something I've seen before. Then in that case, what often happens is, on average, like, eight people get involved. You know, it initially goes to L1 support or L1 ops, but they don't necessarily know, because, as you say, the environment's complex. And so, you know, they go into Slack and they say, "@here, can somebody help me with this?" And those things take a much longer time, so wouldn't it be better if your best SRE is able to say, "Hey, check these 20 things and then run these actions"? We could convert that into, like, a Jupyter Notebook, where you could say, the incident got fired, I pre-populated all the diagnostics, and then I tell people very precisely, "If you see this, run this, et cetera." Like a wiki, but actually something you could run right in the product. And then, you know, the last piece of the puzzle, the smaller piece, is sometimes new things happen, and when something new happens, what you want is sort of the central tech of Shoreline, which is parallel distributed, real-time debugging. And so the ability to, you know, execute a command across your fleet rather than individual boxes, so that you can say something like, "I'm hearing that my credit card app is slow. For everything tagged as being part of my credit card app that's running over 90% CPU, please run a top command." And so, you know, then you can run it in the same time on one host as you can on 30,000, and that helps a lot. So that's the core of what we do. People use us for all sorts of things, also preventative maintenance, you know, just the proactive regular things. You know, like your car, you do an oil change; well, you know, you need to rotate your certs, your certificates. You need to make sure that, you know, there isn't drift in your configurations, there isn't drift in your software. There's also security elements to it, right? You want to make sure that you aren't getting weird inbound/outbound traffic across ports you don't expect to be open. You don't want to have these processes running, you know, maybe something's bad. And so that's all the kind of weird anomaly detection that's easy to do if you run things in a distributed, parallel way across everything, and that's super hard to do if you have to go and Whac-A-Mole across one box after the next.
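The triage-then-fan-out pattern described here maps to a small amount of logic. The sketch below is a loose illustration in Python, not Shoreline's actual Op language or API: the thresholds, host names, and SSH-based transport are all assumptions invented for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

def triage_high_cpu(cpu_samples: List[float], latency_ms: float,
                    baseline_latency_ms: float, gc_time_pct: float) -> str:
    """Classify a high-CPU alert into the four cases described above."""
    if latency_ms <= 1.1 * baseline_latency_ms:
        return "ignore"       # high load but normal response times: healthy
    if cpu_samples[-1] < 0.8 * max(cpu_samples):
        return "suppress"     # the spike is already resolving on its own
    if gc_time_pct > 30:
        return "remediate"    # JVM garbage collecting: heap dump, bounce, no page
    return "page"             # novel condition: time to wake somebody up

def run_top(host: str) -> Tuple[str, str]:
    """Take one batch-mode `top` snapshot on a host (assumes SSH access)."""
    out = subprocess.run(["ssh", host, "top", "-b", "-n", "1"],
                         capture_output=True, text=True, timeout=30)
    return host, out.stdout

# Fan the same diagnostic out across the fleet in parallel, so 30,000 hosts
# take roughly as long as one. The host list stands in for a tag-based lookup.
hosts = ["cc-app-01", "cc-app-02", "cc-app-03"]
with ThreadPoolExecutor(max_workers=64) as pool:
    for host, snapshot in pool.map(run_top, hosts):
        print(f"--- {host} ---\n{snapshot}")
```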
>> Well, which leads to a question just in terms of setting priorities then, which is what you're talking about, helping companies establish priorities, this hierarchy of level one warning, level two, level three, level four. Sounds like that should be basic, right? But you're saying that's not, that's not really happening in the enterprise.
>> Well, you know, I would say that if you hadn't automated deployments, you should do that first. If you haven't automated your testing pipeline, shame on you, you should do that like a year ago. But now it's time to help people in production, because you've done that other work and people are suffering. You know, the crazy thing about the cloud is that companies spend about three times more on the human beings to operate their cloud infrastructure than on the cloud infrastructure itself. I've yet to hear anybody say that their cloud bill is too low, you know, so, you know, there's a clear savings also available. And you know, back when I was at AWS, obviously I had to keep the lights on too, but it's kind of a tax on my engineers, and I'd really prefer to spend the head count on innovation, on doing things that delight my customers. You never delight your customers by keeping the lights on, you just avoid irritating them by turning 'em off, right?
>> So why are companies so fixed in on spending so much time on manually repairing things and not looking for these kinds of little, much more elegant solutions that are cost-efficient, time-saving, and so on and so forth?
>> Yeah, I think there just hasn't been very much in this space as yet, because it's a hard, hard problem to solve. You know, automation's a little bit scary, and that's the reality of it, and the way you make it less scary is by proving it out, by doing the simple things first, like reducing the alert fatigue, you know, that's easy. You know, providing notebooks to people so that they can click things and do things in a straightforward way. That's pretty easy. The full automation, that's kind of the North Star, that's what we aspire to do. But you know, people get there over time, and one of our customers had 700 instances of this particular incident solved for them last week. You imagine how many human beings would've been doing it otherwise, you know?
>> Right.
>> That's just one thing, you know?
>> How many did it take to build a pyramid? How many decades did that take, right? You had an announcement this week. I don't think we've talked about that.
>> No, yeah, so we just announced Incident Insights, which is a free product that lets people plug into initially PagerDuty, and pretty soon Opsgenie, ServiceNow, et cetera. And what you can do is, you give us a read-only API key and we will suck your PagerDuty data out. We apply some lightweight ML, unsupervised learning, and in a couple of minutes, we categorize all of your incidents so that you can understand which are the ones that happen most often and are getting resolved really quickly. That's ClickOps, right? Those alarms shouldn't fire. Which are the ones that involve a lot of people? Those are good candidates to build a notebook. Which are the ones that happen again and again and again?
Those are good candidates for automation. And so, I think one of the challenges people have is that they don't actually know what their teams are doing, and so this is intended to provide them that visibility. One of our very first customers was doing the beta test for us on it. He used to tell us he had about 100 tickets, incidents, a week. You know, he brought this tool in and he had 2,100 last week, and it was all, you know, like, these false alarms, so while he's giving us-
>> That was eye-opening for him to see that, sure.
>> And while he's, you know, looking at it, you know, he's just, like, filing Jiras to say, "Oh, change this threshold, cancel this alarm forever." You know, all of that kind of stuff. Before you get to do the fancy work, you got to clean your room before you get to do anything else, right?
>> Right, right, dinner before dessert, basically.
>> There you go.
>> Hey, thanks for the insights on this, and again, the name of the new product, by the way, is...
>> Incident Insights.
>> Incident Insights.
>> Totally free.
>> Free.
>> Yeah, it takes a couple of minutes to set up. Go to the website, Shoreline.io/insight, and you can be up and running in a couple of minutes.
>> Outstanding. Again, the company is Shoreline. This is Anurag Gupta, and thank you for being with us. We appreciate it.
>> Appreciate it, thank you.
>> Glad to have you here on theCUBE. Back with more from AWS re:Invent 22. You're watching theCUBE, the leader in high-tech coverage. (gentle music)
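The Incident Insights flow Gupta describes, pulling incidents with a read-only key and categorizing them with lightweight unsupervised learning, can be approximated with off-the-shelf tools. This is an outside guess at the approach, not Shoreline's implementation; the token, cluster count, and field choices are placeholders.

```python
import requests
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Pull recent incidents with a read-only key (PagerDuty REST API v2).
resp = requests.get(
    "https://api.pagerduty.com/incidents",
    headers={"Authorization": "Token token=YOUR_READ_ONLY_KEY",
             "Accept": "application/vnd.pagerduty+json;version=2"},
    params={"limit": 100, "sort_by": "created_at:desc"},
)
titles = [i["title"] for i in resp.json()["incidents"]]

# Vectorize the alert titles and group similar incidents together.
vectors = TfidfVectorizer(stop_words="english").fit_transform(titles)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(vectors)

# Large clusters that resolve quickly are candidates for alarms that
# shouldn't fire; clusters involving many responders suggest a notebook.
for cluster in range(8):
    members = [t for t, lbl in zip(titles, labels) if lbl == cluster]
    print(f"cluster {cluster}: {len(members)} incidents, e.g. {members[:2]}")
```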

Published Date : Dec 1 2022


Keith Norbie, NetApp | VMware Explore 2022


 

>>Okay, welcome back everyone, to theCUBE's live coverage of VMware Explore 2022. I'm John Furrier, host of theCUBE, with Dave Vellante, Lisa Martin, Dave Nicholson, two sets for three days. We're on day three, we're here breaking down all the action of what's going on around VMware. It's our 12th year covering VMware's user conference, formerly known as VMworld, now Explore, as it explores new territory, its future: multi-cloud, vSphere 8, and a variety of new next generation cloud. We're here on day three, breaking it out. This is day three, more, more intimate, much deeper conversations. And we have coming back on theCUBE Keith Norbie with NetApp, the worldwide product partner solutions executive at NetApp. Keith, great to see you, industry veteran, CUBE alumni. Thanks for coming back.
>>Good to see you again.
>>Yeah. I wanted to bring you back for a couple reasons. One is I want to talk about the NetApp story and also where that's going with VMware as that's evolving and, and is changing, and, and with Broadcom, and, and the new next generation, but also analyzing kind of the customer impact piece of it. You're like an analyst who's been in the industry for a long time, been commentating on theCUBE. VMware's in an interesting spot right now because I, I mean, I love the story. I mean, we can debate the messaging. Some people are very critical of it, a little bit too multicloud, not enough cloud native, but I see the waves, right? I get it. Virtualization kicked ass, tech names. Now it moves to hybrid cloud. And now this next gen is a, you know, clear cloud native multi-cloud environment. I, I get that. I can see, I can, I can get there, but is it ready? And the timing, right? And do they have all the piece parts? What's the role of the ecosystem? These are all open questions.
>>Yeah. And, and the reality is no one has a single answer. And that's part of the fun of this, is that not just NetApp, but the rest of the ecosystem, Nvidia's here, as an example. Who, who was thinking, you know, the kings of AI are gonna be sitting at a VMware show? And yet it's absolutely relevant. So you have a very complex set of things that emerge, but yet also, it's, it's, that's not overcomplicated. There is a set of primary principles that, you know, organizations I think are all looking to get to. And I think the reality is that this is maturing in different spurts. So whether it's ecosystem, or it's, you know, operations modes, and several other factors that kind of come into it, you know, that's part of the landscape.
>>You know, I gotta ask you, you know, you and I are both kind of historians. We always talk about what's happened and happening and gonna happen. You know, it's interesting, 12 years covering World and now Explore. NetApp has always been such a great company. We've been, I've been following that company, you know, since, you know, 1997, you know, days. And, and certainly with the past decade of the cloud or so, the moves you guys made have been really good, but NetApp's never really had the kind of positioning in the VMware story going back in the past 12 years. And this keynote, you guys were mentioned in the keynote. Yeah. Has there ever been a time where NetApp was actually mentioned in a keynote at World, or now Explore?
>>Well, you know, when we started this relationship back when I was a partner, I really monetized and took advantage of some of the advantages that NetApp had with VMware back in the early days, we're talking to ESX three days and they were dominant to the point where the rest of, you know, the ecosystem was trying to catch up. And of course, you know, a lot of competition from there, but yeah, it, it, it was great seeing a day, one VMware keynote with NetApp mentioned in the same relevance as AWS and VMware, which is exactly where we've been. You know, one thing that NetApp has done really well is not just being AWS, but be in all the hyperscalers as first party services and having a, a portfolio of other ways that we deal with things like, you know, data governance and cloud data management and cloud cloud backup, and overall dealing with cyber resiliency and, and ransomware protection and list goes on and on. So we've done our job to really make ourself both relevant and easy for people to consume. And it was great to see VMware and AWS come together. And the funny part was that, you know, we had on, on the previous cube session, you have VMware and AWS in between NetApp, all talking about, we have this whole thing running at all three of our booths. And that's fantastic. You >>Know, I, I can say because I actually was there and documented it and actually wrote about it in the early 20 11, 20 12, the then CEO Georgian's and I had an interview. He actually was the first storage company to actually engage with AWS back then. Yeah. I mean, that's a long time ago. That's that's 10 years ago. And then everyone else kind of followed EMC kind of was deer in the headlights at that point. They were poo pooing, AWS. Oh yeah, no, it'll never work either of which will never work. It's just a, a fluke. Yeah. For developers. NetApp was on the Amazon web services partnership train for a long time. >>Yeah. It, it, it's really amazing how early we got on this thing, which you can see the reason why that matters now is because it's not only in first party service, but that's also very robust and scalable. And this is one of the reasons why we think this opens it up. And, you know, as much as you wanna talk about the technology capabilities in, in this offering, the funny part is, is the intro conversation is how much money you save. So it unlocks all the, the use cases that you weren't able to do before. And when you, when you look at use case after use case on these workloads, they were hell held back. The number one conversation we had at this show was partner after partner, organization, after organization that came into our booth and talked to us about, yeah, I've got a bunch of these scenarios that I've been holding back on because I heard whispers about this. Now we're gonna go in >>Unleash those. All right. So what are, what's the top stories for you guys now at NetApp? What's the update it's been a while, since we had a cube update with you guys, what are you guys showing of the show? What's your agenda? What are your talking points? What's the main story? >>Well, for us, it's, it's, it's, it's always, you know, a cloud and on-prem combination of priorities within our partner ecosystem. The way we kind of communicate that out is really through three lenses. You know, one is on the hybrid cloud opportunity, people taking data center and modernizing the data center with the apps and getting the cloud, just like we're delivering here at this VMware world show. 
Also the AI and modern data analytics opportunity, and then public cloud. Because really, in a lot of these situations, the apps, you know, the, the buyer, the consumer, the people that are interested in transforming, are looking at it from different lenses. And these all start with really the customer journeys; the data ops buyer is different than the data center ops buyer. And, and that's exactly who we target. This is, is NetApp, I think, focusing relentlessly on how we reach them. And by the way, not just on storage products; if you look at, like, our Instaclustr acquisition and all these other things, we're trying to be as relevant as we can in data management, and you know, whether that's pipelining data management or storing data management, that's where we're there.
>>You know, I, I was talking with David Nicholson, cuz we have, you know, we joked together, I say the holy Trinity, he goes with the devil's triangle. I'm Catholic, gotta know what his, his denomination is. But storage, networking, and compute. Obviously the, the three majors, it never changes. And I think it was interesting now, and I wanna get your reaction to this and what NetApp's doing around it, is that if you look at the DevOps movement, it's clearly cloud native, but IT ops is not IT anymore. It's basically security and data, I'm, I'm oversimplifying, but DevOps, the developers now do a lot of that, I call it work, in the CI/CD pipeline, but the real challenge is data and ops. That's a storage conversation. Compute is beautiful. You got containers, Kubernetes, all kinds of stuff going on with compute: move compute around, move the data to compute. But storage is where the action is for cyber and data ops. Yeah. And AI. So, like, storage is back. It never left, but it's, it's transformed to even be more important, because the role of hyper-convergence shows that compute and storage go well together. What's your take on this, and how has NetApp modernized to, to solve the data ops challenge and take that to the next level, and obviously enable, and, and enable, great security and, or, defensibility?
>>Yeah. And that's why no one architecture is gonna solve every problem. That's why, when we look at the data ops buyer, there's adjacencies to the apps buyer, to the other cloud ops buyer. And there's also the FinOps buyer, because all of 'em have to work together. What we're, what we're focusing on isn't just storing data, but it's also things around how you discover and govern data, you know, how you protect data, even things like, in the EDA workspace, the chip manufacturers, how we use cloud bursting to be able to accelerate performance on chip design. So whether you're translating this for the industry vernacular about how we help, say, in the financial sector for AI, and what we do with Nvidia, or it's something translated to this VMware opportunity on AWS, you know, what we've put together is, is something that has as much meaningful relevance for storing data, but also for all the other adjacencies that kind of extend off there.
>>Talk about what you're doing with your partner. I saw last night, I did, I did a fly-by at a NetApp event. It was with Nvidia and Insight, which is a partner, an integrator partner. So you got a lot on the front lines, you got partners and you got, you know, big solutions with NetApp, and now vendors like Nvidia. What are you actually selling?
What's, what's getting, I guess, what's being put together? Not selling, you're obviously selling gear, but, like, solutions. What's being packaged for the customer? Where does, what does Nvidia fit in? And what's the winning formula? Take us through the highlights.
>>Yeah. And so the VMware highlights here are obviously that we're trying to get infrastructure foundations to just not be trapped in one cloud or any one on-prem. So having a little more elasticity. But if you extend that out, like you, like you mentioned, with a partner that's trying to, to go drive AI with Nvidia, you know, NetApp doesn't create any AI deals, cuz no one starts an AI journey with storage. They always start it with the, with the data model. So the data scientists will actually start these things in cloud, and they'll bring 'em on-prem once the data sets get to a, a big enough scenario, and then they wanna build it into a multi-cloud over time. And that's where Nvidia has really led the charge. So someone like an Insight, or other partners, could be Kyndryl or, or Accenture, or even small boutique partners that are in the data analytics space, they'll go drive that. And we provide not just data storage, but really complementary infrastructure. In fact, I always say it like, on the AI story alone, we have an integration for the data scientists. So when they go pull the data sets in, you can either do that as a manual copy that takes hours, sometimes days, or you can do it instantaneously with our integration to their Jupyter notebook. So I say, for AI as an example, NetApp creates time for data scientists.
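The instantaneous alternative to an hours-long dataset copy is a storage-level clone. As a generic illustration of the idea, here is standard Kubernetes CSI volume cloning via the official Python client, which is the mechanism a NetApp-backed storage class can accelerate; this is not the NetApp DataOps Toolkit's actual API, and the PVC names and storage class are invented.

```python
from kubernetes import client, config

config.load_kube_config()

# Clone a multi-terabyte training dataset by referencing the source PVC.
# With a CSI driver like NetApp Trident, this is a near-instant snapshot clone.
clone = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="training-data-clone"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="ontap-flexvol",  # hypothetical storage class
        resources=client.V1ResourceRequirements(requests={"storage": "2Ti"}),
        data_source=client.V1TypedLocalObjectReference(
            kind="PersistentVolumeClaim", name="training-data-gold"
        ),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="data-science", body=clone
)
# A Jupyter pod can now mount "training-data-clone" and start work in seconds.
```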
>>Got it. And where's the, the cloud transformation with you guys right now? How is the hybrid working? Obviously you got the public, and hybrid's a steady state right now; multi-cloud is still a little fantasy in terms of actual multi-cloud, that's coming next. But hybrid and cloud, what's the key configuration for NetApp, what's the hot products?
>>Well, I think the key is that you can't just be trapped in one location. So we started this whole thing back with data fabric, as you know, and it's built from there up into, into more of the ops layer and some of the technology layers that have to come with it. In fact, one of the things that we do that isn't always seen as adjacent to us is our Spot product on cloud, which allows you to play in the FinOps space, to be able to look at and analyze spend and sort of optimize environments for a DevOps cloud environment, to be able to give back a big percentage of what you probably misallocate in those operating models once you're working with NetApp, and allow you to redeploy it in the place that you wanna spend it. You know, so it's, it's both the upper and lower stories coming together.
>>Yeah. I was walking around the hallway yesterday, and I was kind of going through the main event last night, overheard people talking about ransomware. I mean, still, ransomware is such a big problem. Security's huge. How are you guys doing there? What's the story with security? Obviously ransomware is a big storage aspect, and, and backup, recovery and whatnot, all that's kind of tied together. How does NetApp enable better security? What's the story there?
>>Yeah, it's funny, because that's, that's where a lot of the headlines are, at this show, at every other show. Security, for us, it's really about cyber resilience. It is one of the key foundational parts of our hybrid cloud offerings. So as we go out to the partners, you mentioned, you know, Insight, and there's others, you know, CDW, AHEAD here, and the GSI hosting providers, they're all trying to figure out the security opportunity, because that is live. So we have a cyber resiliency solution that isn't just our snapshot technologies, but it's also some of the discovery, data governance. But also, you know, you gotta work this with the ecosystem; as we said, you know, you have all the other ISVs out there that have several solutions, not just the traditional data protection ones, but also the security players. Because if you look at the full perimeter, and you look at how you have to secure that and be able to both block, remediate, and bring back a site, you know, those are complex sets of things that no one person owns. But what we've tried to do is really be as, as meaningful and pervasive and integrated to that package as possible. That's why it's a lead story in the hybrid cloud.
>>Can you share for a minute, just give the NetApp commercial plug, cuz you guys have continued to stay relevant. What's the story this year for the folks watching, that are customers or potential customers? What's the NetApp story for this year?
>>Well, the net, the net-net for this year is kind of what I mentioned, which is, you know, we're in this multi-cloud world. So whether you're coming at this from any perspective, we have relevancy for, for the, the on-prem place that you've always enjoyed us. But at the opposite end of the spectrum, if you're coming at us from an AWS show, or the cloud ops buyer, we have a complete portfolio; if you never knew NetApp from the on-prem, you're gonna see us massively relevant in that, in that environment. And you just go to an AWS show, or a Microsoft Azure show, or a Google show, you'll see us there. You'll see exactly why we're relevant there. You'll see them mention why we're relevant there. So our message is really that we have a full portfolio across the hybrid multi-cloud, from any one buyer perspective, to be able to solve those problems. But, by the way, do it with partners, cuz the partners are the ones that complete all this. None of us on our own, AWS, Microsoft, VMware, NetApp, none of us have the singular solution ourselves, and we can't deliver it ourselves. You have to have those partners that have those skills, those competencies. And that's why we, we leverage it that way.
>>Great, great stuff. Now I gotta ask you, what, what's going on in your world with partners? How's it going? What's the vibe? Just share some insight into what's happening inside the partners. Are they happy with the margins? Are they shifting behavior? What are some of the, the high order bit news items or, or trends going on at the, on the front lines with your partners?
>>Well, I think, listen, the, the, the challenges, pitfalls, the, the objections, the, all the problems that have been there in the past are even more multiplied with today's economy and all the situations we've gone through with COVID. But the reality is, what's emerged is an interesting kind of tapestry of a lot of different partner types. So for us, we recognize that, across the traditional GSIs, you see these cloud native partners emerging, which is an exciting realm, you know, to look at folks that really built their business in the cloud with no on-prem, and being relevant with them, just consulting partners alone. Like the SAP ecosystem has a very condensed set of partners that really drive a lot of the transformation of SAP.
And a lot of them don't, you know, don't do product business. So how does someone like NetApp be relevant with them? You gotta put together an offering that says, we do X, Y, and Z for SAP. And so it's, it's a combination of these partners across the, the different ecosystems.
>>Yeah. And I, and I, I'm gonna, I wanna get your reaction to something, and you probably don't, you don't have to go out on a limb and, and put NetApp in a, in an official position. But I've been saying on theCUBE that no matter what happens with VMware's situation with Broadcom, this is not a dying market, right? I mean, like, you'd think when someone gets bought out, or, or intends to be bought out, there'd be, like, this, this dark cloud that would hang over the, the company, and the condition to watch is their user conference. So this is a good barometer to get a feel for it. And I gotta tell you, Sunday night here at VMware Explore, the expo floor was not dead. It was buzzing. It was packed, the ecosystem, and even the conversations and the positionings, it's all, all growth. So, so I think VMware's in a really interesting spot here with the Broadcom, because no matter what happens, that ecosystem's going to settle somewhere. Yeah. It's not going away, cuz they have such a great customer base. So, you know, assume that Broadcom is gonna do the right thing, and if they keep most of the jewels, they'll keep all the customers. So, but still, that wave is coming. Yeah. It's independent of VMware. Yeah. That's the whole point. So what happens next?
>>Well, I think, you know, we-
>>You guys are gonna mop up in business. Amazon's gonna get some business, Microsoft, HPE, you name it, all gonna-
>>Yeah. I think, you know, we've, we've been in business with Broadcom for a long time, whether it be the switch business, the chip business, everything in between. And so we've got a very mature relationship with them, and we have a great relationship with VMware. It's almost the best it's ever been now. And together, I think that will all just rationalize and, and settle over time, as this kind of goes through both the next Barcelona show and when it comes back here next year. And I think, you know, what you'll see is probably, you know, some of the stuff settle into the new things they announced here at the show, and the things that maybe you haven't heard from. But ultimately, the, these, these, these solutions that they have to come forward with, you know, have to land on things that go forward. And so today you just saw that with VMware trying to do VMware Cloud on AWS: they realized that there was a gap in terms of people adopting and wanting to do a storage expansion without adding compute. So they made a move with us that made total sense. I think you're gonna see more of those things that are very common sense ways to solve the, the barriers to, you know, modernization, adoption and maturity. That's just gonna be a natural part of the vetting. And I think they'll probably come a lot more.
>>It's gonna be very interesting. We interviewed AJ Patel yesterday. He heads up, he's SVP/GM of the modern app side. He's a middleware guy. So you can almost connect the dots, kind of, of where we're going with this. Yeah. So I assume there's a nice middleware layer developing where everybody wins, yeah, in this, if done properly. So it's clear that for VMware, no matter what happens at Broadcom, from this show, my assessment's all steam ahead. No one's holding back at this point.
>>Yeah. It's funny.
The, the most mature partners we talk to have this interesting sort of upper and lower story, and the upper story is all about that, that application, data, and middleware kind of layer: what are you doing there to be relevant about the different issues they run into, versus some of the stuff that we've grown up with on the infrastructure side, where they wanna make that as, as nascent as possible, like infrastructure as code and all this stuff that the automation platforms do. But you're right. If you don't get up into that application, middleware space, you know, and work on that, on that side of the house, you know, you're not gonna be relevant.
>>Yeah. I mean, it's interesting, you know, most people, people take it literally. It doesn't mean middleware, we don't mean middleware. We mean that what middleware was, yeah, in the old metaphor, still has to happen. That's where complexity is solved. You got hardware, essentially cloud, and you got applications, right? So it's all, all kind of the same, but not.
>>Yeah. In a lot of cases, it could be conceived as even, like, pipelining, you know. It's, it's, you have data and apps going through a transformation from the old style and the old application structures to cloud native apps and a, a much different architecture. The, the whole deal is, how are you relevant there? How are you solving real problems about simplifying, improving performance, improving security? You mentioned, all those things are relevant, and that's where, that's where you have to place your bets.
>>I love that storage is continuing to be at the center of the value proposition. Again, storage, compute, networking never goes away, it's just being kind of flexed in new ways, just to continue to, say, deliver better value. Keith, thanks for coming on theCUBE. Great to see you, see you again, man, day three, for coming back on and giving us some commentary. Really appreciate it. And congratulations on all the success with the partners and having the cloud story right. Thanks. Cheers.
Okay. More CUBE coverage after this short break: day three, wall-to-wall coverage. I'm John Furrier, host, with Dave Vellante, Lisa Martin, Dave Nicholson, all here covering VMware. We'll be back with more after this short break.

Published Date : Sep 1 2022


Joe Nolte, Allegis Group & Torsten Grabs, Snowflake | Snowflake Summit 2022


 

>>Hey everyone. Welcome back to theCUBE. Lisa Martin, with Dave Vellante. We're here in Las Vegas with Snowflake at Snowflake Summit 22. This is the fourth annual; there's close to 10,000 people here. Lots going on. Customers, partners, analysts, press, media, everyone talking about all of this news. We've got a couple of guests joining us. We're gonna unpack Snowpark. Torsten Grabs, the director of product management at Snowflake, and Joe Nolte, AI and MDM architect at Allegis Group. Guys, welcome to the program.
>>Thank you so much for having us.
>>Isn't it great to be back in person? It is.
>>Oh, wonderful. Yes, it is, indeed. Joe, talk to us a little bit about Allegis Group. What do you do? And then tell us a little bit about your role specifically.
>>Well, Allegis Group is a collection of OpCos, operating companies, that do staffing. We're one of the biggest staffing companies in North America. We have a presence in EMEA and in the APAC region. So we work to find people jobs, and we help get 'em staffed, and we help companies find people, and we help individuals find people.
>>Incredibly important these days, excuse me, incredibly important these days. It is.
>>Very, it very much is, right there. Tell me a little bit about your role. You are the AI and MDM architect. You wear a lot of hats.
>>Okay. So I'm an architect, and I support both of those verticals within the company. So I have a set of engineers and data scientists that work with me on the AI side, and we build data science models and solutions that help support what the company wants to do, right? So we build it to make business, business processes faster and more streamlined. And we really see Snowpark and Python helping us to accelerate that, and accelerate that delivery. So we're very excited about it.
>>Explain Snowpark for, for people. I mean, I look at it as this, this wonderful sandbox. You can bring your own developer tools in, but, but explain in your words what it is.
>>Yeah. So we got interested in, in Snowpark because increasingly, the feedback was that not everybody wants to interact with Snowflake through SQL. There are other languages that they would prefer to use, including Java, Scala, and of course, Python. Right? So then this led down to the, our, our work into Snowpark, where we're building an infrastructure that allows us to host other languages natively on the Snowflake compute platform. And now, here, what we're, what we just announced, is Snowpark for Python in public preview. So now you have the ability to natively run Python code on Snowflake and benefit from the thousands of packages and libraries that the open source community around Python has contributed over the years. And that's a huge benefit for data scientists, ML practitioners, and data engineers, because those are the, the languages and packages that are popular with them. So yeah, we very much look forward to working with the likes of you and other data scientists and, and data engineers around the Python ecosystem.
We get the same results, but the architecture to run that pipeline gets immensely easier because it's a store procedure that's already there. And implementing that calling to that store procedure is very easy. The architecture that we use today uses six different components just to be able to run that Python code on a VM within our ecosystem to make sure that it runs on time and is scheduled and all of that. Right. But with snowflake, with snowflake and snow park and snowflake Python, it's two components. It's the store procedure and our ETL tool calling it. >>Okay. So you've simplified that, that stack. Yes. And, and eliminated all the other stuff that you had to do that now Snowflake's doing, am I correct? That you're actually taking the application development stack and the analytics stack and bringing them together? Are they merging? >>I don't know. I think in a way I'm not real sure how I would answer that question to be quite honest. I think with stream lit, there's a little bit of application that's gonna be down there. So you could maybe start to say that I'd have to see how that carries out and what we do and what we produce to really give you an answer to that. But yeah, maybe in a >>Little bit. Well, the reason I asked you is because you talk, we always talk about injecting data into apps, injecting machine intelligence and ML and AI into apps, but there are two separate stacks today. Aren't they >>Certainly the two are getting closer >>To Python Python. It gets a little better. Explain that, >>Explain, explain how >>That I just like in the keynote, right? The other day was SRE. When she showed her sample application, you can start to see that cuz you can do some data pipelining and data building and then throw that into a training module within Python, right down inside a snowflake and have it sitting there. Then you can use something like stream lit to, to expose it to your users. Right? We were talking about that the other day, about how do you get an ML and AI, after you have it running in front of people, we have a model right now that is a Mo a predictive and prescriptive model of one of our top KPIs. Right. And right now we can show it to everybody in the company, but it's through a Jupyter notebook. How do I deliver it? How do I get it in the front of people? So they can use it well with what we saw was streamlet, right? It's a perfect match. And then we can compile it. It's right down there on snowflake. And it's completely easier time to delivery to production because since it's already part of snowflake, there's no architectural review, right. As long as the code passes code review, and it's not poorly written code and isn't using a library that's dangerous, right. It's a simple deployment to production. So because it's encapsulated inside of that snowflake environment, we have approval to just use it. However we see fit. >>It's very, so that code delivery, that code review has to occur irrespective of, you know, not always whatever you're running it on. Okay. So I get that. And, and, but you, it's a frictionless environment you're saying, right. What would you have had to do prior to snowflake that you don't have to do now? >>Well, one, it's a longer review process to allow me to push the solution into production, right. Because I have to explain to my InfoSec people, right? My other it's not >>Trusted. >>Well, well don't use that word. No. Right? It got, there are checks and balances in everything that we do, >>It has to be verified. 
And >>That's all, it's, it's part of the, the, what I like to call the good bureaucracy, right? Those processes are in place to help all of us stay protected. >>It's the checklist. Yeah. That you >>Gotta go to. >>That's all it is. It's like fly on a plane. You, >>But that checklist gets smaller. And sometimes it's just one box now with, with Python through snow park, running down on the snowflake platform. And that's, that's the real advantage because we can do things faster. Right? We can do things easier, right? We're doing some mathematical data science right now and we're doing it through SQL, but Python will open that up much easier and allow us to deliver faster and more accurate results and easier not to mention, we're gonna try to bolt on the hybrid tables to that afterwards. >>Oh, we had talk about that. So can you, and I don't, I don't need an exact metric, but when you say faster talking 10% faster, 20% faster, 50% path >>Faster, it really depends on the solution. >>Well, gimme a range of, of the worst case, best case. >>I, I really don't have that. I don't, I wish I did. I wish I had that for you, but I really don't have >>It. I mean, obviously it's meaningful. I mean, if >>It is meaningful, it >>Has a business impact. It'll >>Be FA I think what it will do is it will speed up our work inside of our iterations. So we can then, you know, look at the code sooner. Right. And evaluate it sooner, measure it sooner, measure it faster. >>So is it fair to say that as a result, you can do more. Yeah. That's to, >>We be able do more well, and it will enable more of our people because they're used to working in Python. >>Can you talk a little bit about, from an enablement perspective, let's go up the stack to the folks at Allegis who are on the front lines, helping people get jobs. What are some of the benefits that having snow park for Python under the hood, how does it facilitate them being able to get access to data, to deliver what they need to, to their clients? >>Well, I think what we would use snowflake for a Python for there is when we're building them tools to let them know whether or not a user or a piece of talent is already within our system. Right. Things like that. Right. That's how we would leverage that. But again, it's also new. We're still figuring out what solutions we would move to Python. We are, we have some targeted, like we're, I have developers that are waiting for this and they're, and they're in private preview. Now they're playing around with it. They're ready to start using it. They're ready to start doing some analytical work on it, to get some of our analytical work out of, out of GCP. Right. Because that's where it is right now. Right. But all the data's in snowflake and it just, but we need to move that down now and take the data outta the data wasn't in snowflake before. So there, so the dashboards are up in GCP, but now that we've moved all of that data down in, down in the snowflake, the team that did that, those analytical dashboards, they want to use Python because that's the way it's written right now. So it's an easier transformation, an easier migration off of GCP and get us into snow, doing everything in snowflake, which is what we want. >>So you're saying you're doing the visualization in GCP. Is that righting? >>It's just some dashboarding. That's all, >>Not even visualization. You won't even give for. You won't even give me that. Okay. Okay. But >>Cause it's not visualization. 
>>It's very, so that code delivery, that code review, has to occur irrespective of, you know, whatever you're running it on. Okay. So I get that. And, and, but you, it's a frictionless environment, you're saying, right? What would you have had to do prior to Snowflake that you don't have to do now?
>>Well, one, it's a longer review process to allow me to push the solution into production, right? Because I have to explain to my InfoSec people, right? My other, it's not-
>>Trusted.
>>Well, well, don't use that word. No. Right? There are checks and balances in everything that we do. It has to be verified. And that's all, it's, it's part of the, the, what I like to call the good bureaucracy, right? Those processes are in place to help all of us stay protected.
>>It's the checklist. Yeah. That you gotta go to.
>>That's all it is. It's like flying on a plane.
>>But that checklist gets smaller. And sometimes it's just one box now, with, with Python through Snowpark running down on the Snowflake platform. And that's, that's the real advantage, because we can do things faster, right? We can do things easier, right? We're doing some mathematical data science right now, and we're doing it through SQL, but Python will open that up much easier and allow us to deliver faster and more accurate results, and easier. Not to mention, we're gonna try to bolt on the hybrid tables to that afterwards.
>>Oh, we have to talk about that. So can you, and I don't, I don't need an exact metric, but when you say faster, are we talking 10% faster, 20% faster, 50% faster?
>>It really depends on the solution.
>>Well, gimme a range of, of the worst case, best case.
>>I, I really don't have that. I don't, I wish I did. I wish I had that for you, but I really don't have it.
>>I mean, obviously it's meaningful. I mean, if it is meaningful, it has a business impact.
>>I think what it will do is it will speed up our work inside of our iterations. So we can then, you know, look at the code sooner, right? And evaluate it sooner, measure it sooner, measure it faster.
>>So is it fair to say that as a result, you can do more?
>>We'll be able to do more, well, and it will enable more of our people, because they're used to working in Python.
>>Can you talk a little bit about, from an enablement perspective, let's go up the stack to the folks at Allegis who are on the front lines, helping people get jobs. What are some of the benefits of having Snowpark for Python under the hood? How does it facilitate them being able to get access to data, to deliver what they need to, to their clients?
>>Well, I think what we would use Snowflake and Python for there is when we're building them tools, to let them know whether or not a user or a piece of talent is already within our system, right? Things like that, right? That's how we would leverage that. But again, it's also new. We're still figuring out what solutions we would move to Python. We are, we have some targeted; like, I have developers that are waiting for this, and they're, and they're in private preview now. They're playing around with it, they're ready to start using it, they're ready to start doing some analytical work on it, to get some of our analytical work out of, out of GCP, right? Because that's where it is right now, right? But all the data's in Snowflake, and we need to move that down now; the data wasn't in Snowflake before. So the dashboards are up in GCP, but now that we've moved all of that data down in, down into Snowflake, the team that did those analytical dashboards, they want to use Python, because that's the way it's written right now. So it's an easier transformation, an easier migration off of GCP, and gets us into doing everything in Snowflake, which is what we want.
>>So you're saying you're doing the visualization in GCP. Is that right?
>>It's just some dashboarding. That's all.
>>Not even visualization? You won't even give me that. Okay. Okay. But-
>>'Cause it's not visualization. It's just some dashboards of numbers and percentages and things like that. There's no graphics.
>>And it doesn't make sense to run that in GCP; you could just move it into AWS or, or-
>>No, what we'll be able to do now is, all that data before was in GCP, and all that Python code was running in GCP. We've moved all that data outta GCP, and now it's in Snowflake, and now we're gonna work on taking those Python scripts that we thought we were gonna have to rewrite differently, right? Because Python wasn't available. Now that Python's available, we have an easier way of getting those dashboards back out to our people.
>>Okay. But you're taking it outta GCP, putting it into Snowflake, where, anywhere?
>>Well, so we'll build the, we'll build those, those, those dashboards, and they'll actually be displayed through Tableau, which is our enterprise tool for that. Yeah. Sure.
>>Okay. And then when you operationalize it, it'll go.
>>But the idea is, it's an easier pathway for us to migrate our code, our existing code; it's in Python, down into Snowflake, have it run against Snowflake, right? And because all the data's there.
>>Because it's not a, not a going out and coming back in. It's all integrated.
>>We want, we, we want our people working on the data in Snowflake. We want, that's our data platform. That's where we want our analytics done, right? We don't want, we don't want 'em done in other places. When we get all that data down, and we've, we've, over our data cloud journey, we've worked really hard to move all of the data we use out of existing systems on-prem, and now we're attacking our, the data that's in GCP and making sure it's down. And it's not a lot of data. And we, we fixed it with one data pipeline that exposes all that data down on, down in Snowflake now. And we're just migrating our code down to work against the Snowflake platform, which is what we want.
>>Why are you excited about hybrid tables? What's the, what's the potential?
>>Hybrid tables I'm excited about because we, so some of the data science that we do inside of Snowflake produces a set of results, and they're recommendations. Well, we have to get those recommendations back to our people, back into our, our talent management system, and there's just some delays. There's about an hour delay of delivering that data back to that team. Well, with hybrid tables, I can just write it to the hybrid table, and that hybrid table can be directly accessed from our talent management system, for the recruiters and for the hiring managers, to be able to see those recommendations in near real time. And that, that's the value.
>>Yep. We learned that access to real-time data in recent years is no longer a nice-to-have. It's like a huge competitive differentiator for every industry, including yours. Guys, thank you for joining Dave and me on the program, talking about Snowpark for Python, what that announcement means, and how Allegis is leveraging the technology. We look forward to hearing what comes when it's GA.
>>Yeah. We're looking forward to, to it. Nice.
>>Guys, great. All right, guys. Thank you to our guests and Dave Vellante. I'm Lisa Martin. You're watching theCUBE's coverage of Snowflake Summit 22. Stick around. We'll be right back with our next guest.
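For reference, a sketch of the hybrid-table pattern Nolte describes above: the data science job writes recommendations into a hybrid table, and the talent management system reads them back with a point lookup instead of waiting an hour for a hand-off. Unistore hybrid tables were still in preview at the time, so the DDL below is approximate, and every table, column, and credential is invented for illustration.

```python
from snowflake.snowpark import Session

# Placeholder connection parameters.
session = Session.builder.configs({
    "account": "my_account", "user": "my_user", "password": "...",
    "warehouse": "APP_WH", "database": "TALENT", "schema": "PUBLIC",
}).create()

# Hybrid tables require a primary key, which backs fast point lookups.
session.sql("""
    CREATE OR REPLACE HYBRID TABLE candidate_recommendations (
        candidate_id NUMBER PRIMARY KEY,
        job_id       NUMBER,
        score        FLOAT,
        updated_at   TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
    )
""").collect()

# The model writes its recommendations directly, no hour-long delay.
session.sql("""
    INSERT INTO candidate_recommendations (candidate_id, job_id, score)
    VALUES (1001, 42, 0.93)
""").collect()

# The talent management system reads them back in near real time.
rows = session.sql(
    "SELECT job_id, score "
    "FROM candidate_recommendations WHERE candidate_id = 1001"
).collect()
print(rows)
```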

Published Date : Jun 15 2022

SUMMARY :

This is the fourth annual there's close to Us. Isn't it great to be back in person? Yes, it Joe, talk to us a little bit about Allegis group. So we work to find people jobs, and we help get 'em staffed and we help companies find people and we help It is You are the AI and MDM architect. on the AI side, and we build data science models and solutions I mean, I look at it as this, this wonderful sandbox. and libraries that the open source community around Python has contributed over the years. And implementing that calling to that store procedure is very easy. And, and eliminated all the other stuff that you had to do that now Snowflake's doing, am I correct? we produce to really give you an answer to that. Well, the reason I asked you is because you talk, we always talk about injecting data into apps, It gets a little better. And it's completely easier time to delivery to production because since to snowflake that you don't have to do now? Because I have to explain to my InfoSec we do, It has to be verified. Those processes are in place to help all of us stay protected. It's the checklist. That's all it is. And that's, that's the real advantage because we can do things faster. I don't need an exact metric, but when you say faster talking 10% faster, I wish I had that for you, but I really don't have I mean, if Has a business impact. So we can then, you know, look at the code sooner. So is it fair to say that as a result, you can do more. We be able do more well, and it will enable more of our people because they're used to working What are some of the benefits that having snow park of that data down in, down in the snowflake, the team that did that, those analytical dashboards, So you're saying you're doing the visualization in GCP. It's just some dashboarding. You won't even give for. It's just some D boardings of numbers and percentages and things like that. gonna have to rewrite differently. And they'll actually be, they'll be displayed through Tableau, which is our enterprise And then when you operationalize it it'll go. And because all the data's there And it's not a lot of data. so some of the data science that we do inside of snowflake produces a set of results and We look forward to hearing what comes when it's GA Thank you for our guests and Dave ante.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
David | PERSON | 0.99+
Lisa Martin | PERSON | 0.99+
Joe | PERSON | 0.99+
10% | QUANTITY | 0.99+
20% | QUANTITY | 0.99+
Dave | PERSON | 0.99+
Allegis | ORGANIZATION | 0.99+
Las Vegas | LOCATION | 0.99+
Allegis Group | ORGANIZATION | 0.99+
Joe Nolte | PERSON | 0.99+
50% | QUANTITY | 0.99+
north America | LOCATION | 0.99+
Python | TITLE | 0.99+
Java Scala | TITLE | 0.99+
SQL | TITLE | 0.99+
both | QUANTITY | 0.99+
one box | QUANTITY | 0.99+
two | QUANTITY | 0.99+
thousands | QUANTITY | 0.99+
Snowflake Summit 2022 | EVENT | 0.98+
AWS | ORGANIZATION | 0.98+
Tableau | TITLE | 0.98+
six different components | QUANTITY | 0.98+
two components | QUANTITY | 0.98+
Python Python | TITLE | 0.98+
Torsten Grabs | PERSON | 0.97+
one | QUANTITY | 0.96+
today | DATE | 0.96+
Torston | PERSON | 0.96+
Allegis group | ORGANIZATION | 0.96+
OPCA | ORGANIZATION | 0.95+
one data | QUANTITY | 0.95+
two separate stacks | QUANTITY | 0.94+
InfoSec | ORGANIZATION | 0.91+
Dave ante | PERSON | 0.9+
fourth annual | QUANTITY | 0.88+
Jupyter | ORGANIZATION | 0.88+
park | TITLE | 0.85+
snowflake summit 22 | EVENT | 0.84+
10,000 people | QUANTITY | 0.82+
Snowflake | ORGANIZATION | 0.78+
AMEA | LOCATION | 0.77+
snow park | TITLE | 0.76+
snow | ORGANIZATION | 0.66+
couple of guests | QUANTITY | 0.65+
NTY | ORGANIZATION | 0.6+
Snowflake | EVENT | 0.59+
MDM | ORGANIZATION | 0.58+
APAC | ORGANIZATION | 0.58+
prem | ORGANIZATION | 0.52+
GA | LOCATION | 0.5+
snow | TITLE | 0.46+
SRE | TITLE | 0.46+
lit | ORGANIZATION | 0.43+
stream | TITLE | 0.41+
22 | QUANTITY | 0.4+

The Future Is Built On InfluxDB


 

>>Time series data is any data that's stamped in time in some way. That could be every second, every minute, every five minutes, every hour, every nanosecond, whatever it might be. Typically that data comes from sources in the physical world, like devices or sensors: temperature gauges, batteries, any device really. Or it comes from things in the virtual world: software in the cloud, data in containers, microservices, virtual machines. All of these items, whether in the physical or the virtual world, are generating a lot of time series data. Now, time series data has been around for a long time, and there are many examples in our everyday lives. All you've got to do is punch up any stock ticker and look at its price over time in graphical form; that's a simple use case anyone can relate to. And you can build timestamps into a traditional relational database: you just add a column to capture time. As well, there are examples of log data being dumped into a data store that can be searched, ingested, and visualized. Now, the problem with the latter example is that you have to hunt and peck and search and extract what you're looking for. And the problem with the former is that traditional general-purpose databases are designed as a sort of Swiss army knife for any workload, and there are a lot of functions that get in the way and make them inefficient for time series analysis, especially at scale. Think about OT and edge scale, where things are happening super fast, ingestion is coming from many different sources, and analysis often needs to be done in real time or near real time. That's where time series databases come in. They're purpose-built and can much more efficiently support ingesting metrics at scale and then comparing data points over time. Time series databases can write and read at significantly higher speeds and deal with far more data than traditional database methods, and they're more cost-effective: instead of throwing processing power at the problem, the underlying architecture and algorithms of a time series database can optimize queries and reclaim and reuse wasted storage space. At scale, time series databases are simply a better fit for the job. Welcome to Moving the World with InfluxDB, made possible by InfluxData. My name is Dave Vellante and I'll be your host today. InfluxData is the company behind InfluxDB, the open source time series database designed specifically to handle time series data, as I just explained. We have an exciting program for you today, and we're going to showcase some really interesting use cases. First, we'll kick it off in our Palo Alto studios, where my colleague John Furrier will interview Evan Kaplan, the CEO of InfluxData. After John and Evan set the table, John will sit down with Brian Gilmore, the director of IoT and emerging tech at InfluxData, and they'll dig into where InfluxData is gaining traction, why adoption is occurring and why it's so robust, with tons of examples and a double-click into the technology. Then we bring it back here to our east coast studios, where I get to talk to two practitioners doing amazing things in space with satellites and modern telescopes. These use cases will blow your mind; you don't want to miss them. So thanks for being here today, and with that, let's get started. Take it away, Palo Alto.
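To ground the narration, here is a minimal sketch of what writing a timestamped point looks like with InfluxDB's Python client library. The URL, token, bucket, and measurement names are placeholders, not details from the program.

```python
# Minimal sketch: writing a timestamped point with the influxdb-client
# Python library (InfluxDB 2.x). URL, token, org, and bucket are placeholders.
from datetime import datetime, timezone

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="<token>", org="<org>")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One point per reading: a measurement, tags for metadata, fields for values.
point = (
    Point("temperature")
    .tag("sensor_id", "device-42")
    .field("celsius", 21.7)
    .time(datetime.now(timezone.utc))
)
write_api.write(bucket="telemetry", record=point)
```

Each point carries a measurement name, tags, fields, and a timestamp; this is the shape of data a purpose-built time series database is optimized to ingest and query at scale.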
>>Okay. Today we welcome Evan Kaplan, CEO of InfluxData, the company behind InfluxDB. Welcome, Evan. Thanks for coming on. >>Hey John, thanks for having me. >>Great segment here on the InfluxDB story. What is the story? Take us through the history. Why time series? >><laugh> So the history is actually pretty interesting. Paul Dix, my partner in this and our founder, is super passionate about developers and developer experience. He had worked on Wall Street building a number of time series platforms, trading platforms for trading stocks, and from his point of view it was always what he would call a yak shave, which means you had to do a ton of work just to start doing work. You had to write a bunch of extrinsic routines, a bunch of application handling on existing relational databases, in order to come up with something that was optimized for a trading platform or a time series platform. And he developed this really clear point of view: this is not how developers should work. So in 2013 he went through Y Combinator, and he made his first commit to open source InfluxDB at the end of 2013. From my point of view, he invented modern time series: you start with a purpose-built time series platform to do these kinds of workloads, and you get all the benefits of having something right out of the box, so a developer can be totally productive right away. >>And how many people are in the company? What's the history of employees? >>I always forget the number, but it's something like 230 or 240 people now. I joined the company in 2016. I loved Paul's vision, and I just had a strong conviction about the relationship between time series and IoT. If you think about it, what sensors do is speak time series: pressure, temperature, volume, humidity, light. They're measuring, instrumenting something over time. So I thought it would be super relevant over the long term, and I've not regretted it. >>It's interesting, at that time, if you go back in the history, the relational database was the one database to rule the world. Then as clouds started coming in, you started to see more databases proliferate, more types of databases, and time series in particular is interesting, because real time has become super valuable from an application standpoint. OT, which speaks time series, means something: time matters. >>Time. >>Yeah, and sometimes data's not worth it after the time, sometimes it's worth it. And then you get the data lake. So you have this whole new evolution. What's the momentum behind the time series category? What's the bottom line? >>Well, think about it from a broad frame: what everybody's trying to do is build increasingly intelligent systems, whether it's a self-driving car, a robotic system that does what you want it to do, or a self-healing software system. And in order to build these increasingly intelligent systems, you have to instrument the system well, and you have to instrument it over time, better and better.
So you need a fundamental tool to drive that instrumentation, and it's become clear to everybody that that instrumentation is all based on time: what happened, what's happening, what's going to happen. So you get to applications like predictive maintenance and smarter systems, and increasingly you want to do that stuff not just intelligently but fast, in real time, with millisecond response, so that when you're driving a self-driving car and the system realizes what's about to happen, it can act in something that looks like real time. All systems want to be more intelligent and more real time. We just happened to show up at the right time in the evolution of the market. >>It's interesting: near real time isn't good enough when you need real time. >><laugh> Yeah, it's not. And ironically, even when you don't need it, you want it. It's like buying a new television: you want that one feature even though you're not going to use it. Real time becomes a buying criterion. >>So what you're saying is that near real time is getting as close to real time as possible, as fast as possible. Okay. Let's talk about the aspect of data. We're hearing a lot of conversations on theCUBE about how people are implementing and actually getting better, iterating on data, but you have to know when something happened to know how to fix it. We're seeing people say: I want to make my machine learning algorithms better after the fact; I want to learn from the data. How do you see that evolving? Is that one of the use cases, as people bring data in off the network, getting better with the data, knowing when it happened? >>For sure. None of this is non-linear; it's all incremental. Take an easy example, a self-driving car: you're instrumenting that car to understand where it can perform in the real world in real time, and you run the loop: I instrumented it, I watched what happened, oh, that's wrong, I correct for it in the software. If you do that a billion times, you get a self-driving car. Every system moves along that evolution, that dynamic of constantly instrumenting, watching the system behave, and correcting. A self-driving car is one thing, but even in the human genome, or if you look at some of our customers, people doing solar arrays, people doing power walls, all of these systems are getting smarter. >>Let's get into that. What are the top applications? What's the sweet spot for the InfluxDB use case, and can you give some customer examples? >>It's pretty easy to understand one side of the equation, the physical side: sensors are getting cheap, obviously, and the whole physical world is getting instrumented: your home, your car, the factory floor, your wristwatch, your healthcare, you name it.
We're watching the physical world in real time, and there are three or four sweet spots for us, all on that side, all about IoT. Think about consumer IoT projects like Google's Nest or Tado, Particle's sensors, even delivery engines like Rappi, who deliver the Instacart of South America, anywhere there's a physical location. That's the consumer side. Another exciting space is the industrial side: factories are changing dramatically over time, increasingly moving away from proprietary equipment to developer-driven systems that run operations, because when you're building a factory, the systems all have to get smarter. And lastly, there's a lot in renewables and sustainability: Tesla, Lucid Motors, Nikola Motors, lots to do with electric cars, solar arrays, windmill arrays, anything that's going to get instrumented, where that instrumentation becomes part of the purpose. >>It's interesting: the convergence of physical and digital is happening with the data, with IoT. You think of IoT and the use cases there: it was proprietary OT systems, now becoming IP-enabled, internet protocol, and now edge compute getting smaller, faster, cheaper, and AI going to the edge. Now you have all kinds of new capabilities that bring that real-time, time series opportunity. Are you seeing IoT going to a new level? Where are the IoT dots connecting to? Because as these two cultures merge, operations, the industrial factory, the car, they've got to get smarter. Intelligent edge is a buzzword, but it has to be more intelligent. Where's the action in all this? >>The action is really at the developer, because it's very hard to get an off-the-shelf system to do these kinds of physical and software interactions. So what you're seeing is the world that maybe you and I grew up in, with IT or OT, moving increasingly to developer-driven capability. All of these IoT systems are bespoke; they don't come out of the box. The developer, the architect, the CTO defines: what's my business? What am I trying to do? Am I trying to sequence a human genome and figure out when these genes express themselves, or am I trying to figure out when the next heart-rate reading is going to show up on my Apple Watch? What's the system I need to build? So it starts with the developers; that's where all the good stuff happens, which is different than it used to be. It used to be you'd buy an application or a service or a SaaS product, but with this integration of systems, it's all about bespoke; it's all about building something. >>So let's get to the developer real quick. The real highlight point here is the data. I can see a developer saying: I need to have an application for the edge, IoT edge, or the car. Tesla's got applications on the car; it's right there. There's the modern application lifecycle now. Take us through how this impacts the developer. Does it impact their CI/CD pipeline? Is it cloud native? Where does this all go?
>>Well, first of all, there was an internal journey we had to go through as a company, which I think is fascinating for anybody who's interested: we went from a primarily monolithic piece of software that was open source to building a cloud native platform, which means we had to move from an agile development environment to a CI/CD environment. To the degree that your service is cloud, whether it's Tesla monitoring your car and updating your Powerwalls, or a solar company updating its arrays, you increasingly move from agile development to CI/CD, where you're shipping code to production every day. And it's not just the developers; all the infrastructure to support the developers in running that service has to move too. I think that's also going to happen in a big way. >>With your customer base, as you see it evolving with InfluxDB, are they writing more of the application, or relying more on others? There's an open source component here. When you bring in the old way versus the new way: the old way was a proprietary platform running all this OT stuff, and you'd write a general-purpose application on top: some flexibility, somewhat brittle, maybe not a lot of robustness, but it does its job. >>A good way to think about this is: what's the role of the developer, architect, and CTO chain within a large enterprise or a company? I started my career in the aerospace industry <laugh>, and when you look at what Boeing does to assemble a plane, they build very, very few of the parts. Instead, they assemble: they buy the engines, they buy the material for the wings and build the wings themselves, because there's a lot of tech in the wings, and they end up being smart assemblers of what becomes a flying airplane, which is a pretty big deal even now. The same thing happens with software people: they have the ability to pull from the best of the open source world. They'd pull a time series capability from us, assemble it with some ETL logic from somebody else, or with a Kafka interface to stream the data in. They become very good integrators and assemblers, and they become masters of that bespoke application. I think that's where it goes, because you're not writing native code for everything. >>So they're more flexible, they have faster time to market because they're assembling way faster, and they still maintain their core competency, their wings in this case. >>They become not just coders but designers and developers; broadly, builders, people who start and build stuff. By the way, this is no different from what the people just up the road at Google have been doing for years, or tier-one players like Amazon, building all their own. >>Well, one of the things that's interesting is this idea of a systems architecture. Systems have consequences when you make changes.
So when you have cloud, data center, on-premises, and edge working together, how does that work across the system? You can't have a wing that doesn't work with the other wing, kind of thing. >>Exactly, and that's where that airplane-building analogy comes in for us. We've been really thoughtful about this, because for IoT it's critical. Our open source edge has the same API as our cloud native product and our enterprise on-prem edge. Our multiple products have the same API, and they have a relationship with each other; they can talk with each other. So the builder builds it once. When you start thinking about the components people have to use to build these services, you want to make sure that at least that base layer, the database layer, has components that talk to each other. >>So I'll ask you, with my customer hat on: okay... >>That means you have a PO. <laugh> >>A big check, a blank check, if you can answer this question. I've got all this important operational stuff: my factory, my self-driving cars. This isn't trivial; this is my business. How should I be thinking about time series? Because now I have to make these architectural decisions, as you mentioned, and it's going to impact my application development; it's a huge decision point for your customers. What should I care about most? What's in it for me? Why is time series important? >>Yeah, that's a great question. Chances are, if you've got a business that's 20 or 25 years old, you were already thinking about time series. You probably didn't call it that; you built something on Oracle or on IBM's DB2, and you made it work within your system. So it's already out there; there are probably hundreds of millions of time series applications today. But as you start to think about this increasing need for real time, about increasing intelligence, about optimizing systems over time, I hate the word, but digital transformation, then you start with time series. It's a foundational base layer for any system you're going to build. There's no system I can think of where time series shouldn't be the foundational base layer. If you just want to store your data and leave it there, and maybe look it up every five years, that's fine; that's not time series. Time series is when you're building a smarter, more intelligent, more real-time system. The developers now know that, and the more they play a role in building these systems, the more obvious it becomes. >>And since I have a PO for you and a big check: what's the value to me when I implement this? What's the end state? What does it look like when it's up and running? >>When it's up and running, you're able to handle the queries, the writing of the data, the downsampling of the data, and transforming it in near real time, so that the systems that depend on it, whether they're adjusting a solar array, trading energy off a power wall, or doing some sort of human genome work, work better. So time series is foundational.
It's not like it's performing every action in the layers above, but it's foundational to building a really compelling, intelligent system. I think that's what developers and architects are seeing now. >>Bottom line, final word: what's your statement to the customer? What would you say to someone looking to do something in time series on the edge? >>It's pretty clear to us that if you view yourself as being in the business of building systems, and you want them to be increasingly intelligent, self-healing, autonomous, and operating in real time, you start with time series. But I also want to say what's in it for us at Influx: people are doing some amazing stuff. I highlighted some of the energy work, the human genome, the healthcare. It's hard not to feel proud, like, wow, somehow I've been lucky; I've arrived at the right time, in the right place, with the right people to be able to deliver on that. That's also exciting on our side of the equation. >>Yeah, it's critical infrastructure, critical operations. >>Yeah. >>Great stuff, Evan. Thanks for coming on; appreciate this segment. All right, in a moment, Brian Gilmore, director of IoT and emerging technology at InfluxData, will join me. You're watching theCUBE, the leader in tech coverage. >>Time series data from sensors, systems, and applications is a key source in driving automation and prediction in technologies around the world. But managing the massive amount of timestamped data generated these days is overwhelming, especially at scale. That's why InfluxData developed InfluxDB, a time series data platform that collects, stores, and analyzes data. InfluxDB empowers developers to extract valuable insights and turn them into action by building transformative IoT, analytics, and cloud native applications, purpose-built and optimized to handle the scale and velocity of timestamped data. InfluxDB puts the power in your hands with developer tools that make it easy to get started quickly with less code. InfluxDB is more than a database; it's a robust developer platform with integrated tooling that's written in the languages you love, so you can innovate faster. Run InfluxDB anywhere you want by choosing the provider and region that best fits your needs across AWS, Microsoft Azure, and Google Cloud. InfluxDB is fast and automatically scalable, so you can spend time delivering value to customers, not managing clusters. Take control of your time series data so you can focus on the features and functionality that give your applications a competitive edge. Get started for free with InfluxDB: visit influxdata.com/cloud to learn more. >>Okay, now we're joined by Brian Gilmore, director of IoT and emerging technologies at InfluxData. Welcome to the show. >>Thank you, John. Great to be here. >>We just spent some time with Evan going through the company and the value proposition. With InfluxDB, what's the momentum? Where do you see it coming from? What's the value coming out of this? >>Well, I think we're hitting a point where adoption of the technology is becoming mainstream.
We're seeing it in all sorts of organizations, everybody from the most well-funded, advanced, big technology companies to smaller academic teams and startups. The data that emits from that technology is time series, and being able to give them a platform, a tool, that's super easy to use, easy to start with, and that will of course grow with them, has been key for us: riding along with them as they're successful. >>Evan was mentioning that time series has been on everyone's radar, and it's been in the OT business for years. Go back to 2013, 2014, even five years ago: that convergence of physical and digital coming together, the IP-enabled edge. Edge has always been kind of hyped up, but why now? Why is the edge so hot right now from an adoption standpoint? Is it just evolution, the tech getting better? >>I think it's twofold. Everybody was so focused on cloud over the last ten years or so that they forgot about the compute that was available at the edge, and those in OT and on the factory floor, who weren't able to take full advantage of cloud through their applications, still needed to leverage that compute at the edge. The big thing we're seeing now, which is interesting, is that there's a hybrid nature to all of these applications: some data is generated at the edge, some data is generated in the cloud, and it's the ability for a developer to tie those two systems together and work with that data in a unified, uniform way that gives them the opportunity to build solutions that really deliver value to whatever they're trying to do, whether it's the outer reaches of outer space or optimizing the factory floor. >>You also mentioned the genome. Big data is coming to the real world, and IoT had been kind of a thing for OT and some use cases, but now, with the cloud, all companies have an edge strategy. So what's the secret sauce? Because this is now a hot product for the whole world, not just industrial, but all businesses. >>Part of it is just that the technology is becoming more capable, especially on the hardware side: compute is getting smaller and smaller. We support all the way down to the edge, even to the microcontroller layer, with our client libraries, and we work hard to make our applications, especially the database, as small as possible, so it can be located as close as possible to the point of origin of the data at the edge. You can run it locally, do your local decision making, and use InfluxDB as an input to the automation, control, and autonomy people are trying to drive at the edge. But when you link it up with everything that's in the cloud, that's when you get all of the cloud-scale capabilities of parallelized AI and machine learning and all of that.
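As an illustration of how small an edge write can be, here is the raw line-protocol HTTP call that the client libraries wrap; something of this shape can run on a very constrained device. The endpoint, credentials, and names are hypothetical.

```python
# Sketch: the raw HTTP line-protocol write that InfluxDB client libraries
# wrap; small enough for a constrained edge device. Placeholders throughout.
import time

import requests

# Line protocol: measurement,tag=value field=value timestamp
line = f"engine_temp,unit=pump-7 celsius=88.4 {time.time_ns()}"
resp = requests.post(
    "http://localhost:8086/api/v2/write",
    params={"org": "factory", "bucket": "edge", "precision": "ns"},
    headers={"Authorization": "Token <token>"},
    data=line,
)
resp.raise_for_status()  # InfluxDB returns 204 No Content on success
```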
>>What's interesting is the open source success, something we've talked about a lot on theCUBE, how people are leveraging it. You have users in the enterprise, users in the IoT market, but you've got developers now too. How do you see that emerging? How do developers engage? What are some of the things you're seeing developers really getting into with InfluxDB? >>Well, there are the developers who are building companies, the startups and the folks we love to work with who are building new services and new products, especially on the consumer side of IoT; there's a lot of that. But I think you've got to pay attention to the enterprise developers as well. There are tons of people with the title of engineer in regular enterprise organizations. They're there for systems integration, for looking at what they would build versus what they would buy, and a lot of them come from a strong open source background. They know the communities, they know the top platforms in those spaces, and they're excited to adopt and use them to optimize inside the business, as compared to just building something brand new. >>It's interesting too, around open source versus closed OT systems: how do you support backwards compatibility with older systems while maintaining openness? There are dozens of data formats out there, a bunch of standards and protocols, and new things emerging. Everyone wants a control plane; everyone wants to leverage the value of data. How do you keep track of it all? What do you support? >>Either through direct connection: we have a product called Telegraf that's unbelievable, an open source edge agent you can run as close to the edge as you'd like. It speaks dozens of different protocols in its own right, a couple of which, MQTT and OPC UA, are very applicable to these IoT use cases. But also, because we are not only open source but open in terms of our ability to collect data, we have a lot of partners who have built really great integrations from their own middleware into InfluxDB. These are companies like Kepware and HighByte, who are real experts in those downstream industrial protocols. That's a business not everybody wants to be in; it requires some very specialized, very hard work and a lot of support. By making those connections and building those ecosystems, we get the best of both worlds: customers can use the platforms they need, up to the point where they put the data into our database. >>What are some of the customer testimonies they share with you? Can you share some anecdotes: wow, this is the best thing I've ever used, this really changed my business, or this is great tech that's helped me in other areas? What are some of the soundbites you hear from successful customers? >>It ranges.
You've got customers who are finally able to do monitoring of assets at the edge, in the field. We have a customer with tunnel boring machines that go deep into the earth to drill tunnels for cars and trains. They're just excited to be able to stick a database onto those tunnel boring machines, send them into the depths of the earth, and know that when they come out, all of that telemetry, at a very high frequency, has been safely stored, and then it can very quickly and instantly connect up to their centralized database. Just having that visibility is brand new to them, and that's super important. On the other hand, we have customers who are way beyond the monitoring use case, who are actually using the historical records in the time series database to, as Evan mentioned, forecast things. For predictive maintenance, they pull in the telemetry from the machines, plus all of the external enrichment data, the metadata: the temperatures, the pressures, who's operating the machine. They can easily integrate with platforms like Jupyter notebooks and the scientific computing and machine learning libraries to build and train the models, and then send that information back down to InfluxDB to apply it and detect anomalies. >>I think that's going to be a hot area. If you look at AI right now, it's all about training the machine learning algorithms after the fact, so time series becomes hugely important: the data matters post time, the first time, and then it gets updated at the new time. It's constant data cleansing, data iteration, data programming. We're starting to see this new use case emerge in the data field. >>Yeah, exactly. The ability to handle those pipelines of data smartly, intelligently, and to do everything you need to do with that data in-stream, before it hits your central repository; we make that really easy with Telegraf. Not only does it have the inputs to connect to all of those protocols and the partner data, but it also has a whole bunch of capabilities for processing that data: enriching it, reformatting it, routing it, whatever you need. At that point you're shaping your data exactly the way you want and routing it to different destinations, and that hasn't really been in the realm of possibility until this point. >>Yeah, yeah.
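A sketch of the notebook round trip described above: query telemetry into a pandas DataFrame, score it, and write anomaly flags back. The bucket, measurement, field names, and the stand-in threshold "model" are all hypothetical.

```python
# Sketch of the notebook round trip: query telemetry into pandas,
# score it, write anomaly flags back. Names and the toy "model" are
# placeholders for a real trained estimator.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="<token>", org="<org>")

# query_data_frame returns pandas DataFrames, which is convenient in Jupyter.
df = client.query_api().query_data_frame('''
from(bucket: "machines")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "spindle_vibration")
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
''')

# Stand-in for a trained model: flag readings far above the day's mean.
threshold = df["rms"].mean() + 3 * df["rms"].std()
anomalies = df[df["rms"] > threshold]

write_api = client.write_api(write_options=SYNCHRONOUS)
for _, row in anomalies.iterrows():
    write_api.write(bucket="machines", record=(
        Point("anomaly").field("rms", float(row["rms"])).time(row["_time"])
    ))
```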
>>When Evan was on, it was great; as CEO he sees the big picture with customers, and he brought up this notion that there are multiple personas involved in the InfluxDB world: the system architect, developers, users. Can you talk about that reality as customers start to commercialize and operationalize this? From a commercial standpoint you've got a relationship to the cloud, and the edge is getting super important, but cloud brings a lot of scale to the table. So what is the relationship of the edge to the cloud? Can you share your thoughts? >>You can think of the edge really as the local information. It's generally compartmentalized to a single asset or a single factory line, whatever it may be. People want to make decisions there at the edge, locally, quickly, minus the latency of taking that large volume of data, shipping it to the cloud, and doing something with it there. We allow them to do exactly that. Then they can downsample that data, or detect the really important metrics or the anomalies, and ship those to a central database in the cloud, where they can do all sorts of really interesting things: get a centralized view of all of their global assets, start to compare asset to asset, and do things like predictive analytics or larger-scale anomaly detection. >>So in this model you have a lot of commercial operations, industrial equipment, the physical plant, the physical business, and virtual data and cloud all coming together. What's the future for InfluxDB from a tech standpoint? You've got open source, an ecosystem, and customers who want operational reliability. >>Well, we got iPhones when everybody was waiting for flying cars, so I don't know that we can perfectly predict what's coming, but I think there are some givens, and those are that the world is only going to become more hybrid. We're going to have much more widely distributed situations where you have data being generated in the cloud, data being generated at the edge, and data generated at all points in between: physical locations as well as things that are very virtual. And we're building some technology right now that's going to allow the concept of a database to be much more fluid and flexible, more aligned with what a file would be like.
>>I think you bring up a good point there because one of the things that's common in the industry right now, people are talking about, this is kind of new thinking is hyper scale's always been built up full stack developers, even the old OT world, Evan was pointing out that they built everything right. And the world's going to more assembly with core competency and IP and also property being the core of their apple. So faster assembly and building, but also integration. You got all this new stuff happening. Yeah. And that's to separate out the data complexity from the app. Yes. So space genome. Yep. Driving cars throws off massive data. >>It >>Does. So is Tesla, uh, is the car the same as the data layer? >>I mean the, yeah, it's, it's certainly a point of origin. I think the thing that we wanna do is we wanna let the developers work on the world, changing problems, the things that they're trying to solve, whether it's, you know, energy or, you know, any of the other health or, you know, other challenges that these teams are, are building against. And we'll worry about that time series data and the underlying data platform so that they don't have to. Right. I mean, I think you talked about it, uh, you know, for them just to be able to adopt the platform quickly, integrate it with their data sources and the other pieces of their applications. It's going to allow them to bring much faster time to market on these products. It's gonna allow them to be more iterative. They're gonna be able to do more sort of testing and things like that. And ultimately it will, it'll accelerate the adoption and the creation of >>Technology. You mentioned earlier in, in our talk about unification of data. Yeah. How about APIs? Cuz developers love APIs in the cloud unifying APIs. How do you view view that? >>Yeah, I mean, we are APIs, that's the product itself. Like everything, people like to think of it as sort of having this nice front end, but the front end is B built on our public APIs. Um, you know, and it, it allows the developer to build all of those hooks for not only data creation, but then data processing, data analytics, and then, you know, sort of data extraction to bring it to other platforms or other applications, microservices, whatever it might be. So, I mean, it is a world of APIs right now and you know, we, we bring a very sort of useful set of them for managing the time series data. These guys are all challenged with. It's >>Interesting. You and I were talking before we came on camera about how, um, data is, feels gonna have this kind of SRE role that DevOps had site reliability engineers, which manages a bunch of servers. There's so much data out there now. Yeah. >>Yeah. It's like reigning data for sure. And I think like that ability to be like one of the best jobs on the planet is gonna be to be able to like, sort of be that data Wrangler to be able to understand like what the data sources are, what the data formats are, how to be able to efficiently move that data from point a to point B and you know, to process it correctly so that the end users of that data aren't doing any of that sort of hard upfront preparation collection storage's >>Work. Yeah. That's data as code. I mean, data engineering is it is becoming a new discipline for sure. And, and the democratization is the benefit. Yeah. To everyone, data science get easier. I mean data science, but they wanna make it easy. Right. <laugh> yeah. They wanna do the analysis, >>Right? Yeah. 
I mean, I think, you know, it, it's a really good point. I think like we try to give our users as many ways as there could be possible to get data in and get data out. We sort of think about it as meeting them where they are. Right. So like we build, we have the sort of client libraries that allow them to just port to us, you know, directly from the applications and the languages that they're writing, but then they can also pull it out. And at that point nobody's gonna know the users, the end consumers of that data, better than those people who are building those applications. And so they're building these user interfaces, which are making all of that data accessible for, you know, their end users inside their organization. >>Well, Brian, great segment, great insight. Thanks for sharing all, all the complexities and, and IOT that you guys helped take away with the APIs and, and assembly and, and all the system architectures that are changing edge is real cloud is real. Yeah, absolutely. Mainstream enterprises. And you got developer attraction too, so congratulations. >>Yeah. It's >>Great. Well, thank any, any last word you wanna share >>Deal with? No, just, I mean, please, you know, if you're, if you're gonna, if you're gonna check out influx TV, download it, try out the open source contribute if you can. That's a, that's a huge thing. It's part of being the open source community. Um, you know, but definitely just, just use it. I think when once people use it, they try it out. They'll understand very, >>Very quickly. So open source with developers, enterprise and edge coming together all together. You're gonna hear more about that in the next segment, too. Right. Thanks for coming on. Okay. Thanks. When we return, Dave LAN will lead a panel on edge and data influx DB. You're watching the cube, the leader in high tech enterprise coverage. >>Why the startup, we move really fast. We find that in flex DB can move as fast as us. It's just a great group, very collaborative, very interested in manufacturing. And we see a bright future in working with influence. My name is Aaron Seley. I'm the CTO at HBI. Highlight's one of the first companies to focus on manufacturing data and apply the concepts of data ops, treat that as an asset to deliver to the it system, to enable applications like overall equipment effectiveness that can help the factory produce better, smarter, faster time series data. And manufacturing's really important. If you take a piece of equipment, you have the temperature pressure at the moment that you can look at to kind of see the state of what's going on. So without that context and understanding you can't do what manufacturers ultimately want to do, which is predict the future. >>Influx DB represents kind of a new way to storm time series data with some more advanced technology and more importantly, more open technologies. The other thing that influx does really well is once the data's influx, it's very easy to get out, right? They have a modern rest API and other ways to access the data. That would be much more difficult to do integrations with classic historians highlight can serve to model data, aggregate data on the shop floor from a multitude of sources, whether that be P C U a servers, manufacturing execution systems, E R P et cetera, and then push that seamlessly into influx to then be able to run calculations. Manufacturing is changing this industrial 4.0, and what we're seeing is influx being part of that equation. 
>>Okay, we're now going to go into the customer panel, and we'd like to welcome Angelo Fasi, a software engineer at the Vera C. Rubin Observatory, and Caleb McLaughlin, senior spacecraft operations software engineer at Loft Orbital. Guys, thanks for joining us. Folks, you don't want to miss this interview. Caleb, let's start with you. You work for an extremely cool company; you're launching satellites into space. Doing that is, of course, highly complex and not a cheap endeavor. Tell us about Loft Orbital and what you do to attack that problem. >>Yeah, absolutely, and thanks for having me, by the way. Loft Orbital is a series B startup now, and our mission is to provide rapid access to space for all kinds of customers. Historically, if you wanted to fly something in space, to do something in space, it was extremely expensive: you needed to book a launch, build a bus, hire a team to operate it, have big software teams, and then worry about a lot of very specialized engineering. What we're trying to do is change that from a super-specialized problem with an extremely high barrier to access into an infrastructure problem, so that getting your programs, your mission, deployed on orbit, with access to different sensors, cameras, radios, and the like, is almost as simple as deploying a VM in AWS or GCP. That's our mission. And just to give a really brief example of the kind of customer we can serve: there's a really cool company called Totum Labs that is building an IoT constellation, for the internet of things, basically being able to get telemetry from all over the world. They're the first company to demonstrate indoor satellite IoT, which means you have this little modem inside a container, a container you can track from anywhere in the world as it goes across the ocean. They've been able to stay a small startup focused on their product, that super-complicated, cool radio, while we handle the whole space segment for them, which before Loft was really impossible. So that's our mission: providing space infrastructure as a service. We're kind of groundbreaking in this area, and we're serving a huge variety of customers with all kinds of different missions, and obviously generating a ton of data in space that we've got to handle. >>Amazing, Caleb, what you guys do. Now, I know you were lured to the skies very early in your career, but how did you land on this business? >>For some people, they don't necessarily know what they want to do early in their life. For me, I was five years old and I knew I wanted to be in the space industry.
I started in the Air Force, but I've stayed in the space industry my whole career, and this is actually the fifth space startup I've been a part of. I started out in satellites, spent some time working in the launch industry on rockets, and now I'm back in satellites, and honestly, this is the most exciting of the space startups I've been a part of. >>Super interesting. Okay, Angelo, let's talk about the Rubin Observatory: Vera C. Rubin, famous woman scientist, galaxy guru. Now, at the observatory, you're up way high, and you're going to get a good look at the southern sky. I know COVID slowed you down a bit, but no doubt you continued to code away on the software. I know you're getting close; you've got to be super excited. Give us the update on the observatory and your role. >>All right. Rubin is a state-of-the-art observatory under construction on a remote mountain in Chile. With Rubin, we will conduct the Legacy Survey of Space and Time: we're going to observe the sky with an eight-meter optical telescope and take a thousand pictures every night with a 3.2-gigapixel camera, and we're going to do that for 10 years, which is the duration of the survey. >>Amazing project. Now, you're a doctor of philosophy, so you probably spent some time thinking about what's out there, and then you went on to earn a PhD in astronomy and astrophysics. This is something you've been working on for the better part of your career, isn't it? >>Yeah, that's right, about 15 years. I studied physics in college, then got a PhD in astronomy, and I worked for about five years on another project, the Dark Energy Survey, before joining Rubin in 2015. >>Impressive. So it seems like you're both looking at space from two different angles. One thing you have in common, of course, is software, and you both use InfluxDB as part of your data infrastructure. How did you discover InfluxDB and get into it? How do you use the platform? Maybe Caleb, you could start. >>Yeah, absolutely. The first company where I used InfluxDB extensively was a launch startup called Astra. We were in the process of designing our first-generation rocket and testing the engines, pumps, everything that goes into a rocket. When I joined the company, our data story was not very mature: we were collecting a bunch of data in LabVIEW, and engineers were taking it over to MATLAB to process. At first, that's the way a lot of engineers and scientists are used to working, and people weren't entirely sure that needed to change. But the nice thing about InfluxDB is that it's so easy to deploy, so our software engineering team was able to get it deployed and up and running very quickly, and then quickly backport all of the data we had collected thus far into InfluxDB.
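Backporting historical test data of the kind Caleb describes is essentially a batch write. Here is a hypothetical sketch, assuming the old results live in CSV files with an epoch timestamp column; the file layout and field names are invented.

```python
# Sketch: backfilling historical test data into InfluxDB in batches.
# File layout and field names are hypothetical.
import csv
from datetime import datetime, timezone

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="<token>", org="test-stand")
write_api = client.write_api(write_options=SYNCHRONOUS)

points = []
with open("engine_test_042.csv") as f:
    for row in csv.DictReader(f):
        points.append(
            Point("engine_test")
            .tag("test_id", "042")
            .field("chamber_pressure_kpa", float(row["chamber_pressure"]))
            .time(datetime.fromtimestamp(float(row["epoch_s"]), tz=timezone.utc))
        )

# Batched writes keep the backfill fast; a few thousand points per call
# is a common batch size.
for i in range(0, len(points), 5000):
    write_api.write(bucket="test-data", record=points[i:i + 5000])
```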
Uh, there was like this aha moment of our engineers who are used to this post process kind of method for dealing with their data where they could just almost instantly easily discover data that they hadn't been able to see before and take the manual processes that they would run after a test and just throw those all in influx and have live data as tests were coming. And, you know, I saw them implementing like crazy rocket equation type stuff in influx, and it just was totally game changing for how we tested. >>So Angelo, I was explaining in my open, you know, you could, you could add a column in a traditional RDBMS and do time series, but with the volume of data that you're talking about, and the example of the Caleb just gave you, I mean, you have to have a purpose built time series database, where did you first learn about influx DB? >>Yeah, correct. So I work with the data management team, uh, and my first project was the record metrics that measured the performance of our software, uh, the software that we used to process the data. So I started implementing that in a relational database. Um, but then I realized that in fact, I was dealing with time series data and I should really use a solution built for that. And then I started looking at time series databases and I found influx B. And that was, uh, back in 2018. The another use for influx DB that I'm also interested is the visits database. Um, if you think about the observations we are moving the telescope all the time in pointing to specific directions, uh, in the Skype and taking pictures every 30 seconds. So that itself is a time series. And every point in that time series, uh, we call a visit. So we want to record the metadata about those visits and flex to, uh, that time here is going to be 10 years long, um, with about, uh, 1000 points every night. It's actually not too much data compared to other, other problems. It's, uh, really just a different, uh, time scale. >>The telescope at the Ruben observatory is like pun intended, I guess the star of the show. And I, I believe I read that it's gonna be the first of the next gen telescopes to come online. It's got this massive field of view, like three orders of magnitude times the Hub's widest camera view, which is amazing, right? That's like 40 moons in, in an image amazingly fast as well. What else can you tell us about the telescope? >>Um, this telescope, it has to move really fast and it also has to carry, uh, the primary mirror, which is an eight meter piece of glass. It's very heavy and it has to carry a camera, which has about the size of a small car. And this whole structure weighs about 300 tons for that to work. Uh, the telescope needs to be, uh, very compact and stiff. Uh, and one thing that's amazing about it's design is that the telescope, um, is 300 tons structure. It sits on a tiny film of oil, which has the diameter of, uh, human hair. And that makes an almost zero friction interface. In fact, a few people can move these enormous structure with only their hands. Uh, as you said, uh, another aspect that makes this telescope unique is the optical design. It's a wide field telescope. So each image has, uh, in diameter the size of about seven full moons. And, uh, with that, we can map the entire sky in only, uh, three days. And of course doing operations everything's, uh, controlled by software and it is automatic. 
There's a very complex piece of software called the scheduler, which is responsible for moving the telescope and the camera, which records 15 terabytes of data every night. >>Hmm. And Angelo, does all this data land in InfluxDB? What are you doing with all that data? >>Actually, not all of it. We use InfluxDB to record engineering data and metadata about the observations, like telemetry, events, and commands from the telescope. That's a much smaller data set compared to the images, but it's still challenging, because you have some high-frequency data that the system needs to keep up with, and we need to store that data and have it around for the lifetime of the project. >>Got it, thank you. Okay, Caleb, let's bring you back in. Tell us more about these dishwasher-size satellites; you're using kind of a multi-tenant model, which I think is genius. Tell us about the satellites themselves. >>Yeah, absolutely. We have some satellites already in space that, as you said, are about dishwasher or mini-fridge size, and we're working on a bunch more in a variety of sizes, from shoebox to a few times larger than what we have today. We do aim for effectively a multi-tenant model, where we buy a bus off the shelf. The bus is what you can think of as the core piece of the satellite, almost like a motherboard: it provides the power, it has the solar panels, it has some radios attached to it, and it handles the attitude control, basically steering the spacecraft in orbit. Then we build in-house what we call our payload hub, which has all the customer payloads attached and our own edge-processing capabilities built into it. We integrate that, we launch it, and because these things are in low orbit, they're orbiting the earth every 90 minutes. That's seven kilometers per second, several times faster than a speeding bullet. One of the unique challenges of operating spacecraft in low orbit is that generally you can't talk to them all the time, so we're managing these things through very brief windows of time, where we get to talk to them through our ground sites, either in Antarctica or in the north pole region. >>Talk more about how you use InfluxDB to make sense of this data, through all this tech you're launching into space. >>When I joined the company, we were storing all of that, as Angelo did, in a regular relational database, and we found it was so slow, and the size of our data would balloon over the course of a couple of days to the point where we weren't able to store everything we were getting. So we migrated to InfluxDB to store our time series telemetry from the spacecraft: power levels, voltage, currents, counts, whatever metadata we need to monitor about the spacecraft. We now store that in InfluxDB, and we can easily store the entire volume of data for the mission life so far, without worrying about the size bloating to an unmanageable amount. We can also seamlessly query large chunks of data. If I need to see, for example, as an operator, how my battery state of charge is evolving over the course of the year, I can have a plot in Influx that loads a year's worth of data in a fraction of a second, because it can intelligently group the data by a sliding time interval. It's been extremely powerful for us in accessing the data, and as time has gone on, we've gradually migrated more and more of our operating data into Influx.
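The year-long operator view Caleb describes maps naturally onto a windowed aggregation. Here is a hypothetical sketch of that query from Python, with invented bucket and field names.

```python
# Sketch: the operator's "year of battery data" view. aggregateWindow
# groups raw telemetry into daily means so a year loads in one query.
# Bucket, measurement, and field names are hypothetical.
from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086", token="<token>", org="ops")
df = client.query_api().query_data_frame('''
from(bucket: "spacecraft")
  |> range(start: -1y)
  |> filter(fn: (r) => r._measurement == "eps" and r._field == "battery_soc")
  |> aggregateWindow(every: 1d, fn: mean)
''')
print(df[["_time", "_value"]].tail())
```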
>> Let's talk a little bit about this term we throw around a lot: data driven. A lot of companies say, oh yes, we're data driven, but you guys really are. I mean, you've got data at the core. Caleb, what does that mean to you? >> Yeah, I think the clearest example of when I saw this be totally game changing is what I mentioned before at Astra, where our engineers' feedback loop went from a lot of slow research, digging into the data, to almost instantaneously seeing the data and making decisions based on it immediately, rather than having to wait for some processing. And that's something I've also seen echoed in my current role. But to give another practical example: as I said, we have a huge amount of data that comes down every orbit, and we need to be able to ingest all of that data almost instantaneously and provide it to the operator in near real time. About a second's worth of latency is all that's acceptable for us to react to what is coming down from the spacecraft, and building that pipeline is challenging from a software engineering standpoint. >> Our primary language is Python, which isn't necessarily that fast. So, in the goal of being data driven, what we've done is also publish metrics on how individual pieces of our data processing pipeline are performing into InfluxDB, and we do that in production as well as in dev, so we have a kind of production monitoring flow. What that has done is allow us to make intelligent decisions on our software development roadmap, on where it makes the most sense for us to focus our development efforts in terms of improving our software efficiency, just because we have that visibility into where the real problems are. Before we started doing this, we sometimes found ourselves chasing rabbits that weren't necessarily the real root cause of the issues we were seeing. But now that we're being a bit more data driven there, we are much more effective in where we're spending our resources and our time, which is especially critical to us as we scale from supporting a couple of satellites to supporting many, many satellites at once. >> Yeah. So you reduced those dead ends. Maybe Angelo, you could talk about what data driven means to you and your teams? >> I would say that having real-time visibility into the telemetry data and metrics is crucial for us. We need to make sure that the images we collect with the telescope have good quality and are within the specifications to meet our science goals, and if they are not, we want to know that as soon as possible and then start fixing the problems.
>> Caleb, what are your sort of event intervals like? >> I would say that, as of today, on the spacecraft, the level of timing that we deal with probably tops out at about 20 hertz, 20 measurements per second, on things like our gyroscopes. But I think the core point here, the ability to have high-precision data, is extremely important for these kinds of scientific applications, and I'll give an example from when I worked on the rocket at Astra. There, our baseline rate for ingesting data during a test was 500 hertz, 500 samples per second, and in some cases we would actually need to ingest much higher-rate data, even up to 1.5 kilohertz. So that's extremely high-precision data, where timing really matters a lot. And one of the really powerful things about InfluxDB is the fact that it can handle this. >> That's one of the reasons we chose it, because there are times when you're looking at the results of a firing where you're zooming in. I talked earlier about how, in my current job, we often zoom out to look at a year's worth of data; here you're zooming in to where your screen is occupied by a tiny fraction of a second. And you need to see, same as Angelo just said, not just the actual telemetry, which is coming in at a high rate, but the events that are coming out of our controllers. That can be something like: hey, I opened this valve at exactly this time. We want to have that at micro or even nanosecond precision, so that we know, okay, we saw a spike in chamber pressure at this exact moment; was that before or after this valve opened? That kind of visibility is critical in these scientific applications, and it's absolutely game changing to be able to see it in near real time, with a really easy way for engineers to visualize the data themselves, without having to wait for software engineers to go build it for them. >> Can the scientists do self-serve, or do you have to design and build all the analytics and queries for your scientists? >> Well, from my perspective, that's absolutely one of the best things about InfluxDB, and what I've seen be game changing: generally, anyone can learn to use it. And honestly, most of our users might not even know they're using InfluxDB, because the interface we expose to them is Grafana, a generic open-source graphing tool that is very similar to InfluxDB's own Chronograf. What it provides is a very intuitive UI for building your queries: you choose a measurement, and it shows a dropdown of available measurements; then you choose the particular field you want to look at, and again, that's a dropdown, so it's really easy for our users to discover. And there are point-and-click options for doing math and aggregations; you can even do predictions, all within the Grafana user interface, which is really just a wrapper around the APIs and functionality that InfluxDB provides.
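Ordering a chamber-pressure spike against a valve event is exactly what nanosecond timestamps buy you. As a hedged sketch (the measurement names, tags, and batching settings below are invented for illustration), high-rate test-stand writes with explicit nanosecond timestamps might look like this with the InfluxDB 2.x Python client:

```python
from influxdb_client import InfluxDBClient, WriteOptions

client = InfluxDBClient(url="http://localhost:8086", token="MY_TOKEN", org="my-org")

# Batched, asynchronous writes help keep up with hundreds of samples per
# second; these particular settings are hypothetical.
write_api = client.write_api(
    write_options=WriteOptions(batch_size=5_000, flush_interval=1_000)
)

# Line protocol with explicit nanosecond timestamps, so an event such as a
# valve opening can later be ordered against a pressure spike.
t_ns = 1_652_345_678_123_456_789  # example epoch time in nanoseconds
write_api.write(
    bucket="test_stand",
    record=[
        f"valve_event,valve=ox_main state=1i {t_ns}",
        f"chamber,sensor=pt_1 pressure_psi=812.4 {t_ns + 2_000}",  # 2 µs later
    ],
)
write_api.close()  # flush any buffered points
client.close()
```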
>> Putting data in the hands of those who have the context, the domain experts, is key. Angelo, is it the same situation for you? Is it self-serve? >> Yeah, correct. As I mentioned before, we have the astronomers making their own dashboards, because they know exactly what they need to visualize. It's all about using the right tool for the job. For us, when I joined the company, we weren't using InfluxDB, and we were dealing with serious issues of the database growing to an incredible size extremely quickly; even querying short periods of data was taking on the order of seconds, which is just not workable for operations. >> Guys, this has been really informative. It's pretty exciting to see how the edge is mountaintops, low orbits; space is the ultimate edge, isn't it? I wonder if you could answer two questions to wrap here: what comes next for you guys, and is there something you're really excited about that you're working on? Caleb, maybe you could go first, and Angelo, you can bring us home. >> Basically, what's next for Loft Orbital is more satellites and a greater push towards infrastructure, really making, well, our mission is to make space simple for our customers and for everyone, and we're scaling the company like crazy to make that happen. It's an extremely exciting time to be in this company, and in this industry as a whole, because there are so many interesting applications out there, so many cool ways of leveraging space that people are taking advantage of, and with companies like SpaceX and the now rapidly lowering cost of launch, it's just a really exciting place to be. We're launching more satellites, we're scaling up for some constellations, and our ground system has to be improved to match, so there are a lot of improvements we're working on to really scale up our control software, to be best in class and capable of handling such a large workload. >> You guys hiring? >> <laugh> We are absolutely hiring. We have positions all over the company: we need software engineers, we need people who do more aerospace-specific stuff. So absolutely, I'd encourage anyone to check out the Loft Orbital website if this is at all interesting. >> All right. Angelo, bring us home. >> Yeah. So what's next for us is really getting this telescope working and collecting data, and when that happens, it's going to be just a deluge of data coming out of this camera, and handling all that data is going to be really challenging. I want to be here for that. <laugh> I'm looking forward to it. For next year, we have an important milestone, which is our commissioning camera, a simplified version of the full camera, going on sky, and most of the system has to be working by then. >> Nice. All right, guys, with that, we're going to end it. Thank you so much. Really fascinating, and thanks to InfluxDB for making this possible. Really groundbreaking stuff, enabling value creation at the edge, in the cloud, and of course beyond, in space. Really transformational work you guys are doing, so congratulations, and I really appreciate the broader community. I can't wait to see what comes next from this entire ecosystem. Now, in a moment, I'll be back to wrap up. This is Dave Vellante, and you're watching theCUBE, the leader in high-tech enterprise coverage.
>> Welcome. Telegraf is a popular open-source data collection agent. Telegraf collects data from hundreds of systems, like IoT sensors, cloud deployments, and enterprise applications. It's used by everyone from individual developers and hobbyists to large corporate teams. The Telegraf project has a very welcoming and active open-source community. Learn how to get involved by visiting the Telegraf GitHub page. Whether you want to contribute code, improve documentation, participate in testing, or just show what you're doing with Telegraf, we'd love to hear what you're building. >> Thanks for watching Moving the World with InfluxDB, made possible by InfluxData. I hope you learned some things and are inspired to look deeper into where time series databases might fit into your environment. If you're dealing with large and/or fast data volumes, and you want to scale cost-effectively with the highest performance, and you're analyzing metrics and data over time, time series databases just might be a great fit for you. Try InfluxDB out. You can start with a free cloud account by clicking on the link in the resources below. Remember, all these recordings are going to be available on demand at thecube.net and influxdata.com, so check those out, and poke around InfluxData; they are the folks behind InfluxDB and one of the leaders in the space. We hope you enjoyed the program. This is Dave Vellante for theCUBE. We'll see you soon.

Published Date : May 12 2022

Brian Gilmore, InfluxData


 

>> Okay. Now we're joined by Brian Gilmore, director of IoT and emerging technologies at InfluxData. Welcome to the show. >> Thank you, John. Great to be here. >> We just spent some time with Evan going through the company and the value proposition with InfluxDB. What's the momentum? Where do you see this coming from? What's the value coming out of this? >> Well, I think we're sort of hitting a point where adoption of the technology is becoming mainstream. We're seeing it in all sorts of organizations, everybody from the most well-funded, advanced, big technology companies to the smaller academics and the startups, and the data that emits from that technology is time series. Us being able to give them a platform, a tool that's super easy to use, easy to start with, and that of course will grow with them, has been key to us sort of riding along with them as they're successful. >> Evan was mentioning that time series has been on everyone's radar, and it's been in the OT business for years. You go back to 2013, '14, even five years ago, to that convergence of physical and digital coming together, the IP-enabled edge. Edge has always been kind of hyped up, but why now? Why is the edge so hot right now from an adoption standpoint? Is it just evolution, the tech getting better? >> I think it's twofold. I think everybody was so focused on cloud over the last probably 10 years that they forgot about the compute that was available at the edge, and those, especially in OT and on the factory floor, who weren't able to take full advantage of cloud through their applications still needed to be able to leverage that compute at the edge. The big thing we're seeing now, which is interesting, is that there's a hybrid nature to all of these applications: there's definitely some data that's generated at the edge, and there's definitely some data that's generated in the cloud, and it's the ability for a developer to tie those two systems together and work with that data in a very unified, uniform way that's giving them the opportunity to build solutions that really deliver value to whatever it is they're trying to do, whether it's the outer reaches of outer space or optimizing the factory floor. >> I think one of the things you also mentioned is that big data is coming to the real world, and IoT has been kind of this thing for OT and some use cases. But now, with the cloud, all companies have an edge strategy. So what's the secret sauce? Because now this is a hot product for the whole world, not just industrial, but all businesses. What's the secret sauce? >> Well, part of it is just that the technology is becoming more capable, and that's especially true on the hardware side: compute is getting smaller and smaller and smaller. And we find that supporting all the way down to the edge, even to the microcontroller layer, with our client libraries, and working hard to make our applications, especially the database, as small as possible, so they can be located as close as possible to the point of origin of the data at the edge, is fantastic.
Now you can take that and run it locally. You can do your local decision making, and you can use InfluxDB as an input to the automation, control, and autonomy that people are trying to drive at the edge. But when you link it up with everything that's in the cloud, that's when you get all of the cloud-scale capabilities of parallelized AI and machine learning and all of that. >> What's interesting is the open-source success, something we've talked about a lot in theCUBE, how people are leveraging that. You have users in the enterprise, users in the IoT market, but you've got developers now, too. How do you see that emerging? How do developers engage? What are some of the things you're seeing developers really getting into with InfluxDB? >> Well, there are the developers who are building companies, right? These are the startups and the folks we love to work with, who are building new services and new products, and especially on the consumer side of IoT, there's a lot of that. But I think you've got to pay attention to those enterprise developers as well. There are tons of people with the title of engineer in regular enterprise organizations. They're there for systems integration, they're there for looking at what they would build versus what they would buy, and a lot of them come from a strong open-source background. They know the communities, they know the top platforms in those spaces, and they're excited to be able to adopt and use those to optimize inside the business, as compared to just building a brand new one. >> It's interesting, too: when Evan and I were talking about open source versus closed OT systems, how do you support the backwards compatibility of older systems while maintaining openness? There are dozens of data formats out there, a bunch of standards and protocols, and new things are emerging. Everyone wants to have a control plane, everyone wants to leverage the value of data. How do you keep track of it all? What do you support? >> Well, one way is through direct connection. We have a product called Telegraf, and it's unbelievable. It's open source, it's an edge agent, and you can run it as close to the edge as you'd like. It speaks dozens of different protocols in its own right, a couple of which, MQTT and OPC UA, are very applicable to these IoT use cases. But then we also, because we are not only open source but open in terms of our ability to collect data, have a lot of partners who have built really great integrations from their own middleware into InfluxDB. These are companies like Kepware and HighByte, who are real experts in those downstream industrial protocols. That's a business not everybody wants to be in; it requires some very specialized, very hard work and a lot of support. And so by making those connections and building those ecosystems, we get the best of both worlds: the customers can use the platforms they need, up to the point where they'd be putting data into our database.
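In practice, Telegraf covers a hop like MQTT-to-InfluxDB with a few lines of configuration, no code required. Purely to illustrate the data flow Brian describes (and not how Telegraf itself is implemented), here is a hedged Python sketch of the same bridge, using the paho-mqtt 1.x client API; the broker address, topic, and payload format are invented:

```python
import paho.mqtt.client as mqtt
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details for both ends of the bridge.
influx = InfluxDBClient(url="http://localhost:8086", token="MY_TOKEN", org="my-org")
write_api = influx.write_api(write_options=SYNCHRONOUS)

def on_message(client, userdata, msg):
    # Assume each payload is a bare numeric reading, e.g. b"23.7".
    value = float(msg.payload.decode())
    point = Point("sensor").tag("topic", msg.topic).field("value", value)
    write_api.write(bucket="factory", record=point)

client = mqtt.Client()  # paho-mqtt 1.x constructor style
client.on_message = on_message
client.connect("broker.local", 1883)
client.subscribe("plant1/line3/#")
client.loop_forever()
```

Telegraf's advantage over a hand-rolled bridge like this is exactly what the conversation highlights: dozens of protocols, plus batching, buffering, and in-flight processing, maintained by the community.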
>> What are some of the customer testimonials they share with you? Can you share some anecdotes, kind of like, wow, that's the best thing I've ever used, this really changed my business, or, this is great tech that helped me in these other areas? What are some of the sound bites you hear from customers when they're successful? >> I think it ranges. You've got customers who are just finally being able to do the monitoring of assets out at the edge, in the field. We have a customer who has these tunnel boring machines that go deep into the earth to drill tunnels for cars and trains and things like that. They are just excited to be able to stick a database onto those tunnel boring machines, send them into the depths of the earth, and know that when they come out, all of that telemetry, at a very high frequency, has been safely stored, and then it can very quickly and instantly connect up to their centralized database. Just having that visibility is brand new to them, and that's super important. On the other hand, you have customers who are way beyond the monitoring use case, where they're actually using the historical records in the time series database to, as Evan mentioned, forecast things. So for predictive maintenance: being able to pull in the telemetry from the machines, but then also all of that external enrichment data, the metadata, the temperatures, the pressures, who was operating the machine, those types of things, and being able to easily integrate with platforms like Jupyter notebooks, or any of those scientific computing and machine learning libraries, to build the models and train the models. Then they can send that information back down to InfluxDB to apply it and detect those anomalies.
>> I think that's going to be an area; I personally think that's a hot area, because if you look at AI right now, it's all about training the machine learning algorithms after the fact. So time series becomes hugely important, because now you're thinking, okay, the data matters post-time, and then it gets updated at the new time. So it's constant data cleansing, data iteration, data programming. We're starting to see this new use case emerge in the data feed. >> Yeah, I agree, of course. It's the ability to handle those pipelines of data smartly, intelligently, and then to be able to do all of the things you need to do with that data in-stream, before it hits your central repository. And we make that really easy for customers with Telegraf: not only does it have the inputs to connect up to all of those protocols and the ability to capture and connect up to the partner data, but it also has a whole bunch of capabilities around processing that data, enriching it, reformatting it, routing it, doing whatever you need. So at that point, you're basically shaping your data in exactly the way you want, you're routing it to the destinations you need, and it's not something that has really been in the realm of possibility until this point. >> Yeah. And when Evan was on, it was great; he's the CEO, so he sees the big picture with customers. He kind of put the package together that said, hey, we've got a system, we've got customers, people want to leverage our product. So you have that whole CEO perspective, but he brought up this notion that there are multiple personas involved in the InfluxDB system: architects, developers, and users. Can you talk about that reality, as customers start to commercialize and operationalize this? You've got a relationship to the cloud, and the edge is getting super important, but cloud brings a lot of scale to the table. So what is the relationship of the edge to the cloud? Can you share your thoughts? >> You can think of the edge really as the local information. It's generally compartmentalized to a point, a single asset or a single factory line, whatever. What people want is to be able to make the decisions there at the edge, locally and quickly, minus the latency of taking that large volume of data, shipping it to the cloud, and doing something with it there. So we allow them to do exactly that. Then what they can do is down-sample that data, or detect the really important metrics or the anomalies, and ship those to a central database in the cloud, where they can do all sorts of really interesting things with them, like getting that centralized view of all of their global assets. You can start to compare asset to asset, and then you can do the things we talked about, like predictive types of analytics or larger-scale anomaly detection.
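That edge-to-cloud pattern, keep the raw data local and forward only summaries, can be sketched with two InfluxDB instances. This is a hedged illustration of the flow Brian describes, not InfluxData's actual replication feature; the URLs, tokens, and bucket names are invented:

```python
from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS

# Two deployments of the same database: one at the edge, one in the cloud.
edge = InfluxDBClient(url="http://edge-box:8086", token="EDGE_TOKEN", org="plant")
cloud = InfluxDBClient(url="https://cloud.example.com", token="CLOUD_TOKEN", org="hq")

# Down-sample the last hour of high-rate vibration data to one-minute means.
flux = '''
from(bucket: "machine_raw")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "vibration")
  |> aggregateWindow(every: 1m, fn: mean, createEmpty: false)
'''
summary = edge.query_api().query(flux)

# Ship only the summary upstream, where fleet-wide comparison happens.
writer = cloud.write_api(write_options=SYNCHRONOUS)
for table in summary:
    for rec in table.records:
        ts_ns = int(rec.get_time().timestamp() * 1e9)
        writer.write(
            bucket="machine_summary",
            record=f"vibration,site=plant1 mean={rec.get_value()} {ts_ns}",
        )
```

Run on a schedule, a loop like this keeps cloud storage and egress proportional to the summaries, not to the raw sample rate.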
Like, you know, either in low earth orbit or, you know, all the, you sort of on the other side of the universe and, and to be able to process data like that and to do so in a way it's it's we gotta, we gotta build the fundamentals for that right now on the factory floor and in the mines and in the tunnels. Um, so that we'll be ready for that >>One. I think you bring up a good point there because one of the things that's common in the industry right now, people are talking about, this is kind of new thinking is hyper scale's always been built up full stack developers, even the old OT world that Evan was pointing out, that they built everything. Right. And the world's going into more assembly with core competency and IP and also property being the core of their apple. So faster assembly and building <affirmative>, but also integration. You got all this new stuff happening. Yeah. And that's to separate out the data complexity from the app. Yes. So space genome. Yep. Driving cars throws off massive data. >>It does. >>So is Tesla and there is the car the same as the data layer. >>I mean, yeah. It's, it's certainly a point of origin. I think the thing that we wanna do is we wanna let the developers work on the world, changing problems, the things that they're trying to solve, whether it's, you know, energy or, you know, any of the other health or, you know, other challenges that these teams are, are building against. And we'll worry about that time series data in the underlying data platforms so that they don't have to. Right. I mean, I think you talked about it, uh, you know, for them just to be able to adopt the platform quickly, integrate it with their data sources and the other pieces of their applications. It's going to allow them to bring much faster time to market on these products. It's gonna allow them to be more iterative. They're gonna be able to do more sort of testing and things like that. And ultimately will it'll accelerate the adoption and the creation of >>Technology. You mentioned earlier in, in our talk about unification of data. Yeah. How about APIs? Cuz developers love APIs in the cloud unifying APIs. How do you view view that? >>Yeah, I mean, we are APIs, that's the product itself. Like everything people like to think of it is sort of having this nice front end, but the front end is B built on our public APIs. Um, you know, and it, it allows the developer to build all of those hooks for not only data creation, but then data processing, data analytics, and then, you know, sort of data extraction to bring it to other platforms or other applications, microservices, whatever it might be. So, I mean, it is a world of APIs right now and you know, we, we bring a very sort of useful set of them for managing the time series data. These guys are all challenged with. >>It's interesting. You and I were talking before we came on camera about how, um, data feels gonna have this kind of SRE role that DevOps had site reliability engineers, which managed a bunch of there's so much data out there now. Yeah. >>Yeah. It's like raining data for sure. 
And I think one of the best jobs on the planet is going to be to be that data wrangler: someone able to understand what the data sources are and what the data formats are, how to efficiently move that data from point A to point B, and how to process it correctly, so that the end users of that data aren't doing any of that hard upfront preparation, collection, and storage work. >> That's Data as Code. I mean, data engineering is becoming a new discipline, for sure, and the democratization is the benefit to everyone. Data science gets easier. I mean, data scientists want to make it easy, right? <laugh> Yeah. They want to do the analysis, right? >> Yeah, I mean, it's a really good point. We try to give our users as many ways as there possibly could be to get data in and get data out. We sort of think about it as meeting them where they are. So we build the client libraries that allow them to write to us directly from the applications and the languages they're working in, but then they can also pull the data out, and at that point, nobody's going to know the users, the end consumers of that data, better than the people who are building those applications. And so they're building the user interfaces which make all of that data accessible for their end users inside their organizations. >> Well, Brian, great segment, great insight. Thanks for sharing all the complexities in IoT that you guys help take away, with APIs and assembly, and all the system architectures that are changing. Edge is real, cloud is real, absolutely mainstream in enterprises, and you've got developer traction too. So congratulations. >> Yeah, it's great. >> Well, thank you. Any last word you want to share? >> No, just, I mean, please: if you're going to check out InfluxDB, download it, try out the open source, contribute if you can. That's a huge thing; it's part of being in the open-source community. But definitely just use it. I think once people use it and try it out, they'll understand very, very quickly. >> Awesome. Open source, with developers, enterprise, and edge coming together. >> All together, all together. You're going to hear more about that in the next segment, too. >> Thanks for coming on. >> Okay. Thanks. >> When we return, Dave Vellante will lead a panel on the edge and data with InfluxDB. You're watching theCUBE, the leader in high-tech enterprise coverage.

Published Date : Apr 19 2022

Wen Phan, Ahana & Satyam Krishna, Blinkit & Akshay Agarwal, Blinkit | AWS Startup Showcase S2 E2


 

(gentle music) >> Welcome everyone to theCUBE's presentation of the AWS Startup Showcase. The theme is Data as Code: The Future of Enterprise Data and Analytics. This is season two, episode two of the ongoing series covering the exciting startups in the AWS ecosystem around data analytics and cloud computing. I'm your host, John Furrier. Today we're joined by great guests here, three guests: Wen Phan, who's a Director of Product Management at Ahana; Satyam Krishna, Engineering Manager at Blinkit; and Akshay Agarwal, Senior Engineer at Blinkit as well. We're going to get into the relationship there. Let's get into it. We're going to talk about how Blinkit is using the open data lakehouse with Presto on AWS. Gentlemen, thanks for joining us. >> Thanks for having us. >> So we're going to get into the deep dive on the open data lakehouse, but I want to just quickly get your thoughts on what it is, for the folks out there. Set the table. What is the open data lakehouse? Why is it important? What's in it for the customers? Why are we seeing adoption around this? Because this is a big story. >> Sure. Yeah, the open data lakehouse is really being able to run a gamut of analytics, whether it be BI, SQL, machine learning, or data science, on top of the data lake, which is based on inexpensive, low-cost, scalable storage, and, more importantly, on top of open formats. To the end customer, this really offers a tremendous range of flexibility: they can run a bunch of use cases on the same storage, with great price performance. >> Do you guys have any other thoughts? What's your reaction to the lakehouse? What is your experience with it? What's going on at Blinkit? >> No, I think for us also, it has been the primary driver of how, as a company, we have shifted our complete delivery model from delivering in one day to delivering in 10 minutes, right? And a lot of this was made possible by having this kind of architecture in place, which helps us be more open-source-oriented: the tools are open-source, and we have an open table format, which helps us be very modular in nature, meaning we can pick the solutions that work best for us. And that is the kind of architecture we want to be in. >> Awesome. Wen, you know, last time we chatted with Ahana, we had a great conversation around Presto and data. The theme of this episode is Data as Code, which is interesting, because all the conversations in these episodes are around developers; administrators are turning into developers, and there's a developer vibe with data. And with open source, it's software. Now you've got data taking a similar trajectory to how software development was with code, but the people running data aren't developers; they're administrators, they're operators. Now they're turning into DataOps. So it's kind of a similar vibe going on, with branches, and taking stuff out of and putting it back in, and testing it; datasets becoming much more stable; iterating on machine learning algorithms. This is a movement. What's your guys' reaction, before we get into the relationships here with you guys? What's your reaction to this Data as Code movement? >> Yeah, so I think the folks at Blinkit are doing a great job there. I mean, they have a pretty compact data engineering team, and they have some pretty stringent SLAs, in terms of time to value and reliability, and what that ultimately translates to for them is not only flexibility but reliability.
So they've done some fantastic work on a lot of automation, a lot of integration with code, in their data pipelines, and I'm sure they can give the details on that. >> Yes. Satyam and Akshay, you guys are software engineers, but this is becoming a whole other paradigm, where the frontline coding and engineering work, data engineering, is implementing the operations as well. It's kind of like DevOps for data. >> For sure, right. And I think whenever you're working, even as a software engineer, the understanding of the business is equally important. You cannot be working on something and be away from the business, right? And that's where, like I mentioned earlier, we realized we had to completely move our stack and start delivering analytics in 10 minutes, because when you're delivering in 10 minutes, your leaders want to make decisions in real time. That means you need to move with them; you need to move with the business. And when you do that, the kind of flexibility these tools give is what enables the business at the end of the day. >> Awesome. Is there going to be a book called Agile Data Warehouses? I don't think so. >> I think so. (laughing) >> The agile cloud data. This is cool. So let's get into what you guys do. What is Blinkit up to? What do you guys do? Can you take a minute to explain the company and your product? >> Sure, I'll take that. So Blinkit is India's biggest 10-minute delivery platform. It pioneered the delivery model in the country, with over 10 million Indians shopping on our platform for everything: grocery staples, vegetables, emergency services, electronics, and much more. It currently delivers over 200,000 orders every day, and it is in a hurry to bring the future of commerce to everyone in India. >> What's the relationship between Ahana and Blinkit? Wen, what's the tie-in? >> Yeah, so Blinkit had a pretty well-formed stack. They needed a little bit more flexibility and control, and they thought a managed service was the way to go. Here at Ahana, we provide a SaaS managed service for Presto. So they engaged us, and they evaluated our offering, and, more importantly, we were able to partner. As an early-stage startup, we really rely on very strong partners with great use cases who are willing to collaborate, and the folks at Blinkit have been really great in helping us push and develop our product, and we've been very happy about the value we've been able to deliver to them as well. >> Okay. So let's unpack the open data lakehouse. What is it? What's under the covers? Let's get into it. >> Sure. So if I bring up a slide: like I said before, it's really a paradigm of being able to run a gamut of analytics on top of the open data lake. So what does that mean, and how did it come about? On the left-hand side of the slide, we are coming out of a world where, for the last several decades, the primary workhorse for SQL-based processing, reporting, and dashboarding use cases was the data warehouse. What we're seeing is a shift, driven by the trend toward inexpensive, scalable cloud storage; the proliferation of open formats, which let you use this storage and still get certain guarantees of reliability and performance; and the adoption of frameworks that can operate on top of this cloud data lake. So while here at Ahana we're primarily focused on SQL workloads and Presto, this architecture really allows for other types of frameworks, and you see the ML and AI side.
And, to Satyam's point earlier, it offers a great amount of flexibility and modularity for many use cases in the cloud. So really, that's the lakehouse, and people like it for the performance, the openness, and the price performance. >> How's the open-source side of it playing in? It's kind of open formats. What is the open-source angle on this? Because there are a lot of different approaches. I'm hearing open formats; you have data stores, which are a big part of seeing that; you've got SQL, you mentioned SQL. There's a mishmash of opportunities. Is it all coexisting? Is it one tool to rule the world, or is it interchangeable? What's the open-source angle? >> There are multiple angles, and I'll definitely let Satyam add to what I'm saying. This was definitely a big piece for Blinkit. So on one hand, you have the open formats, and what the open formats really enable is multiple compute engines working on that data. And that's very huge, because it's open: you're not locked in. I think the other part of open that is important, and I think it was important to Blinkit, is the governance around that. So in particular, Presto is governed by the Linux Foundation, and so, as a customer of open-source technology, they want some assurances about things like: how is it governed? Is the license going to change? So there's that aspect of openness that I think is very important. >> Yeah. Blinkit, what's the data strategy here, with the lakehouse and you guys? Why are you adopting this type of architecture? >> So, adding to what Wen said, right: when we are thinking in terms of all these open stacks, you have these open table formats and everything deployed over the cloud, and the primary reason is modularity. It's as simple as that. You can plug and play so many different table formats, from one thing to another, based on the use case you're trying to serve, so that you get the most value out of the data. I'll give you a very simple example. We don't even use one single table format; it's not that one thing solves for everything. We use both Hudi and Iceberg to solve for different use cases. One is good when you're working with a certain data size; Iceberg works well when you're in the SQL kind of interface, and Hudi's still trying to reach there; it's going to get there very soon. So having the ability to plug and play different formats based on the use case helps you grow faster and take decisions faster, because you're no longer stuck on the one thing you happened to implement. That's what's great about this data lake strategy: it keeps you cost-effective. >> So the enablement is basically use-case driven. You don't have to be re-architecting for use cases; you can simply plug and play based on what you need for the use case. >> Yeah, and again, you can focus on your business use case. You can figure out what your business users need and not worry about these things, because that's where Presto comes in: it helps you stitch that data together across multiple data formats and gives you the performance that you need, and it works out best there. And that's something you don't get with traditional warehouses these days, right? The kind of thing that we need, you don't get that. >> I do want to add, just to riff on what Satyam said; I think it's pretty interesting.
It really allowed them to take the best of breed of what they were seeing in the community, right? So in the case of table formats, you've got Delta, you've got Hudi, you've got Iceberg, and they've all got their own roadmaps; it's kind of organic how these different communities want to evolve, and I think that's great. But you have these end consumers, like Blinkit, who have different, maybe overlapping, use cases, and they're not forced to pick one. When you have an open architecture, they can really put together best of breed, and as these projects evolve, they can continue to monitor them, make decisions, and remain agile based on the landscape and how it's evolving. >> So the agility is a key point: flexibility and agility, and time to value with your data. >> Yeah. >> All right. Wen, I've got to get into why Presto is important here. Where does it fit in? Why is Presto important? >> Yeah. For me, it all comes down to the use cases and the needs, and reporting and dashboarding are not going to go away anytime soon. It's a very common use case; many of our customers, like Blinkit, come to us for that use case. The difference now is that today, people want to do that particular use case on top of the modern data lake, on top of scalable, inexpensive, low-cost storage. In addition to that, there's a need for this low-latency, interactive ability to engage with the data. That often arises when you need to do things on an ad hoc basis, or when you're in the developmental phase of building things up. So if that's what your need is, and latency is important, and getting your arms around the problems is important, and you have a certain SLA, you need to deliver something: that puts some requirements on the technology, and Presto is ideal for that use case. It's distributed, it's scalable, it's in-memory, and so it's able to really provide that. I think the other benefit of Presto, and why we're betting on Presto, is that it works well on the data lakes, but you have to think about how these organizations are maturing with this technology. It's not necessarily all or nothing. You have organizations whose data lake is augmented with other analytical data stores, like Snowflake or Redshift. So a core aspect of Presto is its ability to federate, to connect to and query across different data sources. This can be a permanent thing; it can also be transitionary. We have some customers that are slowly shifting their data portfolio, from maybe all data warehouse to 80% data lake, and Presto gives them that optionality, that ability to transition over a timeframe. But for all those reasons, the latency, the scalability, the federation, that's why Presto for this particular use case.
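Wen's federation point is easiest to see in a query. As a hedged sketch (using the presto-python-client package; the coordinator address, catalogs, and table names are invented for illustration), a single ANSI SQL statement can join a data lake table with a relational store, each exposed as a Presto catalog:

```python
import prestodb  # the presto-python-client package

# Placeholder coordinator and session details.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",   # data lake tables via the Hive connector
    schema="default",
)
cur = conn.cursor()

# One statement federates two sources via fully qualified
# catalog.schema.table names (both tables are hypothetical).
cur.execute("""
    SELECT o.city,
           count(*) AS orders,
           avg(d.delivery_minutes) AS avg_minutes
    FROM hive.analytics.orders o
    JOIN postgresql.ops.deliveries d ON o.order_id = d.order_id
    WHERE o.order_date >= date '2022-01-01'
    GROUP BY o.city
    ORDER BY orders DESC
""")
for row in cur.fetchall():
    print(row)
```

Because the join happens inside Presto, neither source needs the other's data copied into it, which is the optionality Wen describes for teams mid-transition from warehouse to lake.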
So basically, it has a set of connectors where you can connect with real-time databases like Pinot or Druid, along with your warehouses like Redshift, along with your data lake based on Hudi or Iceberg. So it's a very broad landscape that you can use with Presto. And consumers like the analysts don't need to learn different querying paradigms for different sources. They just need to learn a single interface, and they get a single place to consume from, and a single destination to write to as well. So it's a homogeneous architecture, which allows you to put central security in place, which Presto integrates with. It's also based on an open, Apache-licensed engine. And it has certain innovative features, like caching, which reduce a lot of the cost. And since you have decoupled your storage from the compute, you can further reduce your cost, because the biggest part of a traditional warehouse is the storage, and the cost goes massively upwards with the amount of data that you add. Basically, each time you add more data, you require more storage, and warehouses ask you to write the data in their own format. Over here, since we have decoupled that, the storage costs have gone down. You pay for the data you actually write, and you just pay for the compute, and you can scale in and scale out based on the requirements. If you have high traffic, you scale out; if you have low traffic, you scale in. So all of that. >> So huge cost savings. >> Yeah. >> Yeah. Cost effectiveness, for sure. >> Cost effectiveness, and you get very good price value out of it. For each query, you can estimate what the cost is for you, based on that tracking and all those things. >> I mean, if you think about the classic iceberg, it's what's under the water that you don't know about: the hidden cost. You think about the tooling, right, and also the time it takes to do stuff. So if you have flexibility and choice, and we were riffing on this last time we chatted with you guys, and you brought it up earlier: you can have the open formats and let different use cases run on different tools or different platforms. You can use Redshift here, or use something over there. You don't have to get locked in. >> Absolutely. >> Satyam & Akshay: Yeah. >> Lock-in is a huge problem. How do you guys see that? 'Cause it sounds like here there's not a lot of lock-in. You've got the open formats, and you've got choice. >> Yeah. So you get the best of both worlds. With Ahana, or with Presto, you get the best of both worlds. Since it's cloud native, you can deploy your clusters very easily; within like five minutes, your cluster is up and you can start working on it. You can deploy multiple clusters for multiple teams. You also get the flexibility of adding new connectors, since it's open, and it's much more secure since it's cloud native, so you can control your security endpoints very well. All those things come together in this architecture. So you can definitely go more toward the lakehouse architecture than warehousing when you want to deliver data value faster, and you get much higher value out of your data, on a shorter timeline.
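Here is a hedged sketch of the federation just described: one Presto query joining a lake table with a dimension table living in Redshift. It assumes a cluster where a hive catalog (the data lake) and a redshift catalog (via Presto's Redshift connector) are both configured; all names are illustrative placeholders.

import prestodb  # pip install presto-python-client

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="lake",
)
cur = conn.cursor()

# One ANSI SQL statement spanning two systems: the lake and a warehouse.
cur.execute("""
    SELECT s.region,
           count(*)             AS orders,
           avg(o.delivery_mins) AS avg_delivery_mins
    FROM hive.lake.orders o          -- data lake table (Hudi/Iceberg via metastore)
    JOIN redshift.public.stores s    -- warehouse dimension table
      ON o.store_id = s.store_id
    GROUP BY s.region
    ORDER BY orders DESC
""")
for region, orders, avg_mins in cur.fetchall():
    print(region, orders, round(avg_mins, 1))

Because the join is resolved by the engine rather than by an ETL copy, the warehouse side can shrink, grow, or disappear over time without the consuming SQL changing, which is the "transition over a timeframe" point made above.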
>> So Satyam, it sounds like the old warehousing was for the application person: not a lot of usage, a lot of latency, okay here and there. But now you've got speed to deploy clusters, scale up, scale down. Application developers are everyone; it's not one person, it's not one group, it's whenever you want. So you've got speed, you've got more diversity in the data opportunities, and your coding. >> Yeah. I think data warehouses are a way to start for every organization that is getting into data. Data warehousing is still a solution, and will be a solution, for a lot of teams that are just getting into data. But as soon as you start scaling, as you start seeing the cost going up, as you start seeing the number of use cases adding up, having an open format definitely helps. So I would say that's where we are also heading, and that's how our journey with Presto started as well, and why we even thought about Ahana, right. >> (John chuckles) >> So, like you mentioned, one of the things that happened was, as we were moving to the lakehouse and the open table formats, Ahana was one of the first ones in the market to have Hudi as a first-class citizen, completely supported, with things that were not even present in Presto at the time, right. So we see Ahana working behind the scenes, improving some of these things ahead of the open-source ecosystem. And that's where we get the most value out of Ahana as well. >> This is the convergence of open-source magic and commercialization. Wen, because you think about Data as Code, it reminds me: I hear, "Data warehouse, it's not going to go away." But you've got cloud scale. It reminds me of the old "Oh yeah, I have a data center." Well, here comes the cloud. It doesn't really kill the data center, although Amazon would say that the data center's going to be eliminated. No, you just use it for whatever you need it for, for specific use cases, but all the action goes to the cloud for scale. The same thing happens with data, and look at the open-source community: it's kind of coming together. Data as Code is coming together. >> Yeah, absolutely. >> Absolutely. >> I do want to connect another dot in terms of cost. We've been talking a little bit about price performance, but there's an implicit cost, and I think this was also very important to Blinkit, and also why we're offering a managed service. So that's one piece of it, and it really revolves around the people, right? Outside of the technology and the performance, one thing that Akshay brought up, and it's another important piece that I should have highlighted a little bit more, is that Presto exposes the ability to interact with your data in a widely adopted way, which is basically ANSI SQL. So the ability for your practitioners to use this technology is huge. That's just regular Presto. In terms of a managed service, the guys at Blinkit are a great, high-performing team, but they have to be very efficient with their time and what they manage. What we're trying to do is provide leverage for them: take a lot of the heavy lifting away, but at the same time figure out the right things to expose so that they keep that same flexibility. That's been the balancing point we've been trying to strike at Ahana, but it goes back to cost. What's my total cost of ownership? And that doesn't include just the actual query processing time, but the ability for the organization to absorb the solution, and what it costs in terms of the people involved. >> Yeah. Great conversation.
I mean, this brings up the question of, back in the data center and early cloud days, you had the concept of an SRE, which is now popular: the site reliability engineer. One person does all the clusters and manages all the scale. Is the data engineer the new SRE for data? Are we seeing a similar trajectory? Just want to get your reaction. What do you guys think? >> Yes, I would say definitely. It depends on the team and its size. We are a high-performing team, so each engineer takes on pieces of the architecture, where they want to invest. And it comes down to the value of the engineers' time: how much they can invest, how much they need to configure the architecture, and how much time it takes to get to market. This is what I would also highlight as an engineer. I found Ahana, as Presto in a cloud-native environment, to be the one in the market that seamlessly scales in and scales out. And further, with a team our size, three to four engineers, managing clusters day in, day out, configuring, tuning, and all those things takes a lot of time. Ahana came in, took that off our plate, and handed us a solution that works out of the box. So that's where this comes in. And Ahana is also based on the open-source community. >> So the engineer's time is so valuable. >> Yeah. >> My take on it, in terms of the data engineer being the SRE: I think that can work. It depends on the actual person, and we definitely try to make the process as easy as possible. I think in Blinkit's case, there are data platform owners, but they definitely are aware of the pipelines. >> John: Yeah. >> So they have very intimate knowledge of what data engineers do, but in their case, they're managing a ton of systems, not just Presto. They have a ton of systems, and surfacing that interface so they can cater to all the data engineers across their data systems is, I think, the big need for them. I know you guys want to chime in; we've seen the architecture and things like that, and I think you guys did an amazing job there. >> So, adding to Wen's point, right: I generally think that what DevOps is to the tech team, the data engineers or data teams are to the data organization. They play a very similar role, in that you have to act as a guardrail to ensure that everyone has access to the data, so the democratization is there, but that has to also come with security, right? And when you do that, there are (indistinct) a lot of points where someone can interact with data. And again, there's a mix of open-source tools that works well for us, as well as some paid tools. So for visualization, we use Redash for our ad hoc analysis, and we use Tableau whenever we want to give very concise reporting. We have Jupyter notebooks in place, and we have EMRs as well. So we always have a mixed batch of things where people can interact with data. And most of our time is spent acting as that guardrail, ensuring that everyone has access to data but that it isn't exploited, right. And I think that's where we spend most of our time. >> Yeah. And the time is valuable, but to your point about the democratization aspect of it, there seems to be a bigger step-function value that you're enabling, and it needs to be talked about.
The 10x engineer, it's more like 50x, right? If you get it done right, the enablement downstream, at the scale that we're seeing with this new trend, is significant. It's not just, oh yeah, visualization and getting some data quicker; there are real advantages, on a multiple, with that engineering. And we saw that with DevOps, right? You do this right, and then magic happens on the edges. So, yeah, it's interesting. You guys, congratulations. Great environment. Thanks for sharing the insight, Blinkit. Wen, great to see you. Ahana, again with Presto, congratulations. Open source meets data engineering. Thanks so much. >> Thanks, John. >> Appreciate it. >> Okay. >> Thanks John. >> Thanks. >> Thanks for having us. >> This is season two, episode two of our ongoing series. This one is Data as Code. This is theCUBE. I'm John Furrier. Thanks for watching. (gentle music)

Published Date : Apr 1 2022


Compute Session 05


 

>> Thank you for joining us today for this session, entitled Deploy any Workload as a Service: When General Purpose Technology isn't Enough. This session will be on our HPE GreenLake platform. My name is Mark Seamans, and I'm a member of our GreenLake cloud services team. I'll be leading you through the material today, which will include both a slide presentation and an interactive demo, to give you some experience of how your initial interaction with our GreenLake system goes. So, let's go ahead and get started. One of the things that we've noticed over the last decade, and I'm sure that you have as well, has been the tremendous focus on accelerating business while concurrently trying to increase agility and reduce costs. One of the ways a lot of businesses have gone about doing that has been leveraging cloud-based technology, and in many cases, that's involved moving some of their workloads to the public cloud. That said, while organizations have been able to enjoy the cost control and agility associated with the public cloud, what we've seen is that the easy-to-move workloads have been moved, but a significant amount, as much as 70% in many cases, of the workloads that organizations run still remain on prem. And there are reasons for that. In some cases it's due to data privacy and security concerns. Other times it's due to latency, of really needing high-performance access to data. And other times, it's related to the interconnected nature of systems: you have a whole bunch of systems which form an overall experience, and they need to be located close together. So, one of the challenges that we've worked on with customers, and have developed our GreenLake solution to address, is this idea of achieving a cloud-like experience for all of your apps and data, in a way that combines the best of the public cloud with that same type of experience delivered on premises. As you think about some of the challenges customers are trying to address, the first is this idea of agility: being able to move quickly, to take a set of IT resources that you have and deploy them for different use cases and different models. One of the things we had a strong focus on as we built GreenLake is how to provide a common foundation, a common framework, to deliver that kind of agility. The next one is the term on the top right, called scale. One of the words you may hear regularly as cloud is discussed is this notion of elasticity, the ability to have something stretch and get larger on an on-demand basis. That's another challenge and premise that we've really tried to work through, and you'll see how we've addressed it. Now, obviously, you can achieve scale if you just put a ton of equipment in place, much more than you may need at any given time, but with that comes a lot of cost. So as you think about wanting an agile and flexible system, what you'd also like is something where the cost flexes as your needs grow, elastic in that it can get larger and then smaller again, as needed. We'll talk about how we do that with our GreenLake solution.
And then finally, it's complexity: trying to abstract away, for people, the need to be aware of all the complexity it takes to build these systems, and to provide a single interface, a single experience, for people to manage all of their IT assets. We do that through this solution called HPE GreenLake, and we call it the cloud that comes to you. What we're really trying to do here is take the notion of the cloud from being a place, the way people have thought about the public cloud, and turn it into the idea of the cloud as an experience. Regardless of whether it's in the public cloud, running on premises, or, as is the case with GreenLake, a mixture of those, maybe even a mixture of multiple public clouds with an on-prem experience, the cloud becomes something you experience and leverage, as opposed to a place where you have an account. That can include edge computing combined with co-location or data-center-based computing; it can include equipment in your own data center; and certainly it can include resources in the public cloud. So, let's take a look at how we go about delivering that experience, and what some of the benefits are as we put these solutions in place. As you think about why you'd want to do this, and the benefits you get from GreenLake, what we've seen, both in working with customers and in studies done with analysts, is that the benefits are numerous, but they come in the areas shown here. One is time to deployment. Once you get this flexible, easy-to-manage environment in place, with what we'll show you are prebuilt, pre-configured, managed-as-a-service solutions, your time to deployment for new workloads can shrink dramatically. Next, by having these pre-configured solutions and combining the hardware and software technology with a set of managed services through our GreenLake managed services team, you can dramatically reduce the risk of putting a new workload in place. For example, if you wanted to deploy virtual desktop infrastructure and maybe haven't done that in the past, you can leverage a GreenLake VDI solution, along with GreenLake management services, to put that solution in place very predictably and very reliably. You're up and running, focusing on the needs of your users, with dramatically lowered risk, because it was built on a pre-validated, pre-certified foundation. Obviously, I talked earlier about the idea that with GreenLake you have flexibility in scaling up your use of the resources, even though they're computers that may be in your data center or a colo, and also scaling back down. So if you have workloads that vary over time, maybe an end-of-month or end-of-quarter cycle where certain workloads get larger and then smaller again, GreenLake's consumption-based billing means your costs can flow as your use of the systems flows. I'll show you a screen in just a few minutes that illustrates what that looks like. And then the last piece is the single pane of glass for control and insight into what's going on, by which we mean not just what's going on from a cost perspective, but also from a system utilization perspective.
You'll see, in one of the screens I'll show, that there's a system utilization report for all of your GreenLake resources that you can view at any time. As an example of the visibility you get with storage capacity: as your storage is consumed over time and you generate more data, the system will tell you, hey, you're getting up to about 60, 70% utilized. At that point, we can work with you to automatically deploy additional storage capacity, even though you won't be paying for it yet, so it's ready as your needs grow. So what are some of the services we deliver as part of GreenLake? Well, they range, and you see here the portfolio of services we offer. If you start at the bottom, it's foundational things: compute as a service, and I'll show you examples of that today, networking as a service, hyper-converged infrastructure as a service. Working our way up the stack, we move from basic services to platform services, things like VMware and containers as a service. And at the top layer, we can offer complete solutions for targeted workloads. So if your need was, for example, to run machine learning and AI, and you wanted a complete environment put in place that you could consume on an as-a-service, consumption basis, we've got our MLOps solution that delivers that. Similarly, as I mentioned earlier, there's VDI for virtual desktops, or a solution for SAP HANA. So the solutions range from basic compute at the foundation all the way up to complete workload solutions, and the portfolio is expanding all the time. As you'll see, you can go out to our hpe.com site and see a complete catalog of all the GreenLake services that are available. So let's take a minute and drill in on that MLOps solution, and look at how it fits together and what makes it up. If you think about GreenLake for MLOps, it's a fast path for data scientists, oriented around the needs of the data scientists in your organization who want to get in and start analyzing data for advantage in your business. What comes with an MLOps solution from GreenLake starts, at the left side of the slide here, with a fully curated hardware platform: GPU-based nodes, data-science-optimized hardware, and all the storage you're going to need to run these workloads at scale and with performance. So one piece of it is a curated hardware stack for machine learning. Next, in the software component, we've pre-validated a whole set of the common stack elements you would need: beyond operating systems, tools for continuous integration, and things like TensorFlow and Jupyter notebooks are already pre-validated and delivered with the solution. So the tools your data scientists will need come with this, ready to go, out of the box. And then finally, as the solution gets delivered, there's a services component beyond us just installing the full thing and delivering a complete solution to you.
There are GreenLake management services options where our services teams can work side by side with your data scientists, to assist them in getting up to speed on the solution, leveraging the tools, and understanding best practices, if you want that assistance for deploying MLOps. And the whole thing is delivered as a service. We have similar solutions for other workloads, like SAP HANA, that leverage different compute building blocks, but always as workload-optimized, best-practice solutions built up through that stack. So your experience in consuming this is always consistent, but what's running under the hood isn't just a generic solution like you might see in, for example, a public cloud environment; it's a best-practice, hardware-optimized, software-optimized environment built for each one of the workloads we can deploy. What I'd like to do at this point is show you the process for specifying a GreenLake solution, and we'll take compute as our example today. So, what I've got here is a browser experience. I'm just in my web browser, on the hpe.com website, in the GreenLake section, and I've clicked on this services menu and I'm going to go ahead and scroll down. One of the things you can see here is that catalog of GreenLake services that I referenced. Just as we showed you on the slide, this is the catalog of services you can consume. I'm going to go to compute, and we'll go about quoting a GreenLake compute solution. We see, when I click on that, that one of the options I have is to get a price in my inbox. I'll click on that to go into our GreenLake quick-quote environment, where, for our demonstration, I'll specify that I'd like to add some additional general compute capability to my GreenLake environment, for workloads I might like to run. If I click on this, I go in, and you notice that I'm not going to specify server types; I'm really going to tell the system about the types of workloads I'd like to run and the characteristics of those workloads. So for example, my workload choices would be adaptable performance, or maybe densely optimized compute for highly scalable, high-performance computing requirements. I'll select adaptable performance. I have a choice of processor types; in my case, I'll pick Intel. I then say how many servers would be part of the solution for the workloads I want to run; in my case, maybe we'll quote a 20-server configuration. Now, as we look at the plans here, you can see the different options, with a balanced performance-and-price option as the recommended one. But if I knew that the workloads I was going to run were more performance-sensitive, I could simply click on the performance-optimized option, and the system under the hood does all the work to reconfigure the solution. I'm not having to pick individual server options, as you see. So once I've picked between cost-optimized, balanced, or performance, I can go in and select the rest of the options.
Now, we'll start at the top right. You see here, from a services perspective, this is where I specify how much services content and assistance I'd like, all the way from just proactive metering of my solution, through actual workload deployment and assistance, with me physically managing the equipment myself. The other piece I'll focus on is this variable usage. This comes back to how much variable capacity, additional capacity, I'd like to have available in my data center for this solution. So if I know my needs could flex larger in the future, and I want to flex capacity up and down, I might pick a slightly larger amount of flex capacity at my location as part of this solution. With that, I'd select the workload, and the last step would be to click on get price; this whole thing will be packaged up and sent to you, with the price of the solution and any other details you might like to see. I encourage you to go out to hpe.com and go through this process yourself for one of the workloads that might be of interest to you, to get a flavor of that experience. So, moving forward: once you've deployed your GreenLake solution, one of the things you see here is that single-pane-of-glass experience for managing the system. We've got a single panel that, all in one place, provides access to your cost information for billing and what's driving that billing; in the top center, you can see information on capacity planning; and then we can drill in and look at additional things, like the services we offer around continuous compliance, capacity planning data showing how things like storage are filling up, and cost control information, with recommendations for how you could reduce or minimize your costs based on your usage profile. So, all of this is a fully integrated experience that can span components running on premises and also incorporate services in the public cloud. Now, when we think about who's using this, and why it's becoming attractive: you can imagine, just looking at this capability, that this ability to blend public cloud capabilities with on-premises or co-location private data center capabilities provides tremendous power and tremendous flexibility for users. And so we're seeing this adopted broadly, as a new way people are looking to take the advantages of the cloud and bring them into a much more self-managed, on-premises experience. Some example customers here include deployments in the automotive field, both at Porsche and, over on the right, at Zenseact, the autonomous driving division of Volvo, where they're doing research with tremendous amounts of data to produce the best possible autonomous driving experience. And in the center, Danfoss, one of the world's leading manufacturers of both electric and hydraulic control components. As they produce components that drive optimized management of physical infrastructure, power, liquids, and cooling, they're leveraging GreenLake for the same type of control and best-practice deployment of their data centers and their IT infrastructure.
So again, these are companies innovating in their own worlds, taking advantage of compute innovations to get the benefits of the cloud and the flexibility of a cloud-like environment, but running within their own premises. And it's not just those three customers, clearly. What we're seeing, as you see on the slide, is a unique solution in the market today. It provides the true benefits of the cloud, but within your own on-premises experience; it provides expertise, in terms of services, to help you take best advantage of it; and if you look at the adoption by customers, over a thousand customers in 50 countries have now deployed GreenLake-based solutions as the foundation on which they're building their next-generation IT architecture. So there are a lot of unique capabilities built into GreenLake that make this a single pane of glass and a very unified, elegant experience. As we wrap up, there are three things I want to call your attention to. One is GreenLake, which we've focused on today. I'd also like to call your attention to the Pointnext services, which are an extension of those GreenLake services I talked about earlier; there's a much broader portfolio of what Pointnext can do in delivering value for your organization. And then there's HPE Financial Services, who, much like what we do with GreenLake's as-a-service consumption environment, can provide a lot of financial flexibility in other models and other use cases. So I'd encourage you to take time to learn about each of those three areas. There are also many, many resources available online, some of which are listed here, but as the single takeaway from this slide, I encourage you to go to hpe.com. If you're interested in GreenLake, click on our GreenLake icon, and you can take yourself through that quoting experience for whatever would be interesting to you. And certainly, for our compute solutions, there's a tremendous amount of information about the leading solutions that HPE brings to market. So with that, I hope this has been an informative session. Thank you for spending a little bit of time with us today, and hopefully you'll take some time to learn more about GreenLake and how it might benefit your organization. Thanks again.

Published Date : Apr 9 2021


Clive Charlton and Aditya Agrawal | AWS Public Sector Summit Online


 

(upbeat music) >> Narrator: From around the globe, it's theCUBE, with digital coverage of AWS Public Sector Online, (upbeat music) brought to you by Amazon Web Services. >> Everyone, welcome back to theCUBE's virtual coverage of the AWS Public Sector Summit Online. I'm John Furrier, your host of theCUBE. Normally we're in person, out in Asia-Pacific and at all the different events related to public sector, but this year we have to do it remote, so we're doing the remote virtual CUBE for the Public Sector Online Summit. And we have two great guests here to talk about the Digital Earth Africa project: Clive Charlton, Head of Solutions Architecture, Sub-Saharan Africa at AWS. Clive, thanks for coming on. And Aditya Agrawal, founder of D4DInsights and also advisor to the Digital Earth Africa project with AWS. So gentlemen, thank you for coming on. Appreciate you coming on remotely. >> Thanks for having us. >> Thank you for having us, John. >> So Clive, take us through it real quickly. Just take a minute to describe: what is the Digital Earth Africa project? What are the problems that you're aiming to solve? >> Well, we're really aiming to provide actionable data to governments and organizations around Africa, by providing satellite imagery in an easy-to-use format, and doing that on the cloud, in a way that serves countries throughout Africa. >> And just from a cloud perspective, give us a quick taste of what's going on with the tech. It's on Amazon; you've got a little satellite action. Is there ground station involved? Give us a little more color around the scope of the project. >> Yeah. So, historically speaking, you'd have to process satellite imagery, downlink it, and then do some heavy, heavy lifting around the processing of the data. Digital Earth Africa was built from the experiences of Digital Earth Australia, originally developed by Geoscience Australia. They use the Elastic Kubernetes Service, the container service for Kubernetes, to spin up the virtual machines required to process the raw satellite imagery into a format called a Cloud Optimized GeoTIFF. This format is used to store very large volumes of data in a way that's really easy to query: organizations can just use an HTTP GET range request to query the part of the file that they're interested in, which means the results are served much, much quicker, for a much, much better overall experience. Under the hood, the data is stored in the Amazon Simple Storage Service, which is S3, with the metadata index in a Relational Database Service that runs the Open Data Cube library, which allows Digital Earth Africa to store this data in both space and time.
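As a hedged illustration of that range-request access pattern, here is a small Python sketch using rasterio (which wraps GDAL) to read one window of a Cloud Optimized GeoTIFF straight out of S3. The bucket and object key are hypothetical placeholders, not actual Digital Earth Africa paths.

import rasterio
from rasterio.windows import Window

# Hypothetical COG in a public S3 bucket; only the header and the bytes
# covering the requested window are fetched via HTTP range requests.
# The full scene is never downloaded.
url = "s3://example-deafrica-bucket/landsat/scene_red_band.tif"

with rasterio.Env(AWS_NO_SIGN_REQUEST="YES"):   # public data, no credentials
    with rasterio.open(url) as src:
        window = Window(col_off=0, row_off=0, width=512, height=512)
        block = src.read(1, window=window)      # band 1, 512x512 pixels
        print(src.crs, block.shape, block.mean())

This is why the COG layout matters: the internal tiling and overviews let a query touch kilobytes of a multi-gigabyte file.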
>> It's interesting. I just did some interviews last week at a symposium on space and cybersecurity, and we were talking about the impact of satellites and GPS, and the overall infrastructure shift. It's just another part of the edge of the network. Aditya, I want to get your thoughts on this, and your reaction to Digital Earth, because you're an advisor. Let's zoom out: what's the impact on people's lives? Give us a quick overview of how you see it playing out, because, explaining it to someone who doesn't know anything about the project: what is it about, and how does it actually impact people? >> Sure. So, as Clive mentioned, there's definitely a digital infrastructure behind Digital Earth Africa, in the way that it's able to serve free and open satellite data. And often the issue around satellite data, especially within the context of Africa and other parts of the world, is that there's a level of capacity required in order to be able to use that data. But there are also all kinds of access issues, because, traditionally, satellite data is heavy. There's the old model of downloading the data and then doing something with it, and often about 80% of the time you spend on satellite data is spent just preprocessing it, before you can do any of the fun analysis around it, the kind that really drives the decisions and actions that you're looking for. And so that's why Digital Earth Africa, and that's why this partnership with Amazon, is a fantastic partnership: it really allows us to scale the approach across the entire continent, make it easy for that data to be accessed, and make it easier for people to use that data. The way Digital Earth Africa is being operationalized is that we're not just looking at it from the perspective of, let's put another piece of infrastructure into Africa. We want this program, and it is a program, to be institutionalized within Africa itself: one that leverages expertise across the continent, and one that brings in organizations across the continent to really take leadership and ownership of this program as it moves forward. The idea is that, once you have this information, you can address issues like food security, climate change, coastal resilience, land degradation, where illegal mining is, where the water is. We want to do that in a way that really looks at the national development priorities within the countries themselves, and at how it then supports regional and global frameworks, like Africa's Agenda 2063 and the Sustainable Development Goals. >> No doubt in my mind, obviously, there are huge benefits to these kinds of technologies. I also want to ask you, as a follow-up: there's a huge space race going on right now, and an explosion in the availability of satellite data. More satellites going up, more congestion, more contention. Again, we had a big event on that, on cybersecurity and the congestion issue. But satellite data powers everything; everyone here in the United States, you want an Uber, you want Google Maps, you've got GPS everywhere. Without it, we'd be kind of like (laughing) wondering what's going on. How do we even vote these days? So certainly an impact, but there's a huge surge in the availability and use of satellite data. How do you explain this, and what are some of the challenges, from the data side, that the Digital Earth Africa project hopes to resolve? >> Sure. I mean, that's a great question. I think, at one level, when you're looking at the space race right now, satellites are becoming cheaper. They're becoming more efficient. There's increased technology now in the types of sensors that you can deploy.
There are companies like Planet that are really revolutionizing how even small countries are able to deploy their own satellites, and with the constellations they're putting forward, the frequency with which you're able to get data for any given part of the earth is now daily. Coupled with that, and this is really Clive's purview, the cloud computing capabilities and overall computing power you have today versus 10 or 15 years ago are vastly different. What used to take weeks, for any kind of analysis on satellite data, which is heavy data, now takes minutes or hours. So when you put all that together, I think it really speaks to the power of this partnership with Amazon, and what that means for how this data is going to be delivered to Africa, because it really allows for scalability for anything that happens through Digital Earth Africa. And so, for example, one of the approaches we're taking is that we identify what the priorities and needs are at the country level. Let's say it's land degradation; there are often common issues across countries. So we can take one particular issue, test it with additional countries, and then scale it across the whole continent, because the infrastructure is there for the whole continent. >> Yeah, that's a great point. So many storylines here. We'll get to Clive in a second on sustainability, and I want to talk about the Open Data Platform. Obviously, open data: having data is one thing, but trained data, and having more trusted data, becomes a huge issue. I want to dig into that in a second, but, Clive, I want to ask you first: what region are we in? First of all, we've been covering the region expansion from Bahrain all the way around the world, probably soon into space; there will probably be an Amazon space station region someday in the future. But what region are you running the project out of, and why is it important? Can you share the update on the regional piece? >> Well, we're very pleased that Digital Earth Africa is using the new Africa region in Cape Town, in South Africa, which launched in April of this year. It's one of 24 regions around the world, and we have another three new regions announced. What this means for users of Digital Earth Africa is that they're able to use the region closest to them, which gives them the best user experience; it's the quickest connection for them. But more importantly, we also wanted to use an African solution for African people, and using the Africa region in Cape Town really aligned with that thinking. >> So, localization of the data, latency, all that stuff is kind of within the region, within country here. Right? >> That's right. Yeah. >> And why is that important? Are there any other benefits? Why should someone care? Obviously there's the failover option, I mean, other regions to go to, but why is having something in that region important for this project? >> Well, it comes down to latency for the users. Being as close to the data as possible is really important for the user experience, especially when you're looking at large data sets and big queries. You don't want to be waiting through a long lag time for that query to go backwards and forwards between the user and the region.
So, having the data in the Africa region in Cape Town is important. >> So it's about the region. I love when these new regions roll out from Amazon, because obviously it's a huge CapEx build-up, huge data centers, servers and everything. Sustainability is a huge part of the story. How does the sustainability piece fit into the data initiative supported in Africa? Can you share some updates on that? >> Well, this project is also closely aligned with the Amazon Sustainability Data Initiative, which looks to accelerate sustainability research and innovation, really by minimizing the cost and the time required to acquire and analyze large sustainability datasets. The initiative supports innovators and researchers with the data, tools, and technical experience they need to move sustainability to the next level. These are public datasets, available to anyone. In addition to that, the initiative provides cloud grants to those who are interested in exploring the use of AWS technology and scalable infrastructure to serve sustainability challenges of this nature. >> Aditya, I want to hear your thoughts on this comment that Clive made around latency; certainly having a region there has great benefits, and you don't need to hop on that, everyone knows I'm a big fan of the regional model. But it brings up the issue of what's going on in the country from an infrastructure standpoint: a lot of mobility, a lot of edge computing. I can almost imagine that. So how do you see that evolving, from a business standpoint, a project standpoint, a data standpoint? Can you comment and react to that edge angle? >> Yeah. I think the value of an open data infrastructure is that you want to use that infrastructure to create a whole data-ecosystem kind of approach. Beyond making this data readily and efficiently accessible, we really want to bring industry into that ecosystem, because what we really want, as the program matures, is for it to instigate the development of new businesses and entrepreneurship, to really get the young people across Africa, which has the largest proportion of young people anywhere in the world, engaged around what you can do with satellite data and the types of businesses that can be developed around it. And so, by having all of our data reside in Cape Town, on the continent, there are obviously technical benefits in terms of applying the data and creating new businesses. There's also the perception, and the fact, that the data Digital Earth Africa is serving is in Africa and residing in Africa, which does go a long way. >> Yeah, and that's a huge value. And I can just imagine the creativity cloud. If you can comment on this open data platform idea, because some of the commentary we've been having on theCUBE here, and all around the world, is: data's great. We all know we're living with a lot of data, and you're starting to see the commoditization and horizontal scalability of data. But to put it into software-defined environments, whether it's an entrepreneur coding up an app or doing something to share some transparency around initiatives going on within the region or on the continent, it's about trusted data. It's about sharing algorithms. AI is also a consumer of data; machines consume data.
So it's not just the technology; data is part of this new normal. What's this Open Data Platform, and how does that translate into value, in your opinion? >> Yeah. When data is shared on AWS, anyone can analyze it and build services on top of it, using a broad range of compute and data analytics products, from things like Amazon EC2, or Lambda, which is serverless compute, to things like Amazon Elastic MapReduce for complex extract and transformation processes. Sharing data in the cloud lets users spend more time on the data analysis, rather than the data acquisition. And researchers can analyze data shared on AWS without needing to pay to store their own copy, which is what the Open Data Platform provides. You only have to pay for the compute that you use, and you don't need to purchase storage to start a new project. The Registry of Open Data on AWS makes it easy to find those datasets, by making them publicly available through AWS services. And when you share your data on AWS, you make it available to a large and growing community of developers, startups, and enterprises all around the world. And, you know, we've been talking particularly around Africa. >> Yeah. So it's an open-source model, basically; it's free. It doesn't cost you anything to get started; maybe down the road, if usage gets heavy, you move to paying, but for the most part it's easy for scientists to use, and then you're leveraging it into the open and contributing back. Is that right? >> Yep, that's right. To me it's getting researchers, startups, and organizations going quickly. Without having to worry about the data acquisition, they can just get going and start building.
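To make the "no copy, pay only for compute" point concrete, here is a hedged sketch of browsing a public open-data bucket anonymously with boto3. The bucket name and prefix are hypothetical stand-ins for a Registry of Open Data entry.

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned (anonymous) client: public open-data buckets need no AWS account.
s3 = boto3.client(
    "s3",
    region_name="af-south-1",                  # the Cape Town region
    config=Config(signature_version=UNSIGNED),
)

# Hypothetical bucket and prefix; real names come from the Registry of Open Data.
resp = s3.list_objects_v2(
    Bucket="example-deafrica-open-data",
    Prefix="landsat/2020/",
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])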
>> I want to get back to Aditya on this skill gap issue, because you brought up something that I thought was really cool. People are going to start building apps; we're going to start to see more innovation. What are the needs out there? Because we're seeing a huge onboarding of new talent, young talent, people reskilling from existing jobs, certainly accelerated by COVID, people looking for different kinds of work. I'm sure there's a lot of (laughing) demand to do some innovative things. The question I always get, and I want your reaction: what are the skills needed to get involved, both to contribute and to benefit from it, whether it's the satellite data or just how to get involved, skill-wise? >> Sure. >> Yes. >> Yeah. So most recently we've created a six-week training course that really takes users from understanding the basics of Earth observation data, to how to work with Python, to how to create their own Jupyter notebooks and their own use cases. And there's a wide range of skill sets required, depending on who you are, because effectively what we want to do is get everyone, from the technical user that might have some remote sensing background, to the developer, to the policy maker and decision maker, to understand the value of this infrastructure, whether you're the one who's actually analyzing the data, the one developing new applications, or the one taking that information from a managerial or policy-level discussion to actually deliver the action and impact that you're looking for. And so, in that regard, we're working with ITC in the Netherlands, and again with institutions across Africa that already have a mandate and expertise in this particular area, to create a holistic capacity development program that will address all of those different factors. >> So I guess the follow-up question I have is: how do you ensure the priorities of Africa are addressed as part of this program? >> Yeah, so we've created a governance model that is both top-down and bottom-up. At the bottom-up level, we have a technical advisory committee with over 15 institutions, many of them based across Africa, that really have a good understanding of the needs, the priorities, and the mandate for how to work with countries. And at the top-down level, we're developing a governing board that will be inclusive of the key continental-level institutions, which provide the political buy-in, the sustainability of the program, and the overall guidance. Within that, we're also creating an operational model such that the institutions that have the capacity to support the program are actually the ones supporting the implementation of the program itself. >> And there have been some United Nations sustainable development projects, all kinds of government involvement, around making sure certain things happen within the countries. Can you share some of the highlights, or some of the key initiatives that are going on that you're supporting, to make it a better world? >> Yeah. This program is very closely aligned to the sustainable development agenda, so we're developing methods that address the Sustainable Development Goals as one facet. In Africa, there's another program looking at overall national development priorities and sustainability, called Agenda 2063. And really, I think what it comes down to is that this wouldn't be happening without the country-level involvement itself. This started with five countries originally, Senegal, Ghana, Kenya, and Tanzania among them, and the government of Kenya itself has really been a kind of founding partner for how Digital Earth Africa, and its predecessor, the Africa Regional Data Cube, came to be. So without high-level support and political buy-in within those governments, we wouldn't be where we are; it's really because of that.
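For a flavor of what those training notebooks build toward, here is a hedged Open Data Cube sketch of the kind of analysis Digital Earth Africa enables. It assumes a configured Open Data Cube environment (such as a hosted sandbox) whose index points at the S3-backed data described earlier; the product name, extents, and band names are illustrative, and real ones would come from dc.list_products().

import datacube

dc = datacube.Datacube(app="deafrica-ndvi-example")

# Load a small spatio-temporal cube; the names below are placeholders.
ds = dc.load(
    product="ls8_sr",                    # hypothetical Landsat 8 product name
    x=(36.70, 36.90),                    # longitude range (around Nairobi)
    y=(-1.40, -1.20),                    # latitude range
    time=("2020-01-01", "2020-03-31"),
    measurements=["red", "nir"],
    output_crs="EPSG:6933",
    resolution=(-30, 30),
)

# A simple vegetation index (NDVI) over the loaded cube, via xarray.
ndvi = (ds.nir - ds.red) / (ds.nir + ds.red)
print(ndvi.mean(dim=["x", "y"]).values)  # one mean NDVI value per time step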
>> Aditya, thank you for coming on and sharing that insight. Clive, we'll give you the final word for the folks watching. Digital Earth Africa processes petabytes of data, I mean satellite data as well, huge. You mentioned it's a new region. You're running Kubernetes, the Elastic Kubernetes Service, making containers easy to use, pay as you go. So you get cutting edge. Take a minute to share why this region is cutting edge. Does it have the scale of other regions? What should people know about AWS in Cape Town, Africa's new region? Take a minute to put a plug in. >> Yeah, thank you for that, John. So all regions are built in the same way, all around the world. They're built for redundancy and reliability. They typically have a minimum of three of what we call Availability Zones, and each one contains a cluster of data centers, all interconnected with fast fiber, so you can survive a failure with no impact to your services. And the Cape Town region is built in exactly the same way. We have most of the services available in the Cape Town region, like most other regions. So, as a user of AWS, you can have the confidence that you can deploy your services and workloads into AWS and run them in the same way, with the same kind of speed, and the same kind of support and infrastructure that's backing any region anywhere else in the world. >> Well, great. Thanks for that plug. Aditya, thank you for your insight. And again, innovation follows cloud computing, whether you're building on top of it as a startup, a government, or an enterprise, or bettering society, as in this case with the Digital Earth Africa project. A great story. Thank you for sharing; I appreciate it. >> Thank you for having us. >> Thank you for having us, John. >> I'm John Furrier with theCUBE, virtual and remote, not in person this year. I hope to see you next time in person. Thanks for watching. (upbeat music) (upbeat music decreases)

Published Date : Oct 20 2020


Model Management and Data Preparation


 

>> Sue: Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled Machine Learning with Vertica: Data Preparation and Model Management. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Waqas Dhillon. He's part of the Vertica Product Management Team at Vertica. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait; just type your question or comment in the question box below the slides and click submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternately, you can visit the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides, and yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So, let's get started. Waqas, over to you.
>> Waqas: Thank you, Sue. Hi, everyone. My name is Waqas Dhillon and I'm a Product Manager here at Vertica. Today we're going to go through data preparation and model management in Vertica. The session starts with some introduction and goes through some of the considerations for doing machine learning at scale. After that, we have two main sections. The first one is on data preparation: we'll cover what data preparation is and what the Vertica functions for data exploration and data preparation are, and then share an example with you. Similarly, in the second part of this talk we'll go through exporting and importing models using PMML and how that works with Vertica, and we'll share examples there as well.
So yeah, let's dive right in. Vertica is essentially an open architecture with a rich ecosystem. You have a lot of options for data transformation and for ingesting data from different tools, and you also have options for connecting through ODBC, JDBC, and some other connectors to BI and visualization tools; there are a lot of them that Vertica connects to. In the middle sits Vertica itself, which you can run over external tables or with in-place analytics, in the cloud or on-prem, so that choice is yours. Essentially, it offers you a lot of options for performing your data analytics at scale, and within that, machine learning is a core component, with a lot of options and functions available.
Now, machine learning in Vertica is actually built on top of the distributed architecture that Vertica offers for data analytics, so it inherits a lot of those capabilities and builds on top of them: you eliminate the overhead of data transfer when you're working with Vertica machine learning, you keep your data secure, and storing and managing the models is really easy and much more efficient.
You can serve a lot of concurrent users all at the same time, and it's really scalable and avoids the maintenance cost of a separate system, so essentially there are a lot of benefits here. One important thing to mention is that all the algorithms that you see, whether they're analytics functions, advanced analytics functions, or machine learning functions, are distributed across the cluster, and not just across different nodes: each node gets a share of the workload, and on each node there might be multiple threads and multiple processes running for each of these functions. So, it's a highly distributed solution and one of a kind in this space.
When we talk about Vertica machine learning, it covers the whole machine learning process, and we see it as something starting with data ingestion, data analysis, and understanding, going through the steps of data preparation, modeling, evaluation, and finally deployment as well. So when you're using Vertica for machine learning, it takes care of all these steps, and you can do all of that inside the Vertica database.
Looking at the three main pillars that Vertica machine learning aims to build on, the first is to have Vertica as a platform for high-performance machine learning. We have a lot of functions for data exploration and preparation, and we'll go through some of them here. We have distributed in-database algorithms for model training and prediction, we have scalable functions for model evaluation, and finally we have distributed scoring functions as well.
Doing all of this in the database is a really good thing, but we don't want it isolated in this space. We understand that a lot of our customers, our users, like to work with other tools alongside Vertica. They might use Vertica for data prep and another tool for model training, or use Vertica for model training and take those models out to other tools and do prediction there. So, integration is a really important part of our overall offering, and it's a pretty flexible system. We have been offering UDx in four languages, which a lot of people have found useful over the past few years, but the new capability of importing PMML models for in-database scoring, and exporting Vertica native models for external scoring, is something we have recently added. Another talk goes through the TensorFlow integration, a really exciting and important milestone, where you can bring TensorFlow models into Vertica for in-database scoring. For this talk, we'll focus on data exploration and preparation, importing PMML models, and exporting PMML models. And finally, since Vertica is not just a query engine but also a data store, we have a lot of really good capability for model storage and management as well.
So, yeah, let's dive into the first part, on machine learning at scale. When we say machine learning at scale, we have a few really important considerations, and each has its own implications. The first is that we want speed, but we also want it to come at a reasonable cost, so it's really important to pick the right scaling architecture. Secondly, it's not easy to move big data around.
It might be easy to do that with a smaller data set, an Excel sheet, or something of the like, but once you're talking about big data and analytics at really big scale, it's not easy to move that data from one tool to another. So what you want to do is bring the models to the data instead of having to move the data to the tools. And the third thing is that sub-sampling can actually compromise your accuracy. A lot of tools out there still force you to take smaller samples of your data because they can only handle so much, but that can impact your accuracy, and the need here is to be able to work with all of your data. Let's go through each of these really quickly.
So, the first factor is scalability. If you want to scale your architecture, you have two main options. The first is vertical scaling: you have a machine, a server essentially, and you can keep adding resources like RAM and CPU, increasing the performance as well as the capacity of the system. But there's a limit to what you can do here, and you can hit that limit in terms of cost as well as in terms of technology; beyond a certain point, you will not be able to scale more. So, the right solution is actually horizontal scaling, in which you keep adding more instances to get more computing power and more capacity. Essentially, what you get with this architecture is a supercomputer that stitches together several nodes, with the workload distributed across each of those nodes for massively parallel processing and really fast speeds as well.
The second aspect, the difficulty of moving big data around, can be clarified with this example. What usually happens, and this is a simplified version, is that you have a lot of applications and tools from which you might be collecting data, and this data goes into an analytics database. That database might in turn be connected to some BI tools, dashboards, and applications, with some ad-hoc queries being run against it as well. Then you want to do machine learning in this architecture. What usually happens is that the data that came into the analytics database gets exported out to the machine learning tools. You train your models there, and afterwards, when you have new incoming data, that data again goes out to the machine learning tools for prediction, and the results you get from those tools usually end up back in the analytics database, because you want to put them on a dashboard or power up some applications with them. So there's a lot of data movement overhead involved here, and there are downsides to that, including data governance, data movement, and other complications that you need to resolve.
One possible solution to overcome that difficulty is to have machine learning as part of the distributed analytical database as well, so you get the benefit of applying it to all of the data inside the database without having to care about all of the data movement. But if there are use cases where it still makes sense to at least train the models outside, you can do your data preparation inside the database, take the prepared data out, build your model, and then bring the model back to the analytics database. In this case, we'll talk about Vertica.
So, the model would be archived and hosted by Vertica, and you can keep applying predictions to the new data coming into the database.
The third consideration for machine learning at scale is sampling versus the full data set. As I mentioned, a lot of tools cannot handle big data and you are forced to sub-sample. But as you can see in the leftmost figure, figure A, if you have a single data point, essentially any model can explain it. If you have more data points, as in figure B, a smaller number of models can explain them, and in figure C, with even more data points, fewer models still; but fewer here also means that those models will probably be more accurate. The objective of building machine learning models is mostly to have prediction and generalization capability on unseen data, so a model that's accurate on one data point might not generalize well at all. The conventional wisdom in machine learning is that the more data points you have for learning, the better and more accurate the models you'll get. So, you need to pick a tool that can handle all of your data and does not force you to sub-sample it, and with all the data, even a simpler model might be much better than a more complex one.
So, let's go to the data exploration and data preparation part. Vertica is a really powerful tool that offers a lot of scalability in this space, and as I mentioned, it supports the whole process. You can define the problem, gather your data and construct your data set inside Vertica, and then continue through data preparation, model training, deployment, and managing the model. Data preparation is a really critical step in the overall machine learning process; some estimate it takes between 60 to 80% of the overall effort. There are a lot of functions you can use as part of Vertica: data exploration, de-duplication, outlier detection, balancing, normalization, and potentially a lot more. You can go to the Vertica documentation and find them there. Within data prep, we divide them into two parts: exploration functions and transformation functions. For exploration, you have a rich set of functions that you can use in-database, and if you want to build your own, you can use UDx to do that. Similarly, for transformation there are a lot of functions around time series, pattern matching, and outlier detection that you can use to transform the data, and this is just a snapshot of some of the functions available in Vertica right now. And again, the good thing about these functions is not just their presence in the database; it's their ability to scale to really, really large data sets and compute those results in an acceptable amount of time, which makes the machine learning process practical.
So, let's go to an example and see how we can use some of these functions. As I mentioned, there are a whole lot of them and we won't be able to go through all of them, but just for our understanding we can go through some and see how they work. We have here a sample data set of network flows. It contains attacks from some source nodes, and then there are some victim nodes on which these attacks are happening.
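As a flavor of the transformation functions named above, here is a hedged Vertica SQL sketch; the demo itself doesn't run these particular calls, and the table, view, and column names are assumptions for illustration:

    -- Flag outliers in assumed numeric columns using a robust z-score test.
    SELECT DETECT_OUTLIERS('flow_outliers', 'unique_flows',
                           'duration, src_port, dst_port',
                           'robust_zscore' USING PARAMETERS outlier_threshold=3.0);

    -- Rescale the same columns to [0, 1] with min-max normalization.
    SELECT NORMALIZE('normalized_flows', 'unique_flows',
                     'duration, src_port, dst_port', 'minmax');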
So yeah, let's look at the data here real quick. We'll load the data, browse the data, compute some statistics around it, ask some questions, make plots, and then clean the data. The objective here is not to make a prediction per se, which is what we mostly do with machine learning algorithms, but to go through the data prep process and see how easy it is to do that with Vertica and what kind of options are there to help you through it.
The first step is loading the data. Since in this case we know the structure of the data, we create a table with the column names and data types. But let's say you have a data set for which you do not already know the structure: there's a really cool feature in Vertica called flex tables, and you can use it to initially import the data into the database and then go through all of the variables and assign them types. You can also use that if your data is dynamic and changing: load the data first and then create these definitions.
So, we load the data into the database. It's one week of data out of the whole data set for now. Once that's done, we look at the flows just to see how the data looks, and when we do a select star from flows with a limit, we see that there's already some data duplication, and by duplication I mean rows that have exactly the same data in each of the columns. So, as part of the cleaning process, the first thing we want to do is remove that duplication. We create a table with the distinct flows, and you can see here we have about a million flows that are unique.
So, moving on. This is essentially timestamped data spanning days of the week, so we want to look at the trends in it. The network traffic that's there, you can call it flows: based on the hour of the day, how does the traffic move, and how does it differ from one day to another? It's part of the exploration process; there might be a lot of further exploration you want to do, but we can start with this and see how it goes. You can see in the graph here that we have seven days of data, and the weekend traffic, which is in pink and purple, seems a little different from the rest of the days. Pretty close to each other, but yeah, definitely something we can look into and see whether there's some real difference we want to explore further. But this is just data for one week, as I mentioned. What if we loaded data for 70 days? You'd probably have a longer graph, but with so many lines you would not really be able to make sense of that data. It would be a really crowded plot, so we have to come up with a better way to explore that, and we'll come back to it in a little bit.
So, what are some other things we can do? We can get some statistics; we can take one sample flow and look at some of the values. We see that the forward column and the ToS column have zero values, and when we explore further, we see that there are a lot of records for which these columns are zero, so they're probably not really helpful for our use case. Then, we can look at the flow end.
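A rough sketch of the loading and de-duplication steps above in Vertica SQL follows; the table name, column names, and file path are assumptions, not the exact ones from the demo:

    -- Create a table for the flow data (columns are assumed for illustration).
    CREATE TABLE flows (
        flow_start TIMESTAMP,
        flow_end   TIMESTAMP,
        duration   FLOAT,
        src_addr   VARCHAR(64),
        dst_addr   VARCHAR(64),
        src_port   INTEGER,
        dst_port   INTEGER,
        protocol   VARCHAR(16),
        label      VARCHAR(64)
    );

    -- If the structure were unknown up front, a flex table could ingest it first:
    --   CREATE FLEX TABLE raw_flows();
    --   COPY raw_flows FROM '/data/flows_week1.csv' PARSER fcsvparser();

    -- Bulk-load one week of data.
    COPY flows FROM '/data/flows_week1.csv' DELIMITER ',' DIRECT;

    -- Browse a few rows.
    SELECT * FROM flows LIMIT 10;

    -- De-duplicate: keep only rows that are distinct across every column.
    CREATE TABLE unique_flows AS SELECT DISTINCT * FROM flows;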
Flow end is the time when the last packet in a flow was sent, and you can do a select of the min and max flow end to see when the data starts and ends; you can see it's about one week of data, from the first until the eighth.
Now, we also want to check whether the data is balanced or not, because balanced data is really important for a lot of the classification use cases we might try with this. Looking at source address, destination address, source port, and destination port, you see the data is highly imbalanced, for the source as well as the destination address space, so that's probably something we need to address. There are really powerful balancing functions that you can use within Vertica, with under-sampling, over-sampling, or hybrid sampling, and those can be really useful here.
Another thing we can look at is the statistics of these columns. On the unique flows table that we created, we just use the SUMMARIZE_NUMCOL function in Vertica, and it gives us a lot of really useful summary and percentile information. Now, if we look at the duration, which is the last record here, we can see that the mean is about 4.6 seconds, but when we look at the percentile information, we see that the median is about 0.27. So, there are a lot of short flows with a duration of less than 0.27 seconds. Yes, there are longer flows that pull the mean up to the 4.6 value, but the number of short flows is pretty high.
We can ask some other questions of the data about the features. We can look at the protocols and their counts; we see that most of the traffic is TCP and UDP, which is sort of expected for a data set like this. Then we want to see what the most popular network services are. Again, a simple query here: select destination port and count, and we get the destination port and count for each. We can see that most of the traffic is web traffic, HTTP and HTTPS, followed by domain name resolution.
So, let's explore some more. We can look at the label distribution. The labels come with the data set, because this is data for which we already know whether a record was an anomaly or not, and we'd create our algorithm based on that. We see that there's this background label with a lot of records, and then anomaly spam seems to be really high; there are anomaly UDP scans and SSH scans as well. Another question we can ask is how the labels are distributed among the SMTP flows, and we can see that anomaly spam is highest, and then comes background spam. So, can we say from this that SMTP flows are spam, and maybe build a model that answers that question for us? That could be one machine learning model you build from this data set. Again, we can also verify the destination port of flows that were labeled as spam. You'd expect port 25 for the SMTP service, and we can see that SMTP with destination port 25 has a lot of counts, but there are some other destination ports for which the count is really low, and when we're doing an analysis at this scale, those data points might not really be needed. So, as part of the data prep and data cleaning, we might want to get rid of those records.
So now, going back to the graph that I showed earlier, we can try to plot the daily trends by aggregating them.
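The exploration queries just described might look roughly like this, again with assumed table and column names:

    -- Time span of the data set.
    SELECT MIN(flow_end), MAX(flow_end) FROM unique_flows;

    -- Summary statistics and percentiles for the numeric columns.
    SELECT SUMMARIZE_NUMCOL(duration, src_port, dst_port) OVER() FROM unique_flows;

    -- Protocol mix and most popular network services.
    SELECT protocol, COUNT(*) FROM unique_flows GROUP BY protocol ORDER BY 2 DESC;
    SELECT dst_port, COUNT(*) FROM unique_flows GROUP BY dst_port ORDER BY 2 DESC LIMIT 10;

    -- Label distribution among SMTP flows.
    SELECT label, COUNT(*) FROM unique_flows WHERE dst_port = 25
    GROUP BY label ORDER BY 2 DESC;

    -- Build a balanced view of the data with hybrid sampling.
    SELECT BALANCE('balanced_flows', 'unique_flows', 'label', 'hybrid_sampling');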
We take the unique flows and convert them into flow counts, a manageable number that we can then feed into one of the algorithms. Now, PCA, principal component analysis, is a really powerful algorithm in Vertica. What it essentially does is this: a lot of times you have a high number of columns that might be highly correlated with each other, and you can feed them into the PCA algorithm and it will give you a list of principal components that are linearly independent from each other. Each of these components explains a certain amount of the variance of the overall data set. You can see here that component one explains about 73.9% of the variance, and component two explains about 16%, so if you combine those two components alone, that accounts for around 90% of the variance.
Now, you can use PCA for a lot of different purposes, but in this specific example, we want to see what sort of insight we get if we combine all the data points we have, grouped by day of the week, because once you're down to two components, it's really easy to plot them. So, we fit the PCA model first and then apply it to our data set, and this is the graph we get as a result. Component one is on the x-axis, component two on the y-axis, and each of these points represents a day of the week. Compare this to the graph we saw earlier, which had a lot of lines; the more days or weeks we added, the more lines we'd have, versus this graph, in which you can clearly tell that the five days of traffic from Monday to Friday are closely clustered together, so probably pretty similar to each other, while Saturday's traffic sits apart from all of those days and is also further away from Sunday's. So, those two days of traffic are different from the other days, and we can always dive deeper and look at exactly what's happening here and how this traffic actually differs. But with just a few functions and some pretty simple SQL queries, we were already able to get pretty good insight from the data set we had.
Now, let's move on to the next part of this talk, on importing and exporting PMML models to and from Vertica. The current common practice when putting machine learning models into production is that you have a dev or test environment, and in it you might be using a lot of different tools: scikit-learn, Spark, R. Once you want to deploy these models into production, you put them into containers, and a pool of containers in the production environment talks to your database. That could be your analytical database, and all of the new incoming data lands in the database itself. So, as I mentioned on one of the earlier slides, there is a lot of data transfer happening between that pool of containers hosting your trained machine learning models and the database, from which you get data for scoring and to which you send the scores back.
So, why would you need to transfer your models at all? The thing is that no machine learning platform provides everything.
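A sketch of the PCA workflow above, assuming the hourly flow counts have been pivoted into one row per day with one column per hour (abbreviated to four columns here; the model and table names are made up):

    -- daily_traffic is assumed to hold one row per day: day, h0, h1, ..., h23.
    -- Fit a PCA model on the hourly-count columns (abbreviated list).
    SELECT PCA('pca_flows', 'daily_traffic', 'h0, h1, h2, h3');

    -- Inspect how much variance each principal component explains.
    SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='pca_flows',
                               attr_name='singular_values');

    -- Project each day onto the first two components for a 2-D plot.
    SELECT APPLY_PCA(day, h0, h1, h2, h3
                     USING PARAMETERS model_name='pca_flows',
                                      key_columns='day',
                                      exclude_columns='day',
                                      num_components=2) OVER() FROM daily_traffic;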
One tool might have some really cool algorithms, but another, say Spark, might have its own benefits in terms of additional algorithms or other capabilities you're looking for, and that's why a lot of these tools might be used in the same company at the same time. Then there are functional considerations as well: you might want to isolate the data between your data science team and your production environment, or you might want to score pre-trained models on edge nodes, where you probably cannot host a big solution. So there's a whole range of use cases where model movement, model transfer from one tool to another, makes sense.
Now, one of the common methods for transferring models from one tool to another is the PMML standard. It's an XML-based model exchange format, a standard way to define statistical and data mining models, and it helps you share models between different applications that are PMML-compliant. It's a really popular standard, and it's the one we chose for moving models to and from Vertica.
Along with this model movement capability, there are a lot of model management capabilities that Vertica offers. Models are essentially first-class citizens of Vertica. What that means is that each model is associated with a database schema, so the user that initially creates a model is its owner, but they can transfer the ownership to other users and work with the ownership rights in the same way they would with any other relation in the database. The same kinds of commands you use for granting access to a relation, changing its owner, changing its name, or dropping it, you can use for models as well. There are a lot of functions for exploring the contents of models, which really helps in putting those models into production. The metadata of these models is also available for model management and governance, and finally, the import/export capability lets you apply all of these operations to models you have imported, or intend to export, while they're in the database.
I think it would be nice to go through an example to showcase some of these capabilities, including the PMML model import and export. The workflow for export is that we train a logistic regression model on some data and save it as an in-database Vertica model. Then we explore the summary and attributes of the model, look at what's inside it, what the training parameters and coefficients are, and then we export the model as PMML so an external tool can import it. Similarly, we'll go through an example for import: we'll take a PMML model trained outside of Vertica, import it, and from there on treat it essentially as an in-database PMML model. We'll explore the summary and attributes of the model in much the same way as an in-database model, apply the model for in-database scoring on some test data, and get the prediction results.
So first, we want to create a connection with the database. In this case, we are using a Python Jupyter notebook.
We have the Vertica Python connector here, a really powerful connector that lets you do a lot of cool stuff with the database from the Jupyter front end, but you can use any other SQL front end or, for that matter, any Python IDE that lets you connect to the database.
So, exporting a model. First, we create a logistic regression model: select LOGISTIC_REG, give it a model name, then the input relation, which might be a table, a temporary table, or a view, and then the response column and the predictor columns. So, we get a logistic regression model. Now, we look at the models table and see that the model has been created. This is a table in Vertica that contains a list of all the models in the database. We can see that the model we just created is there, with VERTICA_MODELS as its category, logistic regression as the model type, and some other metadata around the model as well.
Now, we can look at some of the summary statistics of the model. We can look at the details: the predictors, coefficients, standard error, z value, and p value. We can look at the regularization parameters; we didn't use any, so that shows a value of one, but if you had used them, they would show up here. There's also the call string and additional information regarding iteration count, rejected row count, and accepted row count. We can also look at the list of attributes of the model: select GET_MODEL_ATTRIBUTE using parameters model_name = 'myModel', and for this particular model it gives us the names of all the attributes. Similarly, you can look at the coefficients of the model in a column format: using the same parameters, we add attribute name equals 'details', because we want the details for this model, and we get the predictor names, coefficients, standard errors, z values, and p values.
So now, we can export this model. We use select EXPORT_MODELS and give it a path where we want the model to be exported, the name of the model to export, because you might have a lot of models you've created, and the category, which in our example is PMML, and we get a status message that the export was successful.
Now, let's move on to the importing models example. In much the same way that we created a model in Vertica and exported it, you might want to create a model outside of Vertica in another tool and then bring it into Vertica for scoring, because Vertica holds all of the data, and scoring happens a lot more frequently than model training. In this particular case, we do a select IMPORT_MODELS and import a logistic regression model that was created in Spark; the category again is PMML. We get a status message that the import was successful. Now let's look at the models table and confirm that the model is really present. Previously when we ran this query we had only myModel there, so that was the only entry you saw, but now that this model is imported, you can see it as line item number two: the Spark logistic regression model, in the public schema.
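Taken together, the model-management, training, inspection, export, and import steps just walked through might look roughly like this; the model names, paths, users, and columns are assumptions for illustration:

    -- Train an in-database logistic regression model.
    SELECT LOGISTIC_REG('myModel', 'training_data', 'label',
                        'duration, src_port, dst_port');

    -- List models and their metadata.
    SELECT model_name, schema_name, owner_name, category FROM models;

    -- Inspect it: summary, attribute list, and per-predictor details.
    SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='myModel');
    SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='myModel');
    SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='myModel',
                               attr_name='details');

    -- Manage a model like any other relation.
    ALTER MODEL myModel OWNER TO analyst_bob;
    GRANT USAGE ON MODEL myModel TO data_science_team;

    -- Export the model as PMML for use in other tools.
    SELECT EXPORT_MODELS('/home/dbadmin/models', 'myModel'
                         USING PARAMETERS category='PMML');

    -- Import a PMML model trained elsewhere (e.g., in Spark).
    SELECT IMPORT_MODELS('/home/dbadmin/models/spark_logistic_reg'
                         USING PARAMETERS category='PMML');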
The category here, however, is different, because it's not a native Vertica model but an imported one, so you get PMML here, along with the other metadata for the model.
Now, let's do some of the same operations we did with the in-database model. We can look at the summary of the imported PMML model: you can see the function name, data fields, predictors, and some additional information. Moving on, let's look at the attributes of the PMML model with select GET_MODEL_ATTRIBUTE, essentially the same query we ran earlier; the only difference is the model name. You get the attribute names, attribute fields, and number of rows. We can also look at the coefficients of the PMML model: name, exponent, and coefficient. So yeah, pretty much the same as what you can do with an in-database model; you can perform all these operations on an imported model too.
One additional thing we want to do here is use this imported model for prediction. In this case, we do a select PREDICT_PMML and give it some values, using parameters model name = the Spark logistic regression model, and match_by_pos, a really cool feature, set to true. If you have a model imported from another platform where, let's say, you have 50 columns, the column names in the environment where you trained the model might be slightly different from the column names you have set up in Vertica, but as long as the order is the same, Vertica can match those columns by position, and you don't need the exact same names. So in this case we set that to true, and PREDICT_PMML gives us a prediction of one. Here we gave it a single set of values, but you can also run it on a table; in that case you get the predictions for the table, and you can look at the evaluation metrics to see how well you did.
Now, just to wrap this up, it's really important to know the distinction between using your models in a single-node tool you might already be using, like Python or R, versus Vertica. Let's say you build a model in Python; it might be a single-node solution. After building that model, suppose you want to do prediction on a really large amount of data, and you don't want the overhead of moving that data out of the database every time you predict. What you can do is import that model into Vertica, and what Vertica does differently than Python is that the PMML model is actually distributed across each node in the cluster, so it's applied to the data segments on each of those nodes, with potentially multiple threads running for that prediction. So the prediction speed you get is much, much faster. Similarly, when you build a machine learning model in Vertica, the objective is mostly to use all of your data and build a model that's accurate, not just one trained on a sample. You can build that model, and the model building process is likewise distributed across all nodes in the cluster, using all the threads and processes available to it within those nodes.
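And a sketch of the scoring step with the imported model; the literal values, model name, table, and column names are illustrative:

    -- Score a single record, matching input columns to the model by position.
    SELECT PREDICT_PMML(5.2, 443, 80
                        USING PARAMETERS model_name='spark_logistic_reg',
                                         match_by_pos='true');

    -- Score a whole table of new data.
    SELECT PREDICT_PMML(duration, src_port, dst_port
                        USING PARAMETERS model_name='spark_logistic_reg',
                                         match_by_pos='true') AS prediction
    FROM new_flows;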
So, really fast model training. But let's say you want to deploy the model on an edge node and do prediction closer to where the data is being generated; you can export that model in PMML format and deploy it on the edge node. So, it's really helpful for a lot of use cases.
And just some closing takeaways from our discussion today. Vertica is a really powerful tool for machine learning: for data preparation, model training, prediction, and deployment. You might want to use Vertica for all of these steps or just some of them; either way, Vertica supports both approaches. In the upcoming releases, we are planning more import and export capability through PMML models. Initially, we're supporting k-means, linear, and logistic regression, but we'll keep adding more algorithms, and the plan is to eventually support custom models. If you want to do that today, our TensorFlow integration is there for you to use, but with PMML, this is the starting point for us and we'll keep improving it. Vertica models can be exported in PMML format for scoring on other platforms, and similarly, models built in other tools can be imported for in-database machine learning and in-database scoring within Vertica. There are a lot of useful model management capabilities provided in Vertica, with more on the roadmap that we'll keep developing. Many ML functions and algorithms are already part of the in-database library, and we keep adding to that as well. So, thank you so much for joining the discussion today, and if you have any questions, we'd love to take them now. Back to you, Sue.

Published Date : Mar 30 2020


Doug Merritt, Splunk | Splunk .conf19


 

>> Announcer: Live from Las Vegas, it's theCUBE! Covering Splunk .conf19. Brought to you by Splunk.
>> Okay, welcome back, everyone. This is day three live CUBE coverage here in Las Vegas for Splunk's .conf. It's the 10-year anniversary of their big customer event. I'm John Furrier, theCUBE. This is our seventh year covering, riding the wave with Splunk: from scrappy startup, to public company, massive growth, now a market leader continuing to innovate. We're here with the CEO, Doug Merritt of Splunk. Thanks for joining me, good to see you.
>> Thank you for being here, thanks for having me.
>> John: How ya feelin'? (laughs)
>> Exhausted and energized simultaneously. (laughs) It was a fun week.
>> You know, every year when we have the event we discuss Splunk's success and the loyalty of the customer base, the innovation. You guys are providing the value, you've got a lot of happy customers, and you've got a great ecosystem and partner network growing. You're now growing even further; every year it just gets better. This year has had a lot of big highlights: new branding, so you've got that next-level thing goin' on, and new platform tweaks bringing this cohesive thing together. What are your highlights this year? I mean, there's so much goin' on, what are your highlights?
>> So where you started is always my highlight of the show: being able to spend time with customers. I have never been at a company where I feel so fortunate to have the passion and the dedication and the enthusiasm and the gratitude of customers that we have here. So that, I tell everyone at Splunk, is like a holiday function for a kid for me, where the energy keeps me going all year long, so that always is number one. And then around the customers, what we've been doing with the technology architecture, the platform, and the depth and breadth of what we've been working on, honestly, for four plus years, it really, I think, has come together in a unique way at this show.
>> Last year you had a lot of announcements that were intentional announcements: it's coming. They're coming now, they're here, they're shipping.
>> They're here, they're here.
>> What is some of the feedback you're hearing? Because a lot of it has a theme where, you know, we kind of pointed this out a couple of years ago, it's like a security show now, but it's not a security show, but there's a lot of security in there. What are some of the key things that have come out of the oven that people should know about that are being delivered here?
>> So the core of what we're trying to communicate with Data-to-Everything is that you need a very multifaceted data platform to be able to handle the huge variety of data that we're all dealing with, and Splunk has been known for, and been very successful at, being able to index messy, unstructured data and make sense of it even though it's not structured in the index, and that's been, and still is, incredibly valuable. But we started almost four years ago on a journey of adding in stream processing: before the data gets anywhere, to our index or anywhere else, it's moving all around the world, so how do you actually find that data and begin to take advantage of it in flight? We announced the beta of Data Stream Processor last year, but it went production this year: four years of development, a ton of patents, a 40-plus, 50-plus person development team behind it, a lot of hard engineering, and a really elegant interface to get it there.
And then on the other end, to complement the index, data is landing all over the place, not just in our index, and we're very aware that different structures exist for different needs. A data warehouse has different properties than a relational database, which has different properties than a NoSQL column store or in-memory database, and data is only going to continue to be more dispersed. So again, four plus years ago we started on what is now Data Fabric Search, which we pre-announced in beta format last year. That went production at this show, with the ability to address a distributed Splunk landscape, but more importantly, we demoed the integration with HDFS and S3 landscapes as the proof point that we've built a connector framework, so that this is not just an incredibly high-speed, high-cardinality search processing engine, but really a federated search engine as well. So now we can operate on data in the stream when it's in motion; we obviously still have all the great properties of the Splunk index, and I was really excited about Splunk 8.0 and all the features in that; and we can go get data wherever it lives, across a distributed Splunk environment, but increasingly across the more and more distributed data environment.
>> So this is a data platform. This is absolutely a data platform, so that's very clear. Now, the success of platforms, in the enterprise at least, not just small and medium-sized businesses, comes down to this: you can have a tool that kind of looks like a platform; there are some apps out there that I would point to and say, "Hey, that looks like a tool, it's really not a platform." You guys are a platform. But the success of a platform rests on two things, ecosystem and apps, because if you're a platform that's enabling value, you've got to have those. Talk about how you see the ecosystem success and the app success. Is that happening in your view?
>> It is happening. We have over 2,000 apps on our Splunkbase framework, which is where any of our customers can go and download an application to help draw value out of a Palo Alto firewall, or ensure integration with a ServiceNow trouble-ticketing system, and thousands of other examples. And that has grown from less than 300 apps when I first got here six years ago to over 2,000 today. But that is still the earliest innings, the earliest pitch of the earliest inning of the journey. Why aren't there 20,000, 200,000, two million apps out there? A piece of it is we have had to up the game on how you interface with the platform, and for us that means a stable set of services: well-managed, well-articulated, consistently maintained services. That's been a huge push with the core Splunk index, but it's also a big amount of work we've been doing on everything from the separation between Phantom runbooks and playbooks and the underlying orchestration automation, and it's a key component of our Stream Processor: what transformations are you doing, what enrichments are you doing? That has to live separate from the underlying technology, the Kafka transport mechanism, or Kinesis, or whatever happens in the future.
So that investment to make sure we have an effective and stable set of services has been key, but then you complement that with the amazing set of partners that are out here, making sure they're educated and enabled on how to take advantage of the platform, and then feather in things like the Splunk Ventures announcement, the Innovation Fund and Social Impact Fund, to further double down on: hey, we are here to help in every way. We're going to help with enablement, we're going to help with sell-through and marketing, and we'll help with investment.
>> Yeah, I think this is smart, and one of the things I'll point out is the feedback we heard from customers in conversations we had here on theCUBE and in the hallway: there's a lot of great feedback on the automation and the machine learning toolkit, which is a good tell-sign of the engagement level, of how they're dealing with data, and this kind of speaks to data as a value. Value creation from data seems to be the theme. It's not just data for data's sake; I mean, managing data is all hard stuff, but value from the data. You mentioned Ventures, you've got a lot of tech-for-good stuff goin' on, you're investing in companies that are standing up data-driven ventures to solve world problems, you've got other things, so you guys are adjusting. In the middle innings of the data game: platform update, business model changes. Talk about some of the consumption changes. Now you've got Splunk Cloud; what's goin' on with (laughs) how you charge? How are customers consuming? What moves did you guys make there, and what's the result?
>> Yeah, it's a great intro: data is awesome, but we all gather data to get to decisions first and actions second. Without an action there is no point in gathering data, and so many companies have been working their tails off to digitize their landscapes. Why? Well, you want a more flexible landscape, but why the flexibility? Because there's so much data being generated that if you can get to effective decisions and then actions, that landscape can adapt very, very rapidly, which goes back to machine learning and eventual AI-type opportunities. So that is absolutely, squarely where we've been focused: translating that data into value and into actual outcomes, which is why our orchestration and automation piece was so important. One of the gating factors that we felt has existed is that for the Splunk index, and it's only for the Splunk index, the pricing mechanism has been data volume, and that's a little bit contrary to the promise, which is that you don't know where the value is going to be within data, so whether it's a gigabyte or a petabyte, why shouldn't you be able to put whatever data you want in to experiment? And so we came out with some updates in pricing a month and change ago that we were reiterating at the show, and will continue to drive through a, hopefully, very aggressive and clear marketing and communications framework: for people that have adjusted to the data volume metric, we're trying to make that much simpler. There's now a limited set of bands, or tiers, from 100 gigs to unlimited, so that you really get visibility: all right, I think I want to play with five terabytes, I know what that band looks like, and it's very liberal.
So if you wind up with six and a half terabytes you won't be penalized, and then there's a complementary metric, which I think is ultimately going to be the more long-lived metric for our infrastructure-bound products, which is virtual CPU, or virtual core. When I think about our index, stream processing, federated search, and the execution of automation, all of those are basically a factor of how much infrastructure you're going to throw at the problem, whether it's CPU, storage, or network. So I can see a day when Splunk Enterprise and the index, and everything else at that lower level, at that infrastructure layer, are all just a series of virtual CPUs or virtual cores. But we're offering choice; we really are customer-centric. Whether you want a more liberal data volume or you want to switch to infrastructure-based pricing, we're there, and our job is to help you understand the value translation of both, because all that matters is turning data into action and into doing.
>> It's interesting; in the news yesterday quantum supremacy was announced. Google claims it, IBM's debating it, but quantum computing just points to the trend that more compute's coming, so this is going to be a good thing for data. You mentioned the pricing thing. This brings up a topic we've been hearing all week on theCUBE: diverse data is actually great for machine learning, great for AI. Bringing in diverse data gives you more aperture into data, and that actually helps. With the diversity comes confusion, and this is where the pricing seems to hit. You're trying to create, if I get this right, pricing that matches the needs of the diverse use of data. Is that kind of how you guys are thinkin' about it?
>> It meets the needs of diverse data and also provides a lot of clarity for people, in that when you get to a certain threshold we stop charging you altogether, right? Once you get above tens of terabytes, up to 100 terabytes, just put in as much data as you want. The foundation of Splunk, going back to the first days, is that we're the only technology that still exists on the index side that takes raw, unformatted data, doesn't force you to cleanse or scrub it in any way, and then takes all that raw data and actually provides value through the way that we interact with the data with our query language. And that design architecture, I've said it for five, six years now, is completely unique in the industry. Everybody else thinks that you've got to get to the data you want to operate on and then put it somewhere, and the way that life works is much more organic and emergent: you've got chaos happening, and then how do you find patterns and value in that chaos? Well, that chaos winds up being pretty voluminous. So how do we help more organizations? Some of the leading organizations are at five to 10 petabytes of data per day going through the index. How do we help everybody get there? 'Cause you don't know which nugget across that petabyte or 10-petabyte set is going to be the key to solving a critical issue, so let's make it easy for you to put that data in to find those nuggets. But once you know what the pattern is, now you're in a different world, the structured-data world of metrics, or KPIs, or events, or multidimensional data that is much more curated, and by nature that's going to be more fine-grained. There's not as much volume there as there is in the raw data.
>> Doug, I notice also at the event here there's a focus on verticals.
Can you comment on the strategy there? Is that by design?
>> It's definitely by design.
>> Share some insight into that.
>> So we launched with an IT operations focus, we wound up progressing over the years to a security operations focus, and now we're doubling down, with Omnition, SignalFx, VictorOps, and Streamlio as a new acquisition, on the DevOps and next-gen app dev buying centers. As a company, in how we go to market and what we are doing with our own solutions, we stay incredibly focused on those three very technical buying centers, but we've also seen that data is data. The data you're bringing in to solve a security problem can be used to solve a manufacturing problem, or a logistics and supply chain problem, or a customer sentiment analysis problem, so how do you make use of that data across those different buying centers? We've set up a verticals group to seed, and continue to seed, the opportunity within those different verticals.
>> And that's compatible with the horizontally scalable Splunk platform. That's kind of why that exists, right?
>> The overall platform that was in every keynote, starting with mine, is completely agnostic and horizontal. The solutions on top, the security operations, ITOps, and DevOps, are very specific to those users, but they're using the horizontal platform. And then you wind up walking into the Accenture booth and seeing how they've taken similar data that the SecOps teams gathered to actually provide insight on effective rail transport for DB Cargo, or effective cell tower triangulation and capacity for a major Australian cell company, or effective manufacturing and logistics supply chain optimization for a manufacturer and all their different retail distribution centers.
>> Awesome. You know, I know you've talked with Jeff Frick in the past, and Stu Miniman and Dave Vellante, about user experience; I know that's something that's near and dear to your heart. You guys, it has been rumored, have some user experience work being done on the onboarding for your Splunk Cloud, making it easier to get into this new Splunk platform. What can we expect on the user experience side? (laughs)
>> So, for any of you out there that want to try it, we've got Splunk Investigate; that's one of the first applications on top of the fully decomposed, services-layered, stateless Splunk Cloud. Mission Control actually is a complementary one; those are the first two apps on top of that new framework. And the UI and experience in Splunk Investigate is, I think, a good example of the ease of both coming to and using the product. There's a very liberal amount of data you get for free just to experiment with Splunk Investigate, and the data onboarding experience is, I think, very elegant. I love the UI; it's a Jupyter-style, workbook-type interface. But if you think about what investigators need, they need both some breadcrumbs on where to start and how to end, and the ability to bring in anybody necessary so that you can swarm and attack a problem very efficiently. So when you go back and look at why we bought VictorOps, it wasn't because we think the IT alerting space is a massive space we're going to own; it's because collaboration is incredibly important for swarming incidents of any type, whether they're security incidents or manufacturing incidents.
So the facilities VictorOps gives, allowing distributed teams and virtual teams to very quickly get to resolution, you're going to find those baked into products like Mission Control, 'cause it's one of the key facilities Tim talked about in his keynote: design, mobility, high collaboration. 'Cause luckily people still matter, and while ML is helping all of us be more productive, it isn't taking away the need for us; so how do you get us to cooperate effectively? And so, for our cloud-based apps, I encourage any of you out there: go try Splunk Investigate. It's a beautiful product, and I think you'll be blown away by it.
>> Great success on the product side, and then great success on the customer side; you've got great, loyal customers. But I've got to ask you about the next-level Splunk. As you look at this event, what jumps out at me is the cohesiveness of the story around the platform and the apps; the ecosystem's great, but the new branding, Data-to-Everything, is not product-specific, 'cause you have product leadership. This is a whole next-level Splunk. What is the next-level Splunk vision?
>> And I love the pink and orange, bold colors. So when I've thought about the issues that are some of the blockers to Splunk eventually fulfilling the destiny that we could have, number one is awareness. Who the heck is Splunk? People have very high variance in their understanding of Splunk: log aggregation, security tool, IT tool. What we've seen over and over is that it is much more this data platform, and certainly with the announcements it's becoming more of a data fabric, or platform, that can be used for anything. So how do we bring awareness to Splunk? Well, let's help create a category, and it's not up to us to create the category, it's up to all of you, but Data-to-Everything in our minds represents the power of data, and while we will continue internally to focus on those technical buying centers, everything is solvable with data. So we're trying to really reinforce the importance of data and the capabilities that something like Splunk brings. Cloud becomes a really important part of executing on that, 'cause it makes it so much easier for people to immediately try something and get value, but on-prem will always be important as well, 'cause data has gravity, data has risk, data has cost to move. And there are so many use cases where you would just never push data to the cloud, and it's not because we don't love cloud. If you have a factory producing 100 terabytes an hour in an area where you've got poor bandwidth, there's no option for a cloud connect of high scale, so you'd better be able to process, make sense of, and act on that data locally.
>> And you guys are great in the cloud too, and on-premise. But final word, I want to get your thoughts to end this segment; I know you've got to run. Thanks for your time, and congratulations on all your success. Data for good: there's a lot of tech-for-bad kind of narratives goin' on, but there's a real resurgence of tech for good. A lot of people, entrepreneurs, for-profit and nonprofit, are doing ventures for good. Data is a real theme, and data for good is part of the Data-to-Everything. Talk about data for good real quick.
>> Yeah, we're really excited about what we've done with Splunk4Good, our nonprofit-focused entity.
The Splunk Pledge, which is a classic 1-1-1 approach, to make sure that we're able to help organizations that need the help do something meaningful within their world, and then the Splunk Social Impact Fund, which is trying to put our money where our mouth is to ensure that if funding and scarcity of funds is an issue in getting to effective outcomes, we can be there to support. At this show we've featured three awesome charities, Conservation International, NetHope, and the Global Emancipation Network, that are all trying to tackle really thorny problems, different problems in different ways, but data winds up being at the heart of one of the ways to unlock what they're trying to get done. We're really excited and proud that we're able to actually make meaningful donations to all three of those, but it is a constant theme within Splunk, and I think something that all of us, from the tech community and non-tech community, are going to have to help evangelize, is that with every invention and with everything that occurs in the world there is the power to take it and make a less noble execution of it, you know, there's always potential harmful activities, and then there's the power to actually drive good, and data is one of those. >> Awesome. >> Data can be used as a weapon, it can be used negatively, but it also needs to be liberated so that it can be used positively. While we're all kind of concerned about our own privacy and really, really personal data, we're not going to get to the type of healthcare and genetic, massive shifts in changes and benefits without having a way to begin to share some of this data. So putting controls around data is going to be important, putting people in the middle of the process to decide what happens to their data, and some consequences around misuse of data are going to be important. But continuing to keep a mindset of all good happens as we become more liberal, globalization is good, free flow of good-- >> The value is in the data. >> Free flow of people, free flow of data ultimately is very good. >> Doug, thank you so much for spending the time to come on theCUBE, and again congratulations on great culture. It's also worth noting, just to give you a plug here because it's, I think, very valuable: one of the best places to work for women in tech. You guys recently got some recognition on that. That is a huge accomplishment, congratulations. >> Thank you, thank you, we had a great diversity track here, which is really important as well. But we love partnering with you guys, thank you for spending an entire week with us and for helping to continue to evangelize and help people understand what the power of technology and data can do for them. >> Hey, video is data, and we're bringin' that data to you here on theCUBE, and of course, CUBE cloud coming soon. I'm John Furrier here live at Splunk .conf with Doug Merritt, the CEO. We'll be back with more coverage after this short break. (futuristic music)

Published Date : Oct 24 2019


Deploying AI in the Enterprise


 

(orchestral music) >> Hi, I'm Peter Burris and welcome to another digital community event. As we do with all digital community events, we're gonna start off by having a series of conversations with real thought leaders about a topic that's pressing to today's enterprises as they try to achieve new classes of business outcomes with technology. At the end of that series of conversations, we're gonna go into a crowd chat and give you an opportunity to voice your opinions and ask your questions. So stay with us throughout. So, what are we going to be talking about today? We're going to be talking about the challenge that businesses face as they try to apply AI, ML, and new classes of analytics to their very challenging, very difficult, but nonetheless very value-producing outcomes associated with data. The challenge that all these businesses have is that often, you spend too much time in the infrastructure and not enough time solving the problem. And so what's required is new classes of technology and new classes of partnerships and business arrangements that allow for us to mask the underlying infrastructure complexity from data science practitioners, so that they can focus more time and attention on building out the outcomes that the business wants and a sustained business capability so that we can continue to do so. Once again, at the end of this series of conversations, stay with us, so that we can have that crowd chat and you can, again, ask your questions, provide your insights, and participate with the community to help all of us move faster in this crucial direction for better AI, better ML and better analytics. So, the first conversation we're going to have is with Anant Chintamaneni. Anant's the Vice President of Products at BlueData. Anant, welcome to theCUBE. >> Hi Peter, it's great to be here. I think the topic that you just outlined is a very fascinating and interesting one. Over the last 10 years, data and analytics have been used to create transformative experiences and drive a lot of business growth. You look at companies like Uber, AirBnB, and you know, Spotify; practically every industry's being disrupted. And the reason why they're able to do this is because data is in their DNA; it's their key asset and they've leveraged it in every aspect of their product development to deliver amazing experiences and drive business growth. And the reason why they're able to do this is they've been able to leverage open-source technologies, data science techniques, and big data, fast data, all types of data to extract that business value and inject analytics into every part of their business process. Enterprises of all sizes want to take advantage of those same assets that the new digital companies are leveraging, and drive digital transformation and innovation in their organizations. But there's a number of challenges. First and foremost, if you look at the enterprises where data was not necessarily in their DNA, to inject that into their DNA is a big challenge. The executives, the executive branch, definitely want to understand where they want to apply AI, how to kind of identify which use cases to go after. There is some recognition coming in. They want faster time-to-value and they're willing to invest in that. >> And they want to focus more on the actual outcomes they seek as opposed to the technology selection that's required to achieve those outcomes. >> Absolutely.
I think it's, you know, a boardroom mandate for them to drive new business outcomes, new business models, but I think there is still some level of misalignment between the executive branch and the data worker community, which they're trying to upgrade with the new-age data scientists, the AI developers, and then you have IT in the middle who has to basically bridge the gap and enable the digital transformation journey and provide the infrastructure, provide the capabilities. >> So we've got a situation where people readily acknowledge the potential of some of these new AI, ML, big data related technologies, but we've got a mismatch between the executives that are trying to do evidence-based management, drive new models, the IT organization who's struggling to deal with data-first technologies, and data scientists who are few and far between, and leave quickly if they don't get the tooling that they need. So, what's the way forward, that's the problem. How do we move forward? >> Yeah, so I think, you know, I think we have to double-click into some of the problems. So the data scientists, they want to build a tool chain that leverages the best-in-class, open source technologies to solve the problem at hand; they want to be able to compile these tool chains, they want to be able to apply and create new algorithms and operationalize, and do it in a very iterative cycle. It's a continuous development, continuous improvement process, which is at odds with what IT can deliver, which is that they have to deliver data that is dispersed all over the place to these data scientists. They need to be able to provide infrastructure, which today they're not; there's an impedance mismatch. It takes them months, if not years, to be able to make those available, make that infrastructure available. And last but not least, security and control. It's just fundamentally not the way they've worked, where they can make data and new tool chains available very quickly to the data scientists. And for the executives, it's all about faster time-to-value, so there's a little bit of an expectation mismatch as well there, and so those are some of the fundamental problems. There's also reproducibility; like, once you've created an analytics model, to be able to reproduce that at scale, to be then able to govern that and make sure that it's producing the right results is fundamentally a challenge. >> Auditability of that process. >> Absolutely, auditability. And, in general, being able to apply this sort of model for many different business problems so you can drive outcomes in different parts of your business. So there's a huge number of problems here. And so what I believe, and what we've seen with some of these larger companies, the new digital companies that are driving business value, is they have invested in a unified platform where they've made the infrastructure invisible by leveraging cloud technologies or containers and essentially made it such that the data scientists don't have to worry about the infrastructure, they can be a lot more agile, they can quickly create the tool chains that work for the specific business problem at hand, scale it up and down as needed, be able to access data where it lies, whether it's on-prem, whether it's in the cloud or whether it's a hybrid model.
And so that's something that's required from a unified platform where you can do your rapid prototyping, you can do your development, and ultimately the business outcome and the value comes when you operationalize it and inject it into your business processes. So, I think fundamentally, this kind of a unified platform is critical, which, I think, a lot of the new-age companies have, but is missing with a lot of the enterprises. >> So, a big challenge for the enterprise over the next few years is to bring these three groups together, the business, the data science world, and the infrastructure world, to help with those problems and apply it successfully to some of the new business challenges that we have. >> Yeah, and I would add one last point, which is that we are on this continuous journey; as I mentioned, this is a world of open source technologies that are coming out from a lot of the large organizations out there, whether it's your Googles and your Facebooks. And so there is an evolution in these technologies, much like we've evolved from big data and data management to capture the data. The next sort of phase is around data exploitation with artificial intelligence and machine learning type techniques. And so, it's extremely important that this platform enables these organizations to future-proof themselves. So as new technologies come in, they can leverage them >> Great point. >> for delivering exponential business value. >> Deliver value now, but show a path to delivering value in the future as all of these technologies and practices evolve. >> Absolutely. >> Excellent, all right, Anant Chintamaneni, thanks very much for giving us some insight into the nature of the problems that enterprises face and some of the way forward. We're gonna be right back, and we're gonna talk about how to actually do this in a second. (light techno music) >> Introducing BlueData EPIC. The leading container-based software platform for distributed AI, machine learning, deep learning and analytics environments, whether on-prem, in the cloud or in a hybrid model. Data scientists need to build models utilizing various stacks of AI, ML and DL applications and libraries. However, installing and validating these environments is time consuming and prone to errors. BlueData provides the ability to spin up these environments on demand. The BlueData EPIC app store includes best-of-breed, ready-to-run Docker-based application images, like TensorFlow and H2O Driverless AI. Teams can also add their own images to provide the latest tools that data scientists prefer, and ensure compliance with enterprise standards. They can use the quick-launch button, which provides pre-configured templates with the appropriate application image and resources. For example, they can instantly launch a new sandbox environment using the template for TensorFlow with a Jupyter Notebook. Within just a few minutes, it'll be automatically configured with GPUs and easy access to their data. Users can launch experiments and make GPUs automatically available for analysis. In this case, the H2O environment was set up with one GPU. With BlueData EPIC, users can also deploy endpoints with the appropriate runtime, and the inference runtimes can use CPUs or GPUs. With the container-based BlueData platform, you can deploy fully configured distributed environments within a matter of minutes, whether on-prem, in the public cloud, or in a hybrid architecture. BlueData was recently acquired by Hewlett Packard Enterprise.
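For a rough sense of what one of those on-demand sandboxes amounts to under the hood, here is a minimal sketch using the generic Docker SDK for Python rather than the BlueData EPIC interface itself; the image name, port, and token below are illustrative assumptions, not EPIC specifics.

```python
# A hedged sketch: spin up a TensorFlow + Jupyter sandbox on a plain Docker
# host. EPIC's quick-launch templates automate this idea (plus cluster-wide
# GPU scheduling and data access); this only illustrates the general shape.
import docker

client = docker.from_env()

container = client.containers.run(
    "jupyter/tensorflow-notebook",          # assumed Project Jupyter stack image
    detach=True,
    ports={"8888/tcp": 8888},               # Jupyter's default notebook port
    environment={"JUPYTER_TOKEN": "demo"},  # fixed token for easy local login
)

print("Sandbox running at http://localhost:8888/?token=demo id:", container.short_id)
```

Tearing the sandbox down is a single container.remove(force=True) call, which matches the disposable, on-demand character the platform pitch describes.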
And now, HPE and BlueData are joining forces to help you on your AI journey. (light techno music) To learn more, visit www.BlueData.com >> And we're back. I'm Peter Burris and we're continuing to have this conversation about how businesses are turning experience with the problems of advanced analytics and the solutions that they seek into actual systems that deliver continuous, ongoing value and achieve the business capabilities required to make possible these advanced outcomes associated with analytics, AI and ML. And to do that, we've got two great guests with us. We've got Kumar Sreekanti, who is the co-founder and CEO of BlueData. Kumar, welcome back to theCUBE. >> Thank you, it is nice to be here, back again. >> And Kumar, you're being joined by a customer. Ramesh Thyagarajan is the executive director of the Advisory Board Company, which is part of Optum now. Ramesh, welcome to theCUBE. >> Great to be here. >> Alright, so Kumar, let's start with you. I mentioned up front this notion of turning technology and understanding into actual business capabilities to deliver outcomes. What has BlueData's journey been to make that happen? >> Yeah, it all started six years ago, Peter. It was a bold vision and a big idea, and no pun intended on big data, which was an emerging market then. And as everybody knows, the data was enormous and there was a lot of innovation around the periphery, but nobody was paying attention to how to make the big data consumable in the enterprise. And I saw an enormous opportunity to make this data more consumable in the enterprise and to give a cloud-like experience with the agility and elasticity. So, our vision was to build a software infrastructure platform like VMware, specially focused on data-intensive distributed applications, and this platform would allow enterprises to build cloud-like experiences both on-premises as well as on hybrid clouds, so that it paves the journey for their cloud experience. So I was very fortunate to put together a team, and I found good partners like Intel. So that actually is the genesis of BlueData. So, if you look back into the last six years, big data itself has gone through a lot of evolution, and so the marketplace and the enterprises have gone from offline analytics to AI- and ML-based workloads that are actually giving them predictive and descriptive analytics. What BlueData has done is, by making the infrastructure invisible, by making the tool set completely available as the tool set itself is evolving, in the process we actually created so many game-changing software technologies. For example, we are the first end-to-end containerized enterprise solution that gives you distributed applications. And we built a technology called DataTap that provides compute and data separation so that you don't have to actually copy the data, which is a boon for enterprises. We also actually built multitenancy so that enterprises can run multiple workloads on the same data, and Ramesh will tell you in a second here, in the healthcare enterprise, the multitenancy is such a very important element. And finally, we also actually contributed to many open source technologies; we have a project called KubeDirector, which is actually our own way to run stateful workloads on Kubernetes, and we are very happy to see customers like Ramesh using BlueData. >> Sounds like quite a journey, and obviously you've intercepted companies like the Advisory Board Company.
So Ramesh, a lot of enterprises have mastered, or you know, gotten, understood how to create data lakes with Hadoop, but then found that they still weren't able to connect to some of the outcomes that they saw. Is that the experience that you had? >> Right, to be precise, that is one of the kinds of problems we have. It's not just the data lake that we need to be able to do the workflows and other things; being a traditional company, in the business for a long time, we also have a lot of data assets that are not part of this data lake. We were finding it hard: how do we get to the data? Getting it and putting it in a data lake is a duplication of work. We were looking for some kind of solution that would help us gather the benefits of leaving the data alone but still be able to get at it. >> This is where (mumbles). >> This is where we were looking for things, and then I was lucky and fortunate to run into Kumar and his crew at one of the Hadoop conferences, and they demonstrated the way it can be done, so it immediately hit home; it was a big hit with us, and then we went back and did a POC and very quickly adopted the technology. And one of the benefits of adopting this technology is the level of containerization they are doing; it is helping me to address many needs: my data analysts, the data engineers, and the data scientists. I'm able to serve all of them, which otherwise wouldn't be possible for me with just this plain very (mumbles). >> So it sounds as though the partnership with BlueData has allowed you to focus on activities and problems and challenges above the technology, so that you can actually start bringing data science, business objectives and infrastructure people together. Have I got that right? >> Absolutely. So BlueData is helping me to tie them all together and provide added value to my business. We being in healthcare, the importance is we need to be able to look at large data sets over a period of time in order to figure out how a patient's health journey is happening. That is very important so that we can figure out the ways and means in which we can lower the cost of healthcare and also provide insights to the physicians so they can help get people to better health. >> So we're getting great outcomes today, especially around, as you said, that patient journey, where all the constituents can get access to those insights without necessarily having to learn a whole bunch of new infrastructure stuff, but presumably you need more. We're talking about a new world that you mentioned before upfront, talking about a new world, AI, ML, a lot of changes. A lot of our enterprise customers are telling us it's especially important that they find companies that not only deliver something today but demonstrate a commitment to sustain that value delivery process, especially as the whole analytics world evolves. Are you experiencing that as well? >> Yes, we are experiencing that, and one of the great advantages of the BlueData platform is that it gave me the ability to add new functionality, be it TensorFlow, be it H2O, be it RStudio, anything that I needed; I call them, they give me the images that are plug-and-play, we just put them in, and all the plumbing is practically transparent; nobody needs to know how it is achieved.
Now, in order to get to the next level of the predictive and prescriptive analytics, it is not just about having the data; you need to be able to have your curated data assets and processes on top of a platform that will help your data scientists make use of them. One of the biggest challenges is that a data scientist is not able to get their hands on data. The BlueData platform gives me the ability to do it and ensures we meet all the security and compliance requirements of the various regulations we're subject to. >> Kumar, congratulations. >> Thank you. >> Sounds like you have a happy customer. >> Thank you. >> One of the challenges that every entrepreneur faces is how you scale the business. So talk to us about where you are in the decisions that you made recently to achieve that. >> As an entrepreneur, when you start a company, odds are against you, right? You're always worried about it, right. You make so many sacrifices, yourself and your team and all that, but the customer is the king. The most important thing for us is to find satisfied customers like Ramesh, so we were very happy, and BlueData was very successful in finding that customer, because I think as you pointed out, as Ramesh pointed out, we provide that clean solution for the customer. But as you go through this journey as a co-founder and CEO, you always worry about how you scale to the next level. So we had partnerships with many companies including HPE, and when this opportunity came in front of me, with myself and my board, we saw this opportunity of combining the forces of BlueData's satisfied customers, innovative technology, and team with HPE's brand name, their world-class service, their investment in R&D, and their very large list of enterprise customers. We think putting these two things together provides that next journey in BlueData's innovation and for BlueData's customers. >> Excellent, so once again Kumar Sreekanti, co-founder and CEO of BlueData, and Ramesh Thyagarajan, who is the executive director of the Advisory Board Company and part of Optum, I want to thank both of you for being on theCUBE. >> Thank you. >> Thank you, great to be here. >> Now let's hear a little bit more about how this notion of bringing BlueData and HPE together is generating new classes of value that are making things happen today but are also gonna make things happen for customers in the future, and to do that we've got Dave Vellante, who's with SiliconANGLE Wikibon, joined by Patrick Osbourne, who's with HPE, in our Marlborough studio, so Dave, over to you. >> Thanks Peter. We're here with Patrick Osbourne, the vice president and general manager of big data and analytics at Hewlett Packard Enterprise. Patrick, thanks for coming on. >> Thanks for having us. >> So we heard from Kumar, let's hear from you. Why did HPE purchase, acquire BlueData? >> So if you think about it from three angles: platform, people and customers, right. Great platform, built for scale, addressing a number of these new workloads in big data analytics and certainly AI; the people that they have are amazing, right, great engineering team, awesome customer success team, team of data scientists, right. So you know, all the folks that have some really, really great knowledge in this space, so they're gonna be a great addition to HPE; and also on the customer side, great logos, major Fortune 500 customers in the financial services vertical, healthcare, pharma, manufacturing, so a huge opportunity for us to scale that within the HPE context.
>> Okay, so talk about how it fits into your strategy; specifically, what are you gonna do with it? What are the priorities, can you share some roadmap? >> Yeah, so you take a look at HPE's strategy. We talk about hybrid cloud, and specifically edge to core to cloud, and the common theme that runs through that is data, data-driven enterprises. So for us, we see the BlueData EPIC platform as a way to, you know, help our customers quickly deploy these new mode 2 applications that are fueling their digital transformation. So we have some great plans. We're gonna certainly invest in all the functions, right. So we're gonna do a force multiplier not only on product engineering and product delivery but also go-to-market and customer success. We're gonna come out on day one with some really good reference architectures, with some of our partners like Cloudera and H2O; we've got some very scalable building-block architectures to marry up the BlueData platform with our Apollo systems, for those of you who have seen those in the market; we've got our Elastic Platform for Analytics for customers who run these workloads, and now you'd be able to virtualize those in containers; and we're gonna be building out a big services practice in this area. So a lot of customers often talk to us about, we don't have the people to do this, right. So we're gonna bring those people to you as HPE through Pointnext: advisory services, implementation, ongoing help with customers. So it's going to be a really fantastic start. >> Apollo, as you mentioned Apollo. I think of Apollo sometimes as HPC, high performance computing, and we've had a lot of discussion about how that's sort of seeping into the mainstream, is that what you're seeing? >> Yeah absolutely, I mean we know that a lot of our customers have traditional workloads, you know, they're on the path to almost completely virtualizing those, right, but where a lot of the innovation is going on right now is in this mode 2 world, right. So your big data and analytics pipeline is getting longer, you're introducing new experiences on top of your product, and that's fueling, you know, essentially commercial HPC, and now that folks are using techniques like AI and model inference to make those services more scalable, more automated, we're starting to bring more of these platforms, these scalable architectures like Apollo. >> So it sounds like your roadmap has a lot of integration plans across the HPE portfolio. We certainly saw that with Nimble, but BlueData was working with a lot of different companies; it's software, so is the plan to remain open or is this an HPE thing? >> Yeah, we absolutely want to be open. So we know that we have lots of customers that choose; HPE is all about hybrid cloud, right, and that has a couple different implications. We want to talk about your choice of on-prem versus off-prem, so BlueData has a great capability to run some of these workloads: it essentially allows you to do separation of compute and storage, right, in the world of AI and analytics; we can run it off-prem as well, in the public cloud; but then we also have choice for customers, you know, any customer's private cloud. So if that means they want to run on other infrastructure besides HPE, we're gonna support that; we have existing customers that do that.
We're also gonna provide infrastructure that marries the software and the hardware together with frameworks like InfoSight that we feel will be, you know, a much better experience for the customers, but we'll absolutely be open and absolutely have choice. >> All right, what about the business impact? To take the customer perspective, what can they expect? >> So I think from a customer perspective, we're really just looking to accelerate deployment of AI in the enterprise, right, and that has a lot of implications for us. We're gonna have very scalable infrastructure for them; we're gonna be really focused on this very dynamic AI and ML application ecosystem through partnerships and support within the BlueData platform. We want to provide a SaaS experience, right. So whether that's GPUs or accelerators as a service, analytics as a service, we really want to fuel innovation as a service. We want to empower those data scientists; they're really hard to find, you know, they're really hard to retain within your organization, so we want to unlock all that capability and really just focus on innovation for the customers. >> Yeah, and they spend a lot of time wrangling data, so you're really going to simplify that with the cloud (mumbles). Patrick, thank you, I appreciate it. >> Thank you very much. >> Alright Peter, back to you in Palo Alto. >> And welcome back, I'm Peter Burris, and we've been talking a lot in the industry about how new tooling and new processes can achieve new classes of analytics, AI and ML outcomes within a business, but if you don't get the people side of that right, you're not going to achieve the full range of benefits that you might get out of your investments. Now to talk a little bit about how important the data science practitioner is in this equation, we've got two great guests with us. Nanda Vijaydev is the chief data scientist of BlueData. Welcome to theCUBE. >> Thank you Peter, happy to be here. >> Ingrid Burton is the CMO and business leader at H2O.AI. Ingrid, welcome to theCUBE. >> Thank you so much for having us. >> So Nanda Vijaydev, let's start with you. Again, having a nice platform is very, very important, but how does that turn into making the data science practitioner's life easier so they can deliver more business value? >> Yeah thank you, it's a great question. I think at the end of the day for a data scientist, what's most important is, did you understand the question that somebody asked you, and what is expected of you when you deliver something, and then you go about finding what you need: I need data, I need systems, and you know, I need to work with people, the experts in the process, to make sure that the hypothesis I'm working on is structured in a nice way where it is testable, it's modular, and I have, you know, a way for them to go back to show my results and keep doing this in an iterative manner. That's the biggest thing, because the satisfaction for a data scientist is when you actually take this and make use of it, put it in production, right. To make this whole thing easier, we definitely need some way of bringing it all together. That's really where, especially compared to the traditional data science where everything was monolithic, it was one system, there was a very set way of doing things, but now it is not, so you know, with the growing types of data, with the growing types of computation algorithms that are available, there's a lot of opportunity and at the same time there is a lot of uncertainty.
So it's really about putting that structure in place and making sure you get the best of everything and still deliver the results; that is the focus all data scientists strive for. >> And especially, the data scientist wants to operate in the world of uncertainty related to the business question, reducing that uncertainty, and not deal with the uncertainty associated with the underlying infrastructure. >> Absolutely, absolutely; you know, as a data scientist, a lot of time used to be spent in the past on where is the data; then the question was, what data do you want, and we'll give it to you, because the data always came in a nice structured, row-column format, and it had already lost a lot of context of what we had to look for. So it is really not about, you know, going back to systems that are pre-built or pre-processed; it's getting access to that real, raw data. It's getting access to the information as it came so you can actually make the best judgment of how to go forward with it. >> So you describe a world where business, technology and data science practitioners are working together, but let's face it, there's an enormous amount of change in the industry and quite frankly, a deficit of expertise, and I think that requires new types of partnerships, new types of collaboration, a real (mumbles) approach, and Ingrid, I want to talk about what H2O.AI is doing as a partner of BlueData and HPE to ensure that you're complementing these skills in pursuit of, or in service to, the customer's objectives. >> Absolutely, thank you for that. So as Nanda described, you know, data scientists want to get to answers, and what we do at H2O.AI is we provide the algorithms, the platforms, for data scientists to be successful. So when they want to try and solve a problem, they need to work with their business leaders, they need to work with IT, and they actually don't want to do all the heavy lifting, they want to solve that problem. So what we do is automatic machine learning platforms; we do that by optimizing algorithms and doing a lot of the heavy lifting that novice data scientists need, and we help expert data scientists as well. I talk about it as algorithms to answers, and actually solving business problems with predictions, and that's what machine learning is really all about. But really what we're seeing in the industry right now, and BlueData is a great example of this, is kind of taking some of the hard stuff away from a data scientist and making them successful. So working with BlueData and HPE, making us together really solve the problems that businesses are looking for, it's really transformative, and we've been through like the digital transformation journey, all of us have been through that. We are now in what I would term an AI transformation of sorts, and businesses are going to the next step. They had their data, they got their data infrastructure kind of seamlessly working together, the clusters and containerization, that's very important. Now what we're trying to do is get to the answers, and using automatic machine learning platforms is probably the best way forward. >> That's still hard stuff, but we're trying to free data science practitioners from focusing on hard stuff that doesn't directly deliver value. >> It doesn't deliver anything for them, right. They shouldn't have to worry about the infrastructure; they should worry about getting the answers to the business problems they've been asked to solve.
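As a concrete illustration of the automatic machine learning workflow Ingrid describes, here is a minimal sketch using H2O's open source Python API; the CSV file name and the "churned" target column are assumptions made up for the example, not a joint-customer dataset.

```python
# A minimal AutoML sketch with H2O's open source Python API. The data path
# and "churned" target column are illustrative assumptions.
import h2o
from h2o.automl import H2OAutoML

h2o.init()  # start or attach to a local H2O cluster

frame = h2o.import_file("customers.csv")
frame["churned"] = frame["churned"].asfactor()  # treat the target as categorical

train, test = frame.split_frame(ratios=[0.8], seed=1)

# AutoML does the heavy lifting: algorithm selection, tuning, and ensembling.
aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="churned", training_frame=train)

print(aml.leaderboard.head())                        # ranked candidate models
print(aml.leader.model_performance(test_data=test))  # held-out evaluation
```

The point of the sketch is the division of labor: the practitioner states the question (predict churn) and AutoML handles the model search, which is exactly the "algorithms to answers" framing above.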
>> So let's talk a little bit about some of the new business problems that are going to be able to be solved by these kinds of partnerships between BlueData and H2O.AI. Start us off, Nanda: what gets you excited when we think about the new types of business problems that customers are gonna be able to solve? >> Yeah, I think it is really, you know, that the question that comes to you is not filtered through someone else's lens, right. Someone is trying to solve an optimization problem, someone is trying to do a new product discovery, so all this is based on a combination of both data-driven and evidence-based approaches, right. For us as data scientists, what excites me is that I have the flexibility now that I can choose the best-of-breed technologies. I should not be restricted to what is given to me by an IT organization or something like that, but at the same time, in an organization, for things to work, there has to be some level of control. So it is really about having these types of environments, or having some platforms, where there is a team that can work on the control aspect, but as a data scientist, I don't have to worry about it. I have my flexibility of tools of choice that I can use. At the same time, when you talk about data, security is a big deal in companies, and a lot of times data scientists don't get access to data because of the layers and layers of security that they have to go through, right. So the excitement of the opportunity for me is, if someone else takes care of that problem, you know, just tell me where the source of data is that I can go to; don't filter the data for me, you know, don't already structure the data for me, but just tell me it's an approved source, right; then it gives me more flexibility to actually go and take that information and build. So having those controls taken care of well before I get into the picture as a data scientist makes it extremely easy for us to focus, you know, to her point, on the problem, right, focus on accessing the best-of-breed technology, and, you know, give back and have that interaction with the business users on an ongoing basis. >> So especially focus on, so speed to value so that you're not messing around with a bunch of underlying infrastructure, governance remaining in place so that you know what are the appropriate limits of using the data, with security that is embedded within that entire model, without removing fidelity out of the quality of data. >> Absolutely. >> Would you agree with those? >> I totally agree with all the points that she brought up, and we have joint customers in the market today, they're solving very complex problems. We have customers in financial services, joint customers there. We have customers in healthcare that are really trying to solve today's business problems, and these are everything from, how do I give new credit to somebody? How do I know what next product to give them? How do I know what customer recommendations can I make next? Why did that customer churn? How do I reach new people? How do I do drug discovery? How do I give a patient a better prescription? How do I pinpoint disease that I couldn't have seen before? Now we have all that data that's available and it's very rich, and data is a team sport.
It takes data scientists, it takes business leaders, and it takes IT to make it all work together, and together the two companies are really working to solve problems that our customers are facing, working with our customers because they have the intellectual knowledge of what their problems are. We are providing the tools to help them solve those problems. >> Fantastic conversation about what is necessary to ensure that the data science practitioner remains at the center and is the ultimate test of whether or not these systems and these capabilities are working for business. Nanda Vijaydev, chief data scientist of BlueData, and Ingrid Burton, CMO and business leader at H2O.AI, thank you very much for being on theCUBE. >> Thank you. >> Thank you so much. >> So let's now spend some time talking about how ultimately all of this comes together and what you're going to do as you participate in the crowd chat. To do that, let me throw it back to Dave Vellante in our Marlborough studios. >> We're back with Patrick Osbourne; alright Patrick, let's wrap up here and summarize. We heard how you're gonna help data science teams, right. >> Yup, speed, agility, time to value. >> Alright, and I know a bunch of folks at BlueData, the engineering team is very, very strong, so you picked up a good asset there. >> Yeah, I mean, amazing technology; the founders have a long lineage of software development and adoption in the market, so we're just gonna invest in them and let them loose. >> And then we heard the sort of better-together story from you; you got a roadmap, you're making some investments here, as I heard. >> Yeah, I mean, we're really focused on hybrid cloud and we want to have all these as-a-service experiences, whether it's through GreenLake or providing innovation; AI and GPUs as a service is something that we're gonna be, you know, continuing to provide our customers as we move along. >> Okay, and then we heard the data science angle and the data science community and the partner angle, that's exciting. >> Yeah, I mean, I think it's two approaches as well too. We have data scientists, right, so we're gonna bring that capability to bear, whether it's through the product experience or through a professional services organization; and then number two, you know, this is a very dynamic ecosystem from an application standpoint. There's commercial applications, there's certainly open source, and we're gonna bring a fully vetted, full-stack experience for our customers that they can feel confident in; this, you know, is a very dynamic space. >> Excellent, well thank you very much. >> Thank you. >> Alright, now it's your turn. Go into the crowd chat and start talking. Ask questions, we're gonna have polls, we've got experts in there, so let's crowd chat.

Published Date : May 7 2019


Stephan Fabel, Canonical | OpenStack Summit 2018


 

(upbeat music) >> Announcer: Live from Vancouver, Canada, it's theCUBE, covering OpenStack Summit North America 2018, brought to you by Red Hat, the OpenStack Foundation, and its ecosystem partners. >> Welcome back to theCUBE's coverage of OpenStack Summit 2018 in Vancouver. I'm Stu Miniman, with my cohost of the week, John Troyer. Happy to welcome back to the program Stephan Fabel, who is the Director of Ubuntu product and development at Canonical. Great to see you. >> Yeah, great to be here, thank you for having me. >> Alright, so, boy, there's so much going on at this show. We've been talking about doing more things and in more places, is the theme that the OpenStack Foundation put into place, and we had a great conversation with Mark Shuttleworth, and we're going to dig in a little bit deeper on some of the areas with you. >> Stephan: Okay, absolutely. >> So we have theCUBE, and we're going to go into all of the Kubernetes, Kubeflow, and all those other things that we'll mispronounce as we go. >> Stephan: Yes, yes, absolutely. >> What's your impression of the show, first of all? >> Well I think that it's really, you know, there's a consolidation going on, right? I mean, we really have the people who are serious about open infrastructure here, serious about OpenStack. They're serious about Kubernetes. They want to implement, and they want to implement at a speed that fits the agility of their business. They want to really move quick with the upstream release. I think the time for enterprise-hardening delays and inertia there is over. I think people are really looking at the core of OpenStack; that's mature, it's stable, it's time for us to kind of move, get going, get success early, get it soon, then grow. I think most of the enterprises, most of the customers we talk to, adopt that notion. >> One of the things that sometimes helps is, help us lay out the stack a little bit here, because we actually commented that some of the base infrastructure pieces we're not talking as much about because they're kind of mature, but OpenStack is very much at the infrastructure level: your compute, storage, and network you need to understand. But then when we start doing things like Kubernetes as well, I can either do it alongside, or on top of, and things like that, so give us your view as to what you'd put where, what Canonical's seeing, and what customers-- how you lay out that stack? >> I think you're right, I think there's a little bit of path-finding here that needs to be done on the Kubernetes side, but ultimately, I think it's going to really converge around OpenStack being operator-centric and operator-friendly, working and operating the infrastructure, scaling that out in a meaningful manner, providing multitenancy to all the different departments. Having Kubernetes be developer-centric really helps to onboard and accelerate the workload adoption of the next-gen initiatives, right? So, what we see is absolutely a use case for Kubernetes and OpenStack to work perfectly well together, be an extension of each other, possibly also sit next to each other without being too encumbering there. But I think that ultimately, having something like Kubernetes with container-based developer APIs that provide that orchestration layer is the next thing, and they run just perfectly fine on Canonical OpenStack.
It seems like we're, I'd almost say that we're, this is like, if we were a movie, we're in a sequel like AI-5; this time, it's real. I really do see real enterprise applications incorporating these technologies into the workflow for what otherwise might be kind of boring, you know, line of business; can you talk a little bit about where we are in this evolution? >> You mean, John, only since we've been talking about it since the mid-1800s, so yeah. >> I was just about to point that out; I mean, AI's not new, right? We've seen it for about 60 years. It's been around for quite some time. I think that there is an unprecedented amount of sponsorship of new startups in this area, in this space, and there's a reason why this is heating up. I think the reason why, ultimately, is because we're talking about a scale that's unprecedented, right? We thought the biggest problem we had with devices was going to be the IP addresses running out, and it turns out that's not true at all, right? At a certain scale, and at a certain distributed nature of your rollout, you're going to have to deal with just such complexity and interaction between the undercloud, the overcloud, the infrastructure, the developers. How do I roll this out? If I spin up 1000 VMs over here, why am I experiencing dropped calls over there? It's those types of things that need to be self-correlated. They need to be identified, they need to be worked out, so there's a whole operator angle just to be able to cope with that whole scenario. I think there's projects that are out there that are trying to ultimately address that, for example Acumos (mumbles). Then, there are, of course, the new applications, right? Smart cities, connected cars, all those car manufacturers who are, right now, faced with the problem: how do I deal with mobile, distributed inference rollout on the edge while still capturing the data continually, train my model, update, then again distribute out to the edge to get a better experience? How do I catch up to some of the market leaders here that are out there? As the established car manufacturers come to catch up and put more and more miles autonomously on the asphalt, we're going to basically have to deal with a whole lot more productization of machine-learning applications that just have to be managed at scale. And so we believe, and we're certainly in good company in that belief, that when it comes to managing large applications at scale, containers and Kubernetes are a great way to do that, right? They did that for web apps. They did that for the next generation of applications. This is one example where, with the right operators in mind, the right CRDs, the right frameworks on top of Kubernetes managed correctly, you are actually in a great position to just go to market with that. >> I wonder if you might have a customer example that might walk us through kind of where they are in this discussion; you talk to many companies, you know, the whole IoT even pieces were early in this. So what's actually real today, how much is planning, is this years we're talking before some of these really come to fruition?
>> So yeah, I can't name a customer, but I can say that every single car manufacturer we're talking to is absolutely interested in solving the operational problem of running machine-learning frameworks as a service, making sure those are up and running and up to speed at any given point in time, spinning them up in a multitenant fashion, making sure that the GPU enablement is actually done properly at all layers of the virtualization. These are real operational challenges that they're facing today, and they're looking to solve them with us. Pick any large car manufacturer you want. >> John: Nice. We're going down to something that I can type on my own keyboard then, and go to GitHub, right? One of the places to go to run the TensorFlow machine-learning framework on Kubernetes is Kubeflow, and you talked about that a little bit yesterday on stage; you want to talk about that maybe? >> Oh, absolutely, yes. That's the core of our current strategy right now. We're looking at Kubeflow as one of the key enablers of machine-learning frameworks as a service on top of Kubernetes, and I think it's a great example because it can really show how that as-a-service notion can be implemented on top of a virtualization platform, whether that be KVM, pure KVM on bare metal, or OpenStack, and actually provide machine-learning frameworks such as TensorFlow, PyTorch, Seldon Core. You have all those frameworks being supported, and then you can basically start mixing and matching. I think ultimately it's so interesting to us because the data scientists are really not the ones that are expected to manage all this, right? Yet they are at the core of having to interact with it. In the next generation of workloads, we're talking to PhDs and data scientists that have no interest whatsoever in understanding how all of this works on the back end, right? They just want to know, this is where I'm going to submit the artifact that I'm creating, and this is how it works in general. Companies pay them a lot of money to do just that, to just do the model, because that's where, until the right model is found, that is exactly where the value is. >> So Stephan, does Canonical go talk to the data scientists, or is there a class of operators who are facilitating the data scientists? >> Yes, we talk to the data scientists to understand their problems, we talk to the operators to understand their problems, and then we work with partners such as Google to try and find solutions to that. >> Great, what kind of conversations are you having here at the show? I can't imagine there's too many of those; great to hear if there are, but where are they? I think everybody here knows containers, very few know Kubernetes, and how far up the stack of building new stuff are they? >> You'd be surprised; I mean, we put this out there, and so far, I want to say the majority of the customer conversations we've had took an AI turn and said, this is what we're trying to do next year, this is what we're trying to do later in the year, this is what we're currently struggling with. So glad you have an approach, because otherwise we would spend a ton of time thinking about this, a ton of time trying to solve this in our own way that then gets us stuck in some deep end that we don't want to be in. So, help us understand this, help us pave the way. >> John: Nice, nice. I don't want to leave without also talking about MicroK8s; that's Kubernetes in a snap you can just download. Can we talk a little bit about that? >> Yeah, glad to.
>> John: Nice, nice. I don't want to leave without talking also about MicroK8s; that's Kubernetes in a snap you can just download. Can we talk a little bit about that? >> Yeah, glad to. This was an idea we conceived that came out of this notion of: alright, talking to a data scientist, if I do have a data scientist, where does he start? >> Stu: Does Kubernetes have a learning curve today? >> It does, yeah, it does. So here's the thing: as a developer, what options do you have right when you get started? You can either go out and get a cluster stood up on one of the public clouds, but what if you're on a plane, right? You don't have a connection; you want to work on your local laptop. Possibly that laptop also has a GPU, and you're a data scientist and you want to try this out, because you know you're going to submit this training job to a (mumbles) that runs on-prem behind the firewall with a limited training set, right? This is the situation we're talking about. So ultimately, the motivation for creating MicroK8s was we want to make this very, very equivalent. Now you can deploy Kubeflow on top of MicroK8s today, and it'll run just fine. You get your TensorBoard, you have your Jupyter notebook, and you can do your work, and you can do it in a fashion that will then be compatible with your on-prem and public machine-learning framework. So that was our original motivation for why we went down this road, but then we noticed, you know what, this is actually a wider need. People are thinking about local Kubernetes in many different ways. There are a couple of solutions out there. They tend to be cumbersome, or more cumbersome than developers would like. So we actually said, you know, maybe we should turn this into a more general-purpose solution. Hence, MicroK8s. It installs as a snap on your machine, you kick that off, and you have the Kubernetes API in under 30 seconds, or a little longer if your download speed is a factor; you enable DNS and you're good to go. >> Stephan, I just want to give you the opportunity: is there anything in the Queens release that your customers have been specifically waiting for, or any other product announcements before we wrap? >> Sure, we're very excited about the Queens release. We think the Queens release is one of the great examples of the maturity of the code base, and really the nod towards the operator. I think the big challenge back in the olden days of OpenStack was that it took a long time for the operators to be heard and to establish that conversation. We'd like to think, and to see, that OpenStack Queens has matured in that respect, and we like things like Octavia. We're very excited about (mumbles) as a service taking on a life of its own and being treated as a first-class citizen. I think it was a great decision of the community to get on that road. We're supporting it as part of our distribution. >> Alright, well, appreciate the update. Really fascinating to hear how everybody's thinking about it and really starting to move on all the ML and AI stuff. Alright, for John Troyer, I'm Stu Miniman. Lots more coverage here from OpenStack Summit 2018 in Vancouver. Thanks for watching theCUBE. (upbeat music)
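As a footnote to the MicroK8s quick start above: once the snap is installed and its kubeconfig has been exported for local use (an assumption in this sketch, since the exact export step depends on the release), a data scientist could sanity-check the laptop cluster from Python with the official Kubernetes client before layering Kubeflow on top.

    # Sanity-check a local MicroK8s cluster from Python, assuming its kubeconfig
    # has already been exported to the default location (~/.kube/config).
    from kubernetes import client, config

    config.load_kube_config()          # picks up the exported kubeconfig
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:  # typically a single node on a laptop
        print("node:", node.metadata.name)

    for pod in v1.list_pod_for_all_namespaces().items:
        print("pod:", pod.metadata.namespace, pod.metadata.name)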

Published Date : May 22 2018

Piotr Mierzejewski, IBM | Dataworks Summit EU 2018


 

>> Announcer: From Berlin, Germany, it's theCUBE covering Dataworks Summit Europe 2018, brought to you by Hortonworks. (upbeat music) >> Well hello, I'm James Kobielus and welcome to theCUBE. We are here at Dataworks Summit 2018, in Berlin, Germany. It's a great event. Hortonworks is the host; they made some great announcements. They've had partners doing the keynotes and the sessions, breakouts, and IBM is one of their big partners. Speaking of IBM, from IBM we have a program manager, Piotr, I'll get this right, Piotr Mierzejewski. Your focus is on data science, machine learning, and Data Science Experience, which is one of the IBM products for working data scientists to build and to train models in team data science enterprise operational environments. So Piotr, welcome to theCUBE. I don't think we've had you before. >> Thank you. >> You're a program manager. I'd like you to discuss what you do for IBM, I'd like you to discuss Data Science Experience. I know that Hortonworks is a reseller of Data Science Experience, so I'd like you to discuss the partnership going forward and how you and Hortonworks are serving your customers, data scientists and others in those teams who are building and training and deploying machine learning and deep learning, AI, into operational applications. So Piotr, I give it to you now. >> Thank you. Thank you for inviting me here, very excited. This is a very loaded question, and before I actually get to why the partnership makes sense, I would like to begin with two things. First, there is no machine learning without data. And second, machine learning is not easy. Especially, especially-- >> James: I never said it was! (Piotr laughs) >> Well, there is this kind of perception, like you can have a data scientist working on their Mac, working on some machine learning algorithms, and they can create a recommendation engine in, let's say, two or three days' time. This is because of the explosion of open source in that space. You have thousands of libraries, from Python, from R, from Scala; you have access to Spark. All these various open-source offerings are enabling data scientists to actually do this wonderful work. However, when you start talking about bringing machine learning to the enterprise, this is not an easy thing to do. You have to think about governance, resiliency, data access, and actual model deployments, which are not trivial when you have to expose this in a uniform fashion to various business units. Now, all this has to actually work in private cloud and public cloud environments, on a variety of hardware, on a variety of different operating systems. Now that is not trivial. (laughs) Now, when a data scientist is going to deploy a model, he needs to be able to actually explain how the model was created. He has to be able to explain what data was used. He needs to ensure-- >> Explicable AI, or explicable machine learning, yeah, that's a hot focus of concern for enterprises everywhere, especially in a world where governance and tracking and lineage, GDPR and so forth, are so hot. >> Yes, you've mentioned all the right things. Now, so given those two things, there's no ML without data, and ML is not easy, why does the partnership between Hortonworks and IBM make sense? Well, you're looking at the number one industry-leading big data platform from Hortonworks.
Then, you look at DSX Local, which, I'm proud to say, I've been there since the first line of code, and I feel very passionate about the product, is the merger between the two. The ability to integrate them tightly together gives your data scientists secure access to data, the ability to leverage the Spark that runs inside a Hortonworks cluster, the ability to work in a platform like DSX that doesn't limit you to just one kind of technology but allows you to work with multiple technologies, the ability to actually work on not only-- >> When you say technologies here, you're referring to frameworks like TensorFlow, and-- >> Precisely. Very good, now that part I'm going to get into very shortly, (laughs) so please don't steal my thunder. >> James: Okay. >> Now, what I was saying is that DSX and Hortonworks are integrated to the point that you can actually manage your Hadoop clusters, your Hadoop environments, within DSX; you can work on your Python models and your analytics within DSX and then push them remotely to be executed where your data is. Now, why is this important? If you work with data that's megabytes, gigabytes, maybe you can pull it in, but truly, when you move to the terabytes and the petabytes of data, what happens is that you actually have to push the analytics to where your data resides, and leverage, for example, YARN, the resource manager, to distribute your workloads and actually train your models on your HDP cluster. That's one of the huge value propositions. Now, mind you, this is all done in a secure fashion, with the ability to actually install DSX on the edge nodes of the HDP clusters. >> James: Hmm... >> As of HDP 2.6.4, DSX has been certified to actually work with HDP. Now, we embarked on this partnership about 10 months ago. It often happens that there are announcements, but then not much materializes after such an announcement. This is not true in the case of DSX and HDP. Just recently we had the release of DSX 1.2, which I'm super excited about. Now, let's talk about those open-source toolings and the various platforms. Now, you don't want to force your data scientists to work with just one environment. Some of them might prefer to work on Spark; some of them like their RStudio, they're statisticians, they like R; others like Python, with Zeppelin or, say, a Jupyter notebook. Now, how about TensorFlow? What are you going to do when you actually have to run deep learning workloads, when you want to use neural nets? Well, DSX does support the ability to bring in GPU nodes and do the TensorFlow training. With a sidecar approach, you can append a node, scale the platform horizontally and vertically, train your deep learning workloads, and then remove the sidecar. So you can add it to the cluster and remove it at will.
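As a rough sketch of the push-the-analytics-to-the-data pattern Piotr describes, here is the kind of Spark ML job that could be shipped off to run under YARN on an HDP cluster rather than pulling data into a notebook. The HDFS paths, column names, and choice of model are invented for the example.

    # Sketch of a Spark ML job meant to run where the data lives, e.g. submitted
    # to a YARN-managed HDP cluster. Paths and columns are illustrative only.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = (SparkSession.builder
             .appName("train-where-the-data-lives")
             .getOrCreate())  # in cluster mode, --master yarn is set at submit time

    # Read directly from HDFS instead of pulling terabytes to the laptop.
    df = spark.read.parquet("hdfs:///data/events.parquet")

    features = VectorAssembler(
        inputCols=["f1", "f2", "f3"], outputCol="features").transform(df)

    model = LogisticRegression(labelCol="label").fit(features)
    model.write().overwrite().save("hdfs:///models/events_lr")

    spark.stop()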
Now, DSX not only satisfies the needs of your programmer data scientists, who code in Python, Scala, or R; it also allows your business analysts to work and create models in a visual fashion. As of DSX 1.2, we have embedded, integrated, the SPSS Modeler, redesigned, rebranded. This is an amazing technology from IBM that's been around for a while, very well established, but now, with the new interface, embedded inside the DSX platform, it allows your business analysts to actually train and create models in a visual fashion and, what is beautiful-- >> Business analysts, not traditional data scientists. >> Not traditional data scientists. >> That sounds equivalent to how IBM, a few years back, was able to bring more of a visual experience to SPSS proper to enable the business analysts of the world to build and do data mining and so forth with structured data. Go ahead, I don't want to steal your thunder here. >> No, no, precisely. (laughs) >> But I see it's the same phenomenon: you bring the same capability to greatly expand the range of data professionals who can do, in this case, machine learning, hopefully as well as professional, dedicated data scientists. >> Certainly. Now, what we have to also understand is that data science is actually a team sport. It involves various stakeholders in the organization, from the executive who actually gives you the business use case, to your data engineers who actually understand where your data is and can grant access-- >> James: They manage the Hadoop clusters, many of them, yeah. >> Precisely. So they manage the Hadoop clusters, they manage your relational databases, because we have to realize that not all the data is in the data lakes yet. You have legacy systems, which DSX allows you to actually connect to and integrate to get data from. It also allows you to consume data from streaming sources, so if you have a Kafka message bus and you're streaming data from your applications or IoT devices, you can integrate all those various data sources and federate them within DSX to use for training machine learning models. Now, this is all around predictive analytics. But what if I tell you that right now with DSX you can actually do prescriptive analytics as well? With 1.2, again, I'm coming back to this DSX 1.2, with the most recent release we have actually added Decision Optimization, an industry-leading solution from IBM-- >> Prescriptive analytics, gotcha-- >> Yes, for prescriptive analysis. So now if you have warehouses, or you have a fleet of trucks, or you want to optimize the flow in, let's say, a utility company, whether it be for power or, let's say, for water, you can actually create and train prescriptive models within DSX and deploy them in the same fashion as you deploy and manage your SPSS streams, as well as the machine learning models from Spark, from Python, with XGBoost, TensorFlow, Keras, all those various aspects. >> James: Mmmhmm.
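For a feel of what a prescriptive model looks like in code, here is a tiny sketch using IBM's docplex modeling library with an invented two-route trucking scenario. The costs, capacities, and demand are made up, and this illustrates the general idea rather than how Decision Optimization is surfaced inside DSX.

    # Tiny prescriptive-analytics sketch with IBM's docplex (CPLEX) library.
    # The fleet scenario, costs, and capacities are invented for illustration.
    from docplex.mp.model import Model

    m = Model(name="truck_routing")

    # Decision variables: truckloads sent down each of two routes.
    x_a = m.integer_var(name="loads_route_a")
    x_b = m.integer_var(name="loads_route_b")

    # Demand must be met, and each route has limited capacity.
    m.add_constraint(x_a + x_b >= 90, "meet_demand")
    m.add_constraint(x_a <= 60, "route_a_capacity")
    m.add_constraint(x_b <= 50, "route_b_capacity")

    # Route A is cheaper per load than route B; minimize total cost.
    m.minimize(120 * x_a + 150 * x_b)

    solution = m.solve()
    if solution:
        print("route A loads:", x_a.solution_value)
        print("route B loads:", x_b.solution_value)
        print("total cost:", m.objective_value)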
>> Now, what's going to get really exciting in the next two months: DSX will actually bring in natural language processing and text and sentiment analysis via Watson Explorer. So Watson Explorer, it's another offering from IBM... >> James: It's called, what is the name of it? >> Watson Explorer. >> Oh, Watson Explorer, yes. >> Watson Explorer, yes. >> So now you're going to have this collaborative platform, extendable! An extendable, collaborative platform that can actually install and run in your data centers without the need to access the internet. That's actually critical. Yes, we can deploy on AWS. Yes, we can deploy on Azure. On Google Cloud, definitely; we can deploy on SoftLayer, and we're very good at that. However, in the majority of cases we find that customers have challenges bringing their data out to the cloud environments. Hence, with DSX, we designed it to deploy and run and scale everywhere. Now, how have we done it? We've embraced open source. This was a huge shift within IBM, to realize that yes, we do have 350,000 employees, yes, we could develop container technologies, but why? Why not embrace what have actually become industry standards, with Docker and its equivalents? Bring in RStudio, Jupyter, the Zeppelin notebooks; bring in the ability for a data scientist to choose the environments they want to work with and actually extend them, and make the deployments of web services, applications, and models. And those are actually full releases; I'm not only talking about the model, I'm talking about the scripts that go with it, the ability to actually pull the data in and allow the models to be retrained, evaluated, and redeployed without taking them down. Now, that's what is the true differentiator when it comes to DSX, and all done in either your public or private cloud environments. >> So that's coming in the next version of DSX? >> Outside of DSX-- >> James: We're almost out of time, so-- >> Oh, I'm so sorry! >> No, no, no. It's my job as the host to let you know that. >> Of course. (laughs) >> So if you could summarize where DSX is going in 30 seconds or less as a product, the next version is, what is it? >> It's going to be the 1.2.1. >> James: Okay. >> 1.2.1, and we're expecting to release at the end of June. What's going to be unique in the 1.2.1 is infusing the text and sentiment analysis, so natural language processing, with predictive and prescriptive analysis, for both developers and your business analysts. >> James: Yes. >> So essentially a platform not only for your data scientists but pretty much every single persona inside the organization. >> Including your marketing professionals who are baking sentiment analysis into what they do. Thank you very much. This has been Piotr Mierzejewski of IBM. He's a program manager for DSX and for ML, AI, and data science solutions, and of course a strong partnership is with Hortonworks. We're here at Dataworks Summit in Berlin. We've had two excellent days of conversations with industry experts including Piotr. We want to thank everyone. We want to thank the host of this event, Hortonworks, for having us here. We want to thank all of our guests, all these experts, for sharing their time out of their busy schedules. We want to thank everybody at this event for all the fascinating conversations; the breakouts have been great, the whole buzz here is exciting. GDPR's coming down and everybody's gearing up and getting ready for that, but everybody's also focused on innovative and disruptive uses of AI and machine learning in business, and using tools like DSX. I'm James Kobielus for the entire CUBE team, SiliconANGLE Media, wishing you all, wherever you are, whenever you watch this, have a good day, and thank you for watching theCUBE. (upbeat music)

Published Date : Apr 19 2018

Alan Gates, Hortonworks | Dataworks Summit 2018


 

(techno music) >> (announcer) From Berlin, Germany, it's theCUBE covering DataWorks Summit Europe 2018. Brought to you by Hortonworks. >> Well hello, welcome to theCUBE. We're here on day two of DataWorks Summit 2018 in Berlin, Germany. I'm James Kobielus. I'm lead analyst for Big Data Analytics in the Wikibon team of SiliconANGLE Media. And who we have here today, we have Alan Gates, who's one of the founders of Hortonworks, and Hortonworks of course is the host of DataWorks Summit, and he's going to be, well, hello Alan. Welcome to theCUBE. >> Hello, thank you. >> Yeah, so Alan, so you and I go way back. Essentially, what we'd like you to do first of all is just explain a little bit of the genesis of Hortonworks. Where it came from, your role as a founder from the beginning, how that's evolved over time, but really how the company has evolved specifically with its focus on the community, the Hadoop community, the open source community. You have a deepening open source stack that you build upon, with Atlas and Ranger and so forth. Give us a sense for all of that, Alan. >> Sure. So as I think it's well-known, we started as the team at Yahoo that really was driving a lot of the development of Hadoop. We were one of the major players in the Hadoop community. I was in that team for four years; I think the team itself was going for about five. And it became clear that there was an opportunity to build a business around this. Some others had already started to do so. We wanted to participate in that. We worked with Yahoo to spin out Hortonworks, and actually they were a great partner in that; they helped us get that spun out. And the leadership team of the Hadoop team at Yahoo became the founders of Hortonworks and brought along a bunch of the other engineers to help get started. And really at the beginning, it was Hadoop, Pig, Hive, HBase, you know, a few of the beginning projects. So a pretty small toolkit. And our early customers were very engineering-heavy people, companies who knew how to take those tools and build something directly on those tools, right? >> Well, you started off, the Hadoop community as a whole started off, with a focus on the data engineers of the world >> Yes. >> And I think it's shifted, and confirm for me, over time, that you focus increasingly with your solutions on the data scientists who are doing the development of the applications, and the data stewards, from what I can see at this show. >> I think it's really just a part of the adoption curve, right? When you're early on that curve, you have people who are very into the technology, understand how it works, and want to dive in there. So those tend to be, as you said, the data engineering types in this space. As that curve grows out, it comes wider and wider. There's still plenty of data engineers that are our customers, that are working with us, but as you said, the data analysts, the BI people, data scientists, data stewards, all those people are now starting to adopt it as well. And they need different tools than the data engineers do. They don't want to sit down and write Java code, or, you know, some of the data scientists might want to work in Python in a notebook like Zeppelin or Jupyter, but some may want to use SQL or even Tableau or something on top of SQL to do the presentation. Of course, data stewards want tools more like Atlas to help manage all their stuff.
So that does drive us to, one, put more things into the toolkit, so you see the addition of projects like Apache Atlas and Ranger for security and all that. Another area of growth, I would say, is also the kind of data that we're focused on. So early on, we were focused on data at rest. You know, we're going to store all this stuff in HDFS. And as the kind of data scene has evolved, there's a lot more focus now on a couple things. One is what we call data-in-motion, for our HDF product, where you've got a stream manager like Kafka or something like that >> (James) Right >> So there's processing that kind of data. But now we also see a lot of data in various places. It's not just, oh, okay, I have a Hadoop cluster on premise at my company. I might have some here, some on premise somewhere else, and I might have it in several clouds as well. >> OK, your focus has shifted, like the industry in general, towards streaming data in multi-clouds, where it's more stateful interactions and so forth? I think you've made investments in Apache NiFi, so >> (Alan) yes. >> Give us a sense for your NiFi versus Kafka and so forth inside of your product strategy, or your >> Sure. So NiFi is really focused on that data at the edge, right? So you're bringing data in from sensors, connected cars, airplane engines, all those sorts of things that are out there generating data, and you need to figure out what parts of the data to move upstream and what parts not to. What processing can I do here so that I don't have to move it upstream? When I have an error event or a warning event, can I turn up the amount of data I'm sending in, right? Say this airplane engine is suddenly heating up, maybe a little more than it's supposed to. Maybe I should ship more of the logs upstream when the plane lands and connects than I would otherwise. That's the kind o' thing that Apache NiFi focuses on. I'm not saying it runs in all those places, but my point is, it's that kind o' edge processing. Kafka is still going to be running in a data center somewhere. It's still a pretty heavyweight technology in terms of memory and disk space and all that, so it's not going to be run on some sensor somewhere. But it is that data-in-motion, right? I've got millions of events streaming through a set of Kafka topics, watching all that sensor data that's coming in from NiFi and reacting to it, maybe putting some of it in the data warehouse for later analysis, all those sorts of things. So that's kind o' the differentiation there between Kafka and NiFi.
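To make the edge pattern concrete, here is a rough Python sketch of the adaptive-sampling logic Alan describes: forward a trickle of telemetry in normal operation, and everything once a reading crosses a warning threshold. This illustrates the pattern only; it is not NiFi configuration, and the thresholds, field names, and send function are invented.

    # Illustration of the edge pattern described above: forward only a sample
    # of readings in normal operation, but everything while the device runs hot.
    # Thresholds, field names, and the send stub are invented for the sketch.
    import random

    TEMP_WARN_C = 650          # hypothetical engine-temperature warning level
    NORMAL_SAMPLE_RATE = 0.01  # ship ~1% of readings upstream when healthy

    def send_upstream(reading):
        # Stand-in for whatever actually moves data off the device.
        print("shipping:", reading)

    def handle(reading):
        if reading["temp_c"] >= TEMP_WARN_C:
            # Warning condition: turn up the volume and ship everything.
            send_upstream(reading)
        elif random.random() < NORMAL_SAMPLE_RATE:
            # Healthy: forward a small sample so upstream still sees a baseline.
            send_upstream(reading)

    for temp in (610, 612, 655, 662, 615):
        handle({"engine_id": "e-42", "temp_c": temp})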
>> Right, right, right. So, going forward, do you see more of your customers working on internet of things projects? We don't often, at least in the popular mind of the industry, associate Hortonworks with edge computing and so forth. Is that? >> I think that we will have more and more customers in that space. I mean, our goal is to help our customers with their data wherever it is. >> (James) Yeah. >> When it's on the edge, when it's in the data center, when it's moving in between, when it's in the cloud. All those places, that's where we want to help our customers store and process their data. Right? So, I wouldn't want to say that we're going to focus on just the edge or the internet of things, but that certainly has to be part of our strategy, 'cause it has to be part of what our customers are doing. >> When I think about the Hortonworks community, now we have to broaden our understanding, because you have a tight partnership with IBM, which obviously is well-established, huge, and global. Give us a sense for, as you guys have teamed more closely with IBM, how your community has changed or broadened or shifted in its focus, or has it? >> I don't know that it's shifted the focus. I mean, IBM was already part of the Hadoop community. They were already contributing. Obviously, they've contributed very heavily on projects like Spark and some of those. They continue some of that contribution. So I wouldn't say that it's shifted it; it's just we are working more closely together as we both contribute to those communities, working more closely together to present solutions to our mutual customer base. But I wouldn't say it's really shifted the focus for us. >> Right, right. Now at this show, we're in Europe right now, but it doesn't matter that we're in Europe. GDPR is coming down fast and furious now. Data Steward Studio, we had the demonstration today; it was announced yesterday. And it looks like a really good tool for the main requirements for compliance, which are to discover and inventory your data and really to set up what I like to refer to as a consent portal, so the data subject can then go and make a request to have their data forgotten, and so forth. Give us a sense, going forward, for how or if Hortonworks, IBM, and others in your community are going to work towards greater standardization in the functional capabilities of the tools and platforms for enabling GDPR compliance. 'Cause it seems to me that the industry is going to need to have some reference architecture for these kinds of capabilities, so that going forward, your ecosystem of partners can build add-on tools in some common framework, like the one that was laid out today, which looks like a good basis. Is there anything that you're doing in terms of pushing towards more open source standardization in that area? >> Yes, there is. So actually, one of my responsibilities is the technical management of our relationship with ODPI, which >> (James) yes. >> Mandy Chessell referenced yesterday in her keynote, and that is where we're working with IBM, with ING, with other companies to build exactly those standards. Right? Because we do want to build it around Apache Atlas. We feel like that's a good tool for the basis of that, but we know, one, that some people are going to want to bring their own tools to it. They're not necessarily going to want to use that one platform, so we want to do it in an open way that they can still plug in their metadata repositories and communicate with others, and we want to build the standards on top of that of how do you properly implement these features that GDPR requires, like the right to be forgotten, like, you know, what are the protocols around PII data? How do you prevent a breach? How do you respond to a breach?
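For a sense of how tooling could lean on Apache Atlas for this, here is a small sketch that calls Atlas's v2 basic-search REST endpoint to list entities carrying a PII classification, the sort of lookup a right-to-be-forgotten workflow might start with. The host, credentials, and the "PII" tag name are assumptions for the example, not anything prescribed by ODPI.

    # Sketch: ask Apache Atlas for entities carrying a PII classification.
    # The host, credentials, and "PII" tag name are assumptions for this example.
    import requests

    ATLAS = "http://atlas.example.com:21000"  # hypothetical Atlas endpoint

    resp = requests.get(
        f"{ATLAS}/api/atlas/v2/search/basic",
        params={"classification": "PII", "typeName": "hive_table"},
        auth=("admin", "admin"),
    )
    resp.raise_for_status()

    for entity in resp.json().get("entities", []):
        print(entity["typeName"], entity["attributes"].get("qualifiedName"))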
>> Will that all be under the umbrella of ODPI, that initiative of the partnership, or will it be a separate group, or? >> Well, so certainly Apache Atlas is part of Apache and remains so. What ODPI is really focused on is that next layer up of how do we engage, not the programmers, 'cause programmers can engage really well at the Apache level, but the next level up. We want to engage the data professionals, the people whose job it is, the compliance officers. The people who don't sit and write code, and frankly, if you connect them to the engineers, there's just going to be an impedance mismatch in that conversation. >> You got policy wonks and you got tech wonks, so. They understand each other at the wonk level. >> That's a good way to put it. And so that's where ODPI is really coming in: it's that group of compliance people that speak a completely different language. But we still need to get them all talking to each other, as you said, so that there are specifications around: how do we do this? And what is compliance? >> Well, Alan, thank you very much. We're at the end of our time for this segment. This has been great. It's been great to catch up with you, and Hortonworks has been evolving very rapidly, and it seems to me that, going forward, I think you're well-positioned now for the new GDPR age to take your overall solution portfolio, your partnerships, and your capabilities to the next level, really in terms of an open source framework. In many ways, though, you're not entirely 100%, like nobody is, purely open source. You're still very much focused on open frameworks for building very scalable solutions for enterprise deployment. Well, this has been Jim Kobielus with Alan Gates of Hortonworks, here on theCUBE at DataWorks Summit 2018 in Berlin. We'll be back fairly quickly with another guest, and thank you very much for watching our segment. (techno music)

Published Date : Apr 19 2018

Data Science for All: It's a Whole New Game


 

>> There's a movement that's sweeping across businesses everywhere, here in this country and around the world. And it's all about data. Today, businesses are being inundated with data, to the tune of over two and a half million gigabytes that'll be generated in the next 60 seconds alone. What do you do with all that data? To extract insights, you typically turn to a data scientist. But not necessarily anymore. At least not exclusively. Today, the ability to extract value from data is becoming a shared mission, a team effort that spans the organization, extending far more widely than ever before. Today, data science is being democratized. >> Data Science for All: It's a Whole New Game. >> Welcome everyone, I'm Katie Linendoll. I'm a technology expert and writer, and I love reporting on all things tech. My fascination with tech started very young. I began coding when I was 12, received my networking certs by 18, and got a degree in IT and new media from Rochester Institute of Technology. So as you can tell, technology has always been a true passion of mine. Having grown up in the digital age, I love having a career that keeps me at the forefront of science and technology innovations. I spend equal time in the field being hands-on as I do on my laptop conducting in-depth research. Whether I'm diving underwater with NASA astronauts, witnessing the new ways in which mobile technology can help rebuild the Philippines' economy in the wake of super typhoons, or sharing a first look at the newest iPhones on The Today Show, yesterday, I'm always on the hunt for the latest and greatest tech stories. And that's what brought me here. I'll be your host for the next hour as we explore the new phenomenon that is taking businesses around the world by storm, as data science continues to become democratized and extends beyond the domain of the data scientist, and why there's also a mandate for all of us to become data literate now that data science for all drives our AI culture. We're going to be able to take to the streets and go behind the scenes as we uncover the factors that are fueling this phenomenon and giving rise to a movement that is reshaping how businesses leverage data and putting organizations on the road to AI. So coming up, I'll be doing interviews with data scientists. We'll see real-world demos and take a look at how IBM is changing the game with an open data science platform. We'll also be joined by legendary statistician Nate Silver, founder and editor-in-chief of FiveThirtyEight, who will shed light on how a data-driven mindset is changing everything from business to our culture. We also have a few people who are joining us in our studio, so thank you guys for joining us. Come on, I can do better than that, right? Live studio audience, the fun stuff. And for all of you during the program, I want to remind you to join the conversation on social media using the hashtag DSforAll; it's data science for all. Share your thoughts on what data science and AI mean to you and your business. And let's dive into a whole new game of data science. Now I'd like to welcome my co-host, General Manager of IBM Analytics, Rob Thomas. >> Hello, Katie. >> Come on guys. >> Yeah, seriously. >> No one's allowed to be quiet during this show, okay? >> Right. >> Or I'll start calling people out. So Rob, thank you so much. I think you know this conversation, we're calling it a data explosion happening right now. And it's nothing new. And when you and I chatted about it, you've been talking about this for years.
You have to ask: is this old news at this point? >> Yeah, I mean, well, first of all, the data explosion is not coming, it's here. And everybody's in the middle of it right now. What is different is the economics have changed, and the scale and complexity of the data that organizations are having to deal with have changed. And to this day, 80% of the data in the world still sits behind corporate firewalls. So, that's becoming a problem. It's becoming unmanageable. IT struggles to manage it. The business can't get everything they need. Consumers can't consume it when they want. So we have a challenge here. >> It's challenging in the world of unmanageable. Crazy complexity. If I'm sitting here as an IT manager of my business, I'm probably thinking to myself, this is incredibly frustrating. How in the world am I going to get control of all this data? And it's probably not just me thinking it. Many individuals here as well. >> Yeah, indeed. Everybody's thinking about how am I going to put data to work in my organization in a way I haven't done before. Look, you've got to have the right expertise, the right tools. The other thing that's happening in the market right now is clients are dealing with multi-cloud environments. So data behind the firewall in private cloud, multiple public clouds. And they have to find a way: how am I going to pull meaning out of this data? And that brings us to data science and AI. That's how you get there. >> I understand the data science part, but I think we're all starting to hear more about AI. And it's incredible that this buzzword is happening. How do businesses adapt to this AI growth and boom and trend that's happening in this world right now? >> Well, let me define it this way. Data science is a discipline. And machine learning is one technique. And then AI puts machine learning into practice and applies it to the business. So this is really about getting your business where it needs to go. And to get to an AI future, you have to lay a data foundation today. I love the phrase, "there's no AI without IA." That means you're not going to get to AI unless you have the right information architecture to start with. >> Can you elaborate though, in terms of how businesses can really adopt AI and get started? >> Look, I think there are four things you have to do if you're serious about AI. One is you need a strategy for data acquisition. Two is you need a modern data architecture. Three is you need pervasive automation. And four is you've got to expand job roles in the organization. >> Data acquisition. First pillar in this you just discussed. Can we start there and explain why it's so critical in this process? >> Yeah, so let's think about how data acquisition has evolved through the years. 15 years ago, data acquisition was about how do I get data in and out of my ERP system? And that was pretty much solved. Then the mobile revolution happens. And suddenly you've got structured and unstructured data, more than you've ever dealt with. And now you get to where we are today. You're talking terabytes, petabytes of data. >> [Katie] Yottabytes, I heard that word the other day. >> I heard that too. >> Didn't even know what it meant. >> You know how many zeros that is? >> I thought we were in Star Wars. >> Yeah, I think it's a lot of zeroes. >> Yodabytes, it's new. >> So, it's becoming more and more complex in terms of how you acquire data. So that's the new data landscape that every client is dealing with.
And if you don't have a strategy for how you acquire that and manage it, you're not going to get to that AI future. >> So, a natural segue: if you are one of these businesses, how do you build for the data landscape? >> Yeah, so the question I always hear from customers is, we need to evolve our data architecture to be ready for AI. And the way I think about that is it's really about moving from static data repositories to more of a fluid data layer. >> And we continue with the architecture. New data architecture is an interesting buzzword to hear. But it's also one of the four pillars. So if you could dive in there. >> Yeah, I mean, it's a new twist on what I would call some core data science concepts. For example, you have to leverage tools with a modern, centralized data warehouse. But your data warehouse can't be stagnant, limited to just what's right there. So you need a way to federate data across different environments. You need to be able to bring your analytics to the data, because it's most efficient that way. And ultimately, it's about building an optimized data platform that is designed for data science and AI. Which means it has to be a lot more flexible than what clients have had in the past. >> All right. So we've laid out what you need for driving automation. But where does the machine learning kick in? >> Machine learning is what gives you the ability to automate tasks. And I think about machine learning as being about predicting and automating. And this will really change the roles of data professionals and IT professionals. For example, a data scientist cannot possibly know every algorithm or every model that they could use. So we can automate the process of algorithm selection. Another example is things like automated data matching, or metadata creation. Some of these things may not be exciting, but they're hugely practical. And so when you think about the real use cases that are driving return on investment today, it's things like that. It's automating the mundane tasks. >> Let's go ahead and come back to something that you mentioned earlier, because it's fascinating to be talking about this AI journey, but also significant are the new job roles. And what are those other participants in the analytics pipeline? >> Yeah, I think we're just at the start of this idea of new job roles. We have data scientists. We have data engineers. Now you see machine learning engineers. Application developers. What's really happening is that data scientists are no longer allowed to work in their own silo. And so the new job roles are about how does everybody have data first in their mind? And then they're using tools to automate data science, to automate building machine learning into applications. So roles are going to change dramatically in organizations. >> I think that's confusing though, because we have several organizations asking: is this for highly specialized roles, just for data science? Or is it applicable to everybody across the board? >> Yeah, and that's the big question, right? Cause everybody's thinking, how will this apply? Do I want this to be just a small set of people in the organization that will do this? But our view is data science has to be for everybody. It's about bringing data science to everybody as a shared mission across the organization. Everybody in the company has to be data literate and participate in this journey.
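As a toy illustration of the automated algorithm selection Rob mentioned a moment ago, here is a Python sketch with scikit-learn that cross-validates a few candidate model families and keeps the best one. The candidate list and the synthetic data are made up for the example.

    # Toy sketch of automated algorithm selection: cross-validate a few model
    # families and keep the winner. Candidates and data are illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "random_forest": RandomForestClassifier(random_state=0),
    }

    scores = {name: cross_val_score(model, X, y, cv=5).mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    print(scores)
    print("selected:", best)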
>> So overall, it's a group effort, it has to be a common goal, and we all need to be data literate across the board. >> Absolutely. >> Done deal. But at the end of the day, it's kind of not an easy task. >> It's not. It's not easy, but it's maybe not as big of a shift as you would think. Because you have to put data in the hands of people that can do something with it. So, it's very basic. Give access to data. Data's often locked up in a lot of organizations today. Give people the right tools. Embrace the idea of choice or diversity in terms of those tools. That gets you started on this path. >> It's interesting to hear you say essentially you need to train everyone, though, across the board when it comes to data literacy. And I think people that are coming into the workforce don't necessarily have a background or a degree in data science. So how do you manage? >> Yeah, so in many cases that's true. I will tell you some universities are doing amazing work here. One example: University of California, Berkeley. They offer a course for all majors. So no matter what you're majoring in, you have a course on foundations of data science. How do you bring data science to every role? So it's starting to happen. We at IBM provide data science courses through CognitiveClass.ai. It's for everybody. It's free. And look, if you want to get your hands on code and just dive right in, you go to datascience.ibm.com. The key point is this, though. It's more about attitude than it is aptitude. I think anybody can figure this out. But it's about the attitude to say we're putting data first and we're going to figure out how to make this real in our organization. >> I also have to give a shout-out to my alma mater, because I have heard that there is an offering of an MS in data analytics. And they are always on the forefront of new technologies and new majors and on trend. And I've heard that the placement behind those jobs, for people graduating with the MS, is high. >> I'm sure it's very high. >> So go Tigers. All right, tangential. Let me get back to something else you touched on earlier, because you mentioned that a number of customers ask you how in the world do I get started with AI? It's an overwhelming question. Where do you even begin? What do you tell them? >> Yeah, well, things are moving really fast. But the good thing is most organizations I see, they're already on the path, even if they don't know it. They might have a BI practice in place. They've got data warehouses. They've got data lakes. Let me give you an example. AMC Networks. They produce a lot of the shows that I'm sure you watch, Katie. >> [Katie] Yes, Breaking Bad, Walking Dead, any fans? >> [Rob] Yeah, we've got a few. >> [Katie] Well, you taught me something I didn't even know. Because it's amazing how we have all these different industries, but yet media in itself is impacted too. And this is a good example. >> Absolutely. So, AMC Networks, think about it. They've got ads to place. They want to track viewer behavior. What do people like? What do they dislike? So they have to optimize every aspect of their business, from marketing campaigns to promotions to scheduling to ads. And their goal was to transform data into business insights and really take the burden off of their IT team, which was heavily burdened by obviously a huge increase in data. So their VP of BI took the approach of using machine learning to process large volumes of data. They used a platform that was designed for AI and data processing. It's the IBM analytics system, where it's a data warehouse, data science tools are built in. It has in-memory data processing. And just like that, they were ready for AI.
And they're already seeing that impact in their business. >> Do you think a movement of that nature kind of presses other media conglomerates and organizations to say, we need to be doing this too? >> I think it's inevitable: everybody is either going to be leading, or you'll be playing catch-up. And so, as we talk to clients, we think about how do you start down this path now, even if you have to iterate over time? Because otherwise you're going to wake up and you're going to be behind. >> One thing worth noting is we've talked about bringing analytics to the data. It's analytics first to the data, not the other way around. >> Right. So, look. We as a practice, we say you want to bring analytics to where the data sits, because it's a lot more efficient that way. It gets you better outcomes in terms of how you train models, and it's more efficient. And we think that leads to better outcomes. Other organizations will say, "Hey, move the data around," and everything becomes a big data-movement exercise. But once an organization has started down this path, they're starting to get predictions, they want to do it where it's really easy. And that means analytics applied right where the data sits. >> And it's worth talking about the role of the data scientist in all of this. It's been called the hot job of the decade. And the Harvard Business Review even dubbed it the sexiest job of the 21st century. >> Yes. >> I want to see this on the cover of Vogue. Like I want to see the first data scientist, female preferred, on the cover of Vogue. That would be amazing. >> Perhaps you can. >> People agree. So what changes for them? Is this challenging in terms of, we talk data science for all. Is it data science for everyone? And how does it change everything? >> Well, I think of it this way. AI gives software superpowers. It really does. It changes the nature of software. And at the center of that are data scientists. So, a data scientist has a set of powers that they've never had before in any organization. And that's why it's a hot profession. Now, on one hand, this has been around for a while. We've had actuaries. We've had statisticians that have really transformed industries. But there are a few things that are new now. We have new tools. New languages. Broader recognition of this need. And while it's important to recognize this critical skill set, you can't just limit it to a few people. This is about scaling it across the organization and truly making it accessible to all. >> So then do we need more data scientists? Or is this something you train, like you said, across the board? >> Well, I think you want to do a little bit of both. We want more. But we can also train more and make the ones we have more productive. The way I think about it is there are kind of two markets here. And we call it clickers and coders. >> [Katie] I like that. That's good. >> So, let's talk about what that means. Clickers are basically somebody that wants to use tools: create models visually, drag and drop, something that's very intuitive. Those are the clickers. Nothing wrong with that. It's been valuable for years. Then there's a new crop of data scientists. They want to code. They want to build with the latest open source tools. They want to write in Python or R. These are the coders. And both approaches are viable. Both approaches are critical. Organizations have to have a way to meet the needs of both of those types.
And there are not a lot of things available today that do that. >> Well, let's keep going on that. Because I hear you talking about the data scientist's role and how it's critical to success, but with the new tools, data science and analytics skills can extend beyond the domain of just the data scientist. >> That's right. So look, we're unifying coders and clickers into a single platform, which we call IBM Data Science Experience. And as the demand for data science expertise grows, so does the need for these kinds of tools, to bring them into the same environment. And my view is if you have the right platform, it enables the organization to collaborate. And suddenly you've changed the nature of data science from an individual sport to a team sport. >> So as somebody whose background is in IT, the question really is: is this an additional piece of what IT needs to do in 2017 and beyond? Or is it just another line item in the budget? >> So I'm afraid that some people might view it that way, as just another line item. But I would challenge that and say data science is going to reinvent IT. It's going to change the nature of IT. And every organization needs to think about what are the skills that are critical? How do we engage a broader team to do this? Because once they get there, this is the chance to reinvent how they're performing IT. >> [Katie] Challenging or not? >> Look, it's all a big challenge. Think about everything IT organizations have been through. Some of them were late to things like mobile, but then they caught up. Some were late to cloud, but then they caught up. I would just urge people, don't be late to data science. Use this as your chance to reinvent IT. Start with this notion of clickers and coders. This is a seminal moment, much like mobile and cloud were. So don't be late. >> And I think it's critical because it could be so costly to wait. And Rob and I were even chatting earlier about how data analytics is just moving into all different kinds of industries. And I can tell you, even personally, being affected by how important the analysis is in working in pediatric cancer for the last seven years. I personally implement virtual reality headsets in pediatric cancer hospitals across the country. And it's great. And it's working phenomenally. And the kids are amazed. And the staff is amazed. But phase two of this project is putting little metrics in the hardware that gather the breathing, the heart rate, to show that we have data. Proof that we can hand over to the hospitals to continue making this program a success. So just in-- >> That's a great example. >> An interesting example. >> Saving lives? >> Yes. >> That's also applying a lot of what we talked about. >> Exciting stuff in the world of data science. >> Yes. Look, I'd just add this is an existential moment for every organization. Because what you do in this area is probably going to define how competitive you are going forward. And think about it: if you don't do something, what if one of your competitors goes and creates an application that's more engaging with clients? So my recommendation is start small. Experiment. Learn. Iterate on projects. Define the business outcomes. Then scale up. It's very doable. But you've got to take the first step. >> First step always critical. And now we're going to get to the fun, hands-on part of our story. Because in just a moment we're going to take a closer look at what data science can deliver, and where organizations are trying to get to. All right.
Thank you, Rob, and now we've been joined by Siva Anne, who is going to help us navigate this demo. First, welcome, Siva. Give him a big round of applause. Yeah. All right, Rob, break down what we're going to be looking at. You take over this demo. >> All right. So this is going to be pretty interesting. So Siva is going to take us through. He's going to play the role of a financial adviser who wants to help better serve clients through recommendations. And I'm going to really illustrate three things. One is how do you federate data from multiple data sources, inside the firewall, outside the firewall? How do you apply machine learning to predict and to automate? And then how do you move analytics closer to your data? So, what you're seeing here is a custom application for an investment firm. So, Siva, our financial adviser, welcome. So you can see at the top, we've got market data. We pulled that from an external source. And then we've got Siva's calendar in the middle. He's got clients on the right side. So page down, what else do you see down there, Siva? >> [Siva] I can see the recent market news. And in here I can see that JP Morgan is calling for a US dollar rebound in the second half of the year. And I have an upcoming meeting with Leo Rakes. I can get-- >> [Rob] So let's go in there. Why don't you click on Leo Rakes. So, you're sitting at your desk, you're deciding how you're going to spend the day. You know you have a meeting with Leo. So you click on it. You immediately see, all right, so what do we know about him? We've got data governance implemented. So we know his age, we know his degree. We can see he's not that aggressive of a trader. Only six trades in the last few years. But then where it gets interesting is you go to the bottom. You start to see predicted industry affinity. Where did that come from? How do we have that? >> [Siva] So these green lines and red arrows here indicate the trending affinity of Leo Rakes for particular industry stocks. What we've done here is we've built machine learning models using the customer's demographic data, his stock portfolios, and browsing behavior to build a model which can predict his affinity for a particular industry.
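A minimal sketch of the kind of affinity model Siva describes, using scikit-learn on invented demographic and browsing features. The real system's features, labels, and scale are not shown in the demo, so everything here is illustrative.

    # Sketch of an industry-affinity model like the one in the demo: predict a
    # customer's affinity for auto stocks from demographics and browsing signals.
    # Features, labels, and data are invented for illustration.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Columns: age, trades_per_year, auto_page_views, finance_page_views
    X = np.array([
        [34, 2, 40, 5],
        [51, 12, 3, 60],
        [42, 6, 25, 10],
        [29, 1, 50, 2],
        [60, 20, 1, 80],
        [45, 4, 30, 8],
    ])
    y = np.array([1, 0, 1, 1, 0, 1])  # 1 = showed affinity for auto stocks

    model = LogisticRegression().fit(X, y)

    leo = np.array([[38, 6, 45, 4]])  # a Leo-like profile
    print("auto affinity score:", model.predict_proba(leo)[0, 1])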
And what I see here instantly is that Honda has got a negative correlation with Ferrari, which makes it a perfect mix for his stock portfolio. Given he has an affinity for auto stocks and it correlates negatively with Ferrari. >> [Rob] These are very powerful tools in the hands of a financial adviser. You think about it. As a financial adviser, you wouldn't think about federating data, machine learning, pretty powerful. >> [Siva] Yes. So what we have seen here is that using the common SQL engine, we've been able to federate queries across multiple data sources. Db2 Warehouse in the cloud, IBM's Integrated Analytic System, and Hortonworks powered Hadoop platform for the news feeds. We've been able to use machine learning to derive innovative insights about his stock affinities. And drive the machine learning into the appliance. Closer to where the data resides to deliver high performance analytics. >> [Rob] At scale? >> [Siva] We're able to run millions of these correlations across stocks, currency, other factors. And even score hundreds of customers for their affinities on a daily basis. >> That's great. Siva, thank you for playing the role of financial adviser. So I just want to recap briefly. Cause this is really powerful technology that's really simple. So we federated, we aggregated multiple data sources from all over the web and internal systems. And public cloud systems. Machine learning models were built that predicted Leo's affinity for a certain industry. In this case, automotive. And then you see when you deploy analytics next to your data, even a financial adviser, just with the click of a button is getting instant answers so they can go be more productive in their next meeting. This whole idea of celebrity experiences for your customer, that's available for everybody, if you take advantage of these types of capabilities. Katie, I'll hand it back to you. >> Good stuff. Thank you Rob. Thank you Siva. Powerful demonstration on what we've been talking about all afternoon. And thank you again to Siva for helping us navigate. Should we give him one more round of applause? We're going to be back in just a moment to look at how we operationalize all of this data. But first, here's a message from me. If you're a part of a line of business, your main fear is disruption. You know data is the new gold that can create huge amounts of value. So does your competition. And they may be beating you to it. You're convinced there are new business models and revenue sources hidden in all the data. You just need to figure out how to leverage it. But with the scarcity of data scientists, you really can't rely solely on them. You may need more people throughout the organization that have the ability to extract value from data. And as a data science leader or data scientist, you have a lot of the same concerns. You spend way too much time looking for, prepping, and interpreting data and waiting for models to train. You know you need to operationalize the work you do to provide business value faster. What you want is an easier way to do data prep. And rapidly build models that can be easily deployed, monitored and automatically updated. So whether you're a data scientist, data science leader, or in a line of business, what's the solution? What'll it take to transform the way you work? That's what we're going to explore next. All right, now it's time to delve deeper into the nuts and bolts. The nitty gritty of operationalizing data science and creating a data driven culture. How do you actually do that?
Well that's what these experts are here to share with us. I'm joined by Nir Kaldero, who's head of data science at Galvanize, which is an education and training organization. Tricia Wang, who is co-founder of Sudden Compass, a consultancy that helps companies understand people with data. And last, but certainly not least, Michael Li, founder and CEO of Data Incubator, which is a data science training company. All right guys. Shall we get right to it? >> All right. >> So data explosion happening right now. And we are seeing it across the board. I just shared an example of how it's impacting my philanthropic work in pediatric cancer. But you guys each have so many unique roles in your business life. How are you seeing it just blow up in your fields? Nir, your thoughts? >> Yeah, for example like in Galvanize we train many Fortune 500 companies. And just looking at the demand from companies that want us to help them go through this digital transformation is mind-blowing. That's a data point by itself. >> Okay. Well what we're seeing is that data science, as a theme, is actually for everyone now. But what's happening is that it's actually meeting non technical people. But what we're seeing is that when non technical people are implementing these tools or coming at these tools without a base line of data literacy, they're often times using it in ways that distance themselves from the customer. Because they're implementing data science tools without a clear purpose, without a clear problem. And so what we do at Sudden Compass is that we work with companies to help them embrace and understand the complexity of their customers. Because often times they are misusing data science to try and flatten their understanding of the customer. As if you can just do more traditional marketing. Where you're putting people into boxes. And I think the whole ROI of data is that you can now understand people's relationships at a much more complex level and at a greater scale than before. But we have to do this with basic data literacy. And this has to involve technical and non technical people. >> Well you can have all the data in the world, and I think it speaks to this: if you're not doing the proper things with it, forget it. It means nothing. >> No absolutely. I mean, I think that when you look at the huge explosion in data, that comes with it a huge explosion in data experts. Right, we call them data scientists, data analysts. And sometimes they're people who are very, very talented, like the people here. But sometimes you have people who are maybe re-branding themselves, right? Trying to move up their title one notch to try to attract that higher salary. And I think that that's one of the things that customers are coming to us for, right? They're saying, hey look, there are a lot of people that call themselves data scientists, but we can't really distinguish. So, we run a fellowship where we help companies hire from a really talented group of folks, who are also truly data scientists and who know all those kind of really important data science tools. And we also help companies internally. Fortune 500 companies who are looking to grow that data science practice that they have. And we help clients like McKinsey, BCG, Bain, train up their customers, also their clients, also their workers to be more data talented. And to build up those data science capabilities. >> And Nir, this is something you work with a lot. A lot of Fortune 500 companies.
And when we were speaking earlier, you were saying many of these companies can be in a panic. >> Yeah. >> Explain that. >> Yeah, so you know, not all Fortune 500 companies are fully data driven. And we know that the winners in this fourth industrial revolution, which I like to call the machine intelligence revolution, will be companies who navigate and transform their organization to unlock the power of data science and machine learning. And the companies that are not like that, or don't utilize data science and predictive power well, will pretty much get shredded. So they are in a panic. >> Tricia, companies have to deal with data behind the firewall and in the new multi cloud world. How do organizations start to become data driven right to the core? >> I think the most urgent question to become data driven that companies should be asking is how do I bring the complex reality that our customers are experiencing on the ground into a corporate office? Into the data models. So that question is critical because that's how you actually prevent any big data disasters. And that's how you leverage big data. Because when your data models are really far from your human models, that's when you're going to do things that are really far off, that are not going to feel right. That's when Tesco had their terrible big data disaster that they're still recovering from. And so that's why I think it's really important to understand that when you implement big data, you have to further embrace thick data. The qualitative, the emotional stuff, that is difficult to quantify. But then comes the difficult art and science that I think is the next level of data science. Which is that getting non technical and technical people together to ask how do we find those unknown nuggets of insights that are difficult to quantify? Then, how do we do the next step of figuring out how do you mathematically scale those insights into a data model? So that it actually is reflective of human understanding? And then we can start making decisions at scale. But you have to have that first. >> That's absolutely right. And I think that when we think about what it means to be a data scientist, right? I always think about it in these sort of three pillars. You have the math side. You have to have that kind of stats, hardcore machine learning background. You have the programming side. You don't work with small amounts of data. You work with large amounts of data. You've got to be able to type the code to make those computers run. But then the last part is that human element. You have to understand the domain expertise. You have to understand what it is that I'm actually analyzing. What's the business proposition? And how are the clients, how are the users actually interacting with the system? That human element that you were talking about. And I think having somebody who understands all of those and not just in isolation, but is able to marry that understanding across those different topics, that's what makes a data scientist. >> But I find that we don't have people with those skill sets. And right now the way I see teams being set up inside companies is that they're creating these isolated data unicorns. These data scientists that have graduated from your programs, which are great. But they don't involve the people who are the domain experts. They don't involve the designers, the consumer insight people, the salespeople. The people who spend time with the customers day in and day out. Somehow they're left out of the room.
They're consulted, but they're not a stakeholder. >> Can I actually >> Yeah, yeah please. >> Can I actually give a quick example? So for example, we at Galvanize train the executives and the managers. And then the technical people, the data scientists and the analysts. But in order to actually see all of the ROI behind the data, you also have to have a creative fluid conversation between non technical and technical people. And this is a major trend now. And there's a major gap. And we need to increase awareness and kind of like create a new kind of environment where technical people also talk seamlessly with non technical ones. >> [Tricia] We call-- >> That's one of the things that we see a lot. Is one of the trends in-- >> A major trend. >> data science training is it's not just for the data science technical experts. It's not just for one type of person. So a lot of the training we do is sort of data engineers. People who are more on the software engineering side learning more about the stats and math. And then people who are sort of traditionally on the stat side learning more about the engineering. And then managers and people who are data analysts learning about both. >> Michael, I think you said something that was of interest too because I think we can look at IBM Watson as an example. And working in healthcare. The human component. Because often times we talk about machine learning and AI, and data and you get worried that you still need that human component. Especially in the world of healthcare. And I think that's a very strong point when it comes to the data analysis side. Is there any particular example you can speak to of that? >> So I think that there was this really excellent paper a while ago talking about all the neural net stuff trained on textual data. So looking at sort of different corpuses. And they found that these models were highly, highly sexist. They would read these corpuses and it's not because neural nets themselves are sexist. It's because they're reading the things that we write. And it turns out that we write kind of sexist things. And they would sort of find all these patterns in there that were sort of latent, that had a lot of sort of things that maybe we would cringe at if we saw them. And I think that's one of the really important aspects of the human element, right? It's being able to come in and sort of say like, okay, I know what the biases of the system are, I know what the biases of the tools are. I need to figure out how to use that to make the tools, make the world a better place. And like another area where this comes up all the time is lending, right? So the federal government has said, and we have a lot of clients in the financial services space, so they're constantly under these kinds of rules that they can't make discriminatory lending practices based on a whole set of protected categories. Race, sex, gender, things like that. But it's very easy when you train a model on credit scores to pick that up. And then to have a model that's inadvertently sexist or racist. And that's where you need the human element to come back in and say okay, look, you're using, the classic example would be zip code, you're using zip code as a variable. But when you look at it, zip code is actually highly correlated with race. And you can't do that.
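(Editor's note: the zip code point lends itself to a quick check. Below is a minimal Python sketch, under assumed column names, a placeholder data file, and an arbitrary threshold, of how a team might flag proxy variables before training a lending model: if a feature predicts the protected attribute well above the majority-class baseline, it is acting as a proxy. This is an illustration, not any particular lender's method.)

```python
# Flag candidate features that act as proxies for a protected attribute.
# Data file, column names, and the 0.05 margin are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

applicants = pd.read_csv("loan_applications.csv")       # placeholder data set
protected = applicants["race"]                          # protected attribute
baseline = protected.value_counts(normalize=True).max() # majority-class rate

for col in ["zip_code", "income", "credit_score"]:
    X = pd.get_dummies(applicants[[col]].astype(str))   # one-hot encode
    acc = cross_val_score(DecisionTreeClassifier(max_depth=5),
                          X, protected, cv=5).mean()
    if acc > baseline + 0.05:  # predicts the protected attribute above chance
        print(f"{col}: accuracy {acc:.2f} vs baseline {baseline:.2f} "
              "-> likely proxy, review before use")
```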
So you may, by sort of following the math and being a little naive about the problem, inadvertently introduce something really horrible into a model, and that's where you need a human element to sort of step in and say, okay hold on. Slow things down. This isn't the right way to go. >> And the people who have -- >> I feel like, I can feel her ready to respond. >> Yes, I'm ready. >> She's like let me have at it. >> Here it is. And the people who are really great at providing that human intelligence are social scientists. We are trained to look for bias and to understand bias in data. Whether it's quantitative or qualitative. And I really think that we're going to have fewer of these kinds of problems if we had more integrated teams. If it was a mandate from leadership to say no data science team should be without a social scientist, ethnographer, or qualitative researcher of some kind, to be able to help see these biases. >> The talent piece is actually the most crucial-- >> Yeah. >> one here. If you look at how to enable machine intelligence in an organization, there are the pillars that I have in my head, which are the culture, the talent and the technology infrastructure. And I believe, and I saw in working very closely with the Fortune 100 and 200 companies, that the talent piece is actually the most important, crucial, and hard to get. >> [Tricia] I totally agree. >> It's absolutely true. Yeah, no I mean I think that's sort of like how we came up with our business model. Companies were basically saying hey, I can't hire data scientists. And so we have a fellowship where we get 2,000 applicants each quarter. We take the top 2% and then we sort of train them up. And we work with hiring companies who then want to hire from that population. And so we're sort of helping them solve that problem. And the other half of it is really around training. Cause with a lot of industries, especially if you're sort of in a more regulated industry, there's a lot of nuances to what you're doing. And the fastest way to develop that data science or AI talent may not necessarily be to hire folks who are coming out of a PhD program. It may be to take folks internally who have a lot of that domain knowledge that you have and get them trained up on those data science techniques. So we've had large insurance companies come to us and say hey look, we hire three or four folks from you a quarter. That doesn't move the needle for us. What we really need is take the thousand actuaries and statisticians that we have and get all of them trained up to become a data scientist and become data literate in this new open source world. >> [Katie] Go ahead. >> All right, ladies first. >> Go ahead. >> Are you sure? >> No please, fight first. >> Go ahead. >> Go ahead Nir. >> So this is actually a trend that we have been seeing in the past year or so that companies kind of like start to look at how to upskill and look for talent within the organization. So they can actually move them to become more literate and navigate 'em from analyst to data scientist. And from data scientist to machine learner. So this is actually a trend that is happening already for a year or so. >> Yeah, but I also find that after they've gone through that training in getting people skilled up in data science, the next problem that I get is executives coming to say we've invested in all of this. We're still not moving the needle. We've already invested in the right tools. We've gotten the right skills.
We have enough scale of people who have these skills. Why are we not moving the needle? And what I explain to them is look, you're still making decisions in the same way. And you're still not involving enough of the non technical people. Especially from marketing, where the CMOs are now much more responsible for driving growth in their companies. But often times it's so hard to change the old way of marketing, which is still very segmentation based. You know, demographic variable based, and we're trying to move people to say no, you have to understand the complexity of customers and not put them in boxes. >> And I think underlying a lot of this discussion is this question of culture, right? >> Yes. >> Absolutely. >> How do you build a data driven culture? And I think that that culture question, one of the ways that comes up quite often, especially in large, Fortune 500 enterprises, is that they're not very comfortable with, for example, open source architecture. Open source tools. And there is some sort of residual bias that that's somehow dangerous. A security vulnerability. And I think that that's part of the cultural challenge that they often have in terms of how do I build a more data driven organization? Well a lot of the talent really wants to use these kinds of tools. And I mean, just to give you an example, we are partnering with one of the major cloud providers to sort of help make open source tools more user friendly on their platform. So trying to help them attract the best technologists to use their platform because they want and they understand the value of having that kind of open source technology work seamlessly on their platforms. So I think that just sort of goes to show you how important open source is in this movement. And how much large companies and Fortune 500 companies and a lot of the ones we work with have to embrace that. >> Yeah, and I'm seeing it in our work. Even when we're working with Fortune 500 companies, is that they've already gone through the first phase of data science work. Which I explain was all about the tools and getting the right tools and architecture in place. And then companies started moving into getting the right skill set in place. Getting the right talent. And what you're talking about with culture is really where I think we're talking about the third phase of data science, which is looking at communication of these technical frameworks so that we can get non technical people really comfortable in the same room with data scientists. That is going to be the phase, that's really where I see the pain point. And that's why at Sudden Compass, we're really dedicated to working with each other to figure out how do we solve this problem now? >> And I think that communication between the technical stakeholders and management and leadership. That's a very critical piece of this. You can't have a successful data science organization without that. >> Absolutely. >> And I think that actually some of the most popular trainings we've had recently are from managers and executives who are looking to say, how do I become more data savvy? How do I figure out what is this data science thing and how do I communicate with my data scientists? >> You guys made this way too easy. I was just going to get some popcorn and watch it play out. >> Nir, last 30 seconds. I want to leave you with an opportunity to add anything you want to this conversation.
>> I think one thing to conclude is to say that for companies that are not data driven, it's about time to hit refresh and figure out how they transition the organization to become data driven. To become agile and nimble so they can actually seize the opportunities from this important industrial revolution. Otherwise, unfortunately they will have a hard time surviving. >> [Katie] All agreed? >> [Tricia] Absolutely, you're right. >> Michael, Trish, Nir, thank you so much. Fascinating discussion. And thank you guys again for joining us. We will be right back with another great demo. Right after this. >> Thank you Katie. >> Once again, thank you for an excellent discussion. Weren't they great guys? And thank you for everyone who's tuning in on the live webcast. As you can hear, we have an amazing studio audience here. And we're going to keep things moving. I'm now joined by Daniel Hernandez and Siva Anne. And we're going to turn our attention to how you can deliver on what they're talking about using data science experience to do data science faster. >> Thank you Katie. Siva and I are going to spend the next 10 minutes showing you how you can deliver on what they were saying using the IBM Data Science Experience to do data science faster. We'll demonstrate through new features we introduced this week how teams can work together more effectively across the entire analytics life cycle. How you can take advantage of any and all data no matter where it is and what it is. How you could use your favorite tools from open source. And finally how you could build models anywhere and deploy them close to where your data is. Remember the financial adviser app Rob showed you? To build an app like that, we needed a team of data scientists, developers, data engineers, and IT staff to collaborate. We do this in the Data Science Experience through a concept we call projects. When I create a new project, I can now use the new Github integration feature. We're doing for data science what we've been doing for developers for years. Distributed teams can work together on analytics projects. And take advantage of Github's version management and change management features. This is a huge deal. Let's explore the project we created for the financial adviser app. As you can see, our data engineer Joane, our developer Rob, and others are collaborating on this project. Joane got things started by bringing together the trusted data sources we need to build the app. Taking a closer look at the data, we see that our customer and profile data is stored on our recently announced IBM Integrated Analytics System, which runs safely behind our firewall. We also needed macro economic data, which she was able to find in the Federal Reserve. And she stored it in our Db2 Warehouse on Cloud. And finally, she selected stock news data from NASDAQ.com and landed that in a Hadoop cluster, which happens to be powered by Hortonworks. We added a new feature to the Data Science Experience so that when it's installed with Hortonworks, it automatically uses the native security and governance controls within the cluster so your data is always secure and safe. Now we want to show you the news data we stored in the Hortonworks cluster. This is the main administrative console. It's powered by an open source project called Ambari. And here's the news data. It's in parquet files stored in HDFS, which happens to be a distributed file system.
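(Editor's note: for readers who want to explore data laid out like this, here is a generic PySpark sketch that reads parquet files out of HDFS and queries them with SQL. It stands in for, and is not, IBM Big SQL; the HDFS path and the column names are placeholders.)

```python
# Read the parquet news data from HDFS and query it with SQL.
# Path, table, and column names are placeholders for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("news-exploration").getOrCreate()

news = spark.read.parquet("hdfs:///data/nasdaq/news")  # placeholder path
news.createOrReplaceTempView("stock_news")

# Ten most-mentioned tickers in 2017, as a quick sanity check on the feed.
spark.sql("""
    SELECT ticker, COUNT(*) AS mentions
    FROM stock_news
    WHERE published_date >= '2017-01-01'
    GROUP BY ticker
    ORDER BY mentions DESC
    LIMIT 10
""").show()
```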
To get the data from NASDAQ into our cluster, we used IBM's BigIntegrate and BigQuality to create automatic data pipelines that acquire, cleanse, and ingest that news data. Once the data's available, we use IBM's Big SQL to query that data using SQL statements that are much like the ones we would use for any relational data, including the data that we have in the Integrated Analytics System and Db2 Warehouse on Cloud. This and the federation capabilities that Big SQL offers dramatically simplify data acquisition. Now we want to show you how we support a brand new tool that we're excited about. Since we launched last summer, the Data Science Experience has supported Jupyter and R for data analysis and visualization. In this week's update, we deeply integrated another great open source project called Apache Zeppelin. It's known for having great visualization support, advanced collaboration features, and is growing in popularity amongst the data science community. This is an example of Apache Zeppelin and the notebook we created through it to explore some of our data. Notice how wonderful and easy the data visualizations are. Now we want to walk you through the Jupyter notebook we created to explore our customer preference for stocks. We use notebooks to understand and explore data. To identify the features that have some predictive power. We're trying to assess what ultimately is driving customer stock preference. Here we did the analysis to identify the attributes of customers that are likely to purchase auto stocks. We used this understanding to build our machine learning model. For building machine learning models, we've always had tools integrated into the Data Science Experience. But sometimes you need to use tools you already invested in. Like our very own SPSS as well as SAS. Through a new import feature, you can easily import those models created with those tools. This helps you avoid vendor lock-in, and simplifies the development, training, deployment, and management of all your models. To build the models we used in the app, we could have coded, but we prefer a visual experience. We used our customer profile data in the Integrated Analytic System. Used the Auto Data Preparation to cleanse our data. Chose the binary classification algorithms. Let the Data Science Experience evaluate between logistic regression and gradient boosted tree. It's doing the heavy work for us. As you can see here, the Data Science Experience generated performance metrics that show us that the gradient boosted tree is the best performing algorithm for the data we gave it. Once we save this model, it's automatically deployed and available for developers to use. Any application developer can take this endpoint and consume it like they would any other API inside of the apps they built. We've made training and creating machine learning models super simple. But what about the operations? A lot of companies are struggling to ensure their model performance remains high over time. In our financial adviser app, we know that customer data changes constantly, so we need to always monitor model performance and ensure that our models are retrained as necessary. This is a dashboard that shows the performance of our models and lets our teams monitor and retrain those models so that they're always performing to our standards. So far we've been showing you the Data Science Experience available behind the firewall that we're using to build and train models.
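(Editor's note: the bake-off Daniel describes, evaluating logistic regression against a gradient boosted tree and keeping the winner, looks roughly like this when written by hand with scikit-learn instead of the visual builder. The data file, feature names, and label are placeholder assumptions, not the demo's actual schema.)

```python
# Compare the two binary classifiers the demo evaluates and report AUC.
# customer_profiles.csv and its columns are placeholders for illustration.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

profiles = pd.read_csv("customer_profiles.csv")
X = profiles[["age", "num_trades", "portfolio_value"]]  # assumed features
y = profiles["buys_auto_stocks"]                        # assumed binary label

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosted tree": GradientBoostingClassifier(),
}
for name, model in candidates.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC {auc:.3f}")
```

Once the winner is deployed behind a REST endpoint, as described above, an application consumes it with an ordinary HTTP POST of feature values.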
Through a new publish feature, you can build models and deploy them anywhere. In another environment, private, public, or anywhere else with just a few clicks. So here we're publishing our model to the Watson machine learning service. It happens to be in the IBM cloud. And also deeply integrated with our Data Science Experience. After publishing and switching to the Watson machine learning service, you can see that our stock affinity model that we just published is there and ready for use. So this is incredibly important. I just want to say it again. The Data Science Experience allows you to train models behind your own firewall, take advantage of your proprietary and sensitive data, and then deploy those models wherever you want with ease. So to summarize what we just showed you. First, IBM's Data Science Experience supports all teams. You saw how our data engineer populated our project with trusted data sets. Our data scientists developed, trained, and tested a machine learning model. Our developers used APIs to integrate machine learning into their apps. And how IT can use our Integrated Model Management dashboard to monitor and manage model performance. Second, we support all data. On premises, in the cloud, structured, unstructured, inside of your firewall, and outside of it. We help you bring analytics and governance to where your data is. Third, we support all tools. The data science tools that you depend on are readily available and deeply integrated. This includes capabilities from great partners like Hortonworks. And powerful tools like our very own IBM SPSS. And fourth, and finally, we support all deployments. You can build your models anywhere, and deploy them right next to where your data is. Whether that's in the public cloud, private cloud, or even on the world's most reliable transaction platform, IBM z. So see for yourself. Go to the Data Science Experience website, take us for a spin. And if you happen to be ready right now, our recently created Data Science Elite Team can help you get started and run experiments alongside you with no charge. Thank you very much. >> Thank you very much Daniel. It seems like a great time to get started. And thanks to Siva for taking us through it. Rob and I will be back in just a moment to add some perspective right after this. All right, once again joined by Rob Thomas. And Rob, obviously we got a lot of information here. >> Yes, we've covered a lot of ground. >> This is intense. You got to break it down for me cause I think we should zoom out and see the big picture. What can better data science deliver to a business? Why is this so important? I mean we've heard it through and through. >> Yeah, well, I heard it a couple times. But it starts with businesses have to embrace a data driven culture. And it is a change. And we need to make data accessible with the right tools in a collaborative culture because we've got diverse skill sets in every organization. But data driven companies succeed when data science tools are in the hands of everyone. And I think that's a new thought. I think most companies think just get your data scientist some tools, you'll be fine. This is about tools in the hands of everyone. I think the panel did a great job of describing how we get to data science for all.
Building a data culture, making it a part of your everyday operations, and the highlights of what Daniel just showed us, that's some pretty cool features for how organizations can get to this, which is you can see IBM's Data Science Experience, how that supports all teams. You saw data analysts, data scientists, application developers, IT staff, all working together. Second, you saw how we support all tools. And your choice of tools. So the most popular data science libraries integrated into one platform. And we saw some new capabilities that help companies avoid lock-in, where you can import existing models created from specialist tools like SPSS or others. And then deploy them and manage them inside of Data Science Experience. That's pretty interesting. And lastly, you see we continue to build on this best of open tools. Partnering with companies like H2O, Hortonworks, and others. Third, you can see how you use all data no matter where it lives. That's a key challenge every organization's going to face. Private, public, federating all data sources. We announced a new integration with the Hortonworks data platform where we deploy machine learning models where your data resides. That's been a key theme. Analytics where the data is. And lastly, supporting all types of deployments. Deploy them in your Hadoop cluster. Deploy them in your Integrated Analytic System. Or deploy them in z, just to name a few. A lot of different options here. But look, don't believe anything I say. Go try it for yourself. Data Science Experience, anybody can use it. Go to datascience.ibm.com and look, if you want to start right now, we just created a team that we call Data Science Elite. These are the best data scientists in the world that will come sit down with you and co-create solutions, models, and prove out a proof of concept. >> Good stuff. Thank you Rob. So you might be asking what does an organization look like that embraces data science for all? And how could it transform your role? I'm going to head back to the office and check it out. Let's start with the perspective of the line of business. What's changed? Well, now you're starting to explore new business models. You've uncovered opportunities for new revenue sources in all that hidden data. And being disrupted is no longer keeping you up at night. As a data science leader, you're beginning to collaborate with a line of business to better understand and translate the objectives into the models that are being built. Your data scientists are also starting to collaborate with the less technical team members and analysts who are working closest to the business problem. And as a data scientist, you stop feeling like you're falling behind. Open source tools are keeping you current. You're also starting to operationalize the work that you do. And you get to do more of what you love. Explore data, build models, put your models into production, and create business impact. All in all, it's not a bad scenario. Thanks. All right. We are back and coming up next, oh this is a special time right now. Cause we got a great guest speaker. New York Magazine called him the spreadsheet psychic and number crunching prodigy who went from correctly forecasting baseball games to correctly forecasting presidential elections. He even invented a proprietary algorithm called PECOTA for predicting future performance by baseball players and teams. And his New York Times bestselling book, The Signal and the Noise was named by Amazon.com as the number one best non-fiction book of 2012.
He's currently the Editor in Chief of the award winning website, FiveThirtyEight and appears on ESPN as an on air commentator. Big round of applause. My pleasure to welcome Nate Silver. >> Thank you. We met backstage. >> Yes. >> It feels weird to re-shake your hand, but you know, for the audience. >> I had to give the intense firm grip. >> Definitely. >> The ninja grip. So you and I have crossed paths kind of digitally in the past, which is really interesting, as I started my career at ESPN. And I started as a production assistant, then later back on air for sports technology. And I go to you to talk about sports because-- >> Yeah. >> Wow, has ESPN upped their game in terms of understanding the importance of data and analytics. And what it brings. Not just to MLB, but across the board. >> No, it's really infused into the way they present the broadcast. You'll have win probability on the bottom line. And they'll incorporate FiveThirtyEight metrics into how they cover college football for example. So, ESPN ... Sports is maybe the perfect, if you're a data scientist, like the perfect kind of test case. And the reason being that sports consists of problems that have rules. And have structure. And when problems have rules and structure, then it's a lot easier to work with. So it's a great way to kind of improve your skills as a data scientist. Of course, there are also important real world problems that are more open ended, and those present different types of challenges. But it's such a natural fit. The teams. Think about the teams playing the World Series tonight. The Dodgers and the Astros are both like very data driven, especially Houston. Golden State Warriors, the NBA Champions, extremely data driven. New England Patriots, relative to an NFL team, it's shifted a little bit, the NFL bar is lower. But the Patriots are certainly very analytical in how they make decisions. So, you can't talk about sports without talking about analytics. >> And I was going to save the baseball question for later. Cause we are moments away from game seven. >> Yeah. >> Is everyone else watching game seven? It's been an incredible series. Probably one of the best of all time. >> Yeah, I mean-- >> You have a prediction here? >> You can mention that too. So I don't have a prediction. FiveThirtyEight has the Dodgers with a 60% chance of winning. >> [Katie] LA Fans. >> So you have two teams that are about equal. But the Dodgers pitching staff is in better shape at the moment. The end of a seven game series. And they're at home. >> But the statistics behind the two teams is pretty incredible. >> Yeah. It's like the first World Series in I think 56 years or something where you have two 100 win teams facing one another. There has been a lot of parity in baseball for a lot of years. Not that many overall offensive juggernauts. But this year, and last year with the Cubs and the Indians too really. But this year, you have really spectacular teams in the World Series. It kind of is a showcase of modern baseball. Lots of home runs. Lots of strikeouts. >> [Katie] Lots of extra innings. >> Lots of extra innings. Good defense. Lots of pitching changes. So if you love the modern baseball game, it's been about the best example that you've had. If you like a little bit more contact, and fewer strikeouts, maybe not so much. But it's been a spectacular and very exciting World Series. >> It's amazing to talk MLB. It's huge with analysis. I mean, hands down. But across the board, if you can provide a few examples.
Because there's so many teams in front offices putting such an, just a heavy intensity on the analysis side. And where the teams are going. And if you could provide any specific examples of teams that have really blown your mind. Especially over the last year or two. Because every year it gets more exciting if you will. >> I mean, so a big thing in baseball is defensive shifts. So if you watch tonight, you'll probably see a couple of plays where if you're used to watching baseball, a guy makes really solid contact. And there's a fielder there that you don't think should be there. But that's really very data driven where you analyze where this guy hits the ball. That part's not so hard. But also there's game theory involved. Because you have to adjust for the fact that he knows where you're positioning the defenders. He's trying therefore to make adjustments to his own swing and so that's been a major innovation in how baseball is played. You know, how bullpens are used too. Where teams have realized that actually having a guy, across all sports pretty much, realizing the importance of rest. And of fatigue. And that you can be the best pitcher in the world, but guess what? After four or five innings, you're probably not as good as a guy who has a fresh arm necessarily. So I mean, it really is like, these are not subtle things anymore. It's not just oh, on base percentage is valuable. It really affects kind of every strategic decision in baseball. The NBA, if you watch an NBA game tonight, see how many three point shots are taken. That's in part because of data. And teams realizing hey, three points is worth more than two, once you're more than about five feet from the basket, the shooting percentage gets really flat. And so it's revolutionary, right? Like teams that will shoot almost half their shots from the three point range nowadays. Larry Bird, who wound up being one of the greatest three point shooters of all time, took only eight three pointers his first year in the NBA. It's quite noticeable if you watch baseball or basketball in particular. >> Not to focus too much on sports. One final question. In terms of Major League Soccer, and now in NFL, we're having the analysis and having wearables where it can now showcase if they wanted to on screen, heart rate and breathing and how much exertion. How much data is too much data? And when does it ruin the sport? >> So, I don't think, I mean, again, it goes sport by sport a little bit. I think in basketball you actually have a more exciting game. I think the game is more open now. You have more three pointers. You have guys getting higher assist totals. But you know, I don't know. I'm not one of those people who thinks look, if you love baseball or basketball, and you go to work for the Astros, the Yankees or the Knicks, they probably need some help, right? You really have to be passionate about that sport. Because it's all based on what questions am I asking? As I'm a fan or I guess an employee of the team. Or a player watching the game. And there isn't really any substitute I don't think for the insight and intuition that a curious human has to kind of ask the right questions. So we can talk at great length about what tools do you then apply when you have those questions, but that still comes from people. I don't think machine learning could help with what questions do I want to ask of the data. It might help you get the answers.
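(Editor's note: the three-point argument Nate makes here is a one-line expected-value calculation. The shooting percentages below are round illustrative numbers, not league data.)

```python
# Expected points per shot: make probability times point value.
shots = {
    "mid-range two": (0.40, 2),
    "corner three":  (0.38, 3),
    "layup":         (0.60, 2),
}

for name, (p_make, points) in shots.items():
    print(f"{name}: {p_make * points:.2f} expected points per shot")
# A 38% three (1.14) comfortably beats a 40% mid-range two (0.80),
# which is why shot charts migrated to the rim and the arc.
```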
>> If you have a mid-fielder in a soccer game though, not exerting, only 80%, and you're seeing that on a screen as a fan, and you're saying could that person get fired at the end of the day? One day, with the data? >> So we found that actually, in soccer in particular, some of the better players are actually more still. So Leo Messi, maybe the best player in the world, doesn't move as much as other soccer players do. And the reason being that A) he kind of knows how to position himself in the first place. B) he realizes that if you make a run and you're out of position, that's quite fatiguing. And particularly soccer, like basketball, is a sport where it's incredibly fatiguing. And so, sometimes the guys who conserve their energy, that kind of old school mentality that you have to hustle at every moment, that is not helpful to the team if you're hustling on an irrelevant play. And therefore, on a critical play, can't get back on defense, for example. >> Sports, but also data is moving exponentially as we're just speaking about today. Tech, healthcare, every different industry. Is there any particular one that's a favorite of yours to cover? And I imagine they're all different as well. >> I mean, I do like sports. We cover a lot of politics too. Which is different. I mean in politics I think people aren't intuitively as data driven as they might be in sports for example. It's impressive to follow the breakthroughs in artificial intelligence. It started out just as kind of playing games and playing chess and poker and Go and things like that. But you really have seen a lot of breakthroughs in the last couple of years. But yeah, it's kind of infused into everything really. >> You're known for your work in politics though. Especially presidential campaigns. >> Yeah. >> This year, in particular. Was it insanely challenging? What was the most notable thing that came out of any of your predictions? >> I mean, in some ways, looking at the polling was the easiest lens to look at it. So I think there's kind of a myth that last year's result was a big shock and it wasn't really. If you did the modeling in the right way, then you realized that number one, polls have a margin of error. And so when a candidate has a three point lead, that's not particularly safe. Number two, the outcome between different states is correlated. Meaning that it's not that much of a surprise that Clinton lost Wisconsin and Michigan and Pennsylvania and Ohio. You know I'm from Michigan. Have friends from all those states. Kind of the same types of people in those states. Those outcomes are all correlated. So what people thought was a big upset for the polls I think was an example of how data science done carefully and correctly where you understand probabilities, understand correlations. Our model gave Trump a 30% chance of winning. Other models gave him a 1% chance. And so that was interesting in that it showed that number one, that modeling strategies and skill do matter quite a lot. When you have someone saying 30% versus 1%. I mean, that's a very very big spread. And number two, that these aren't like solved problems necessarily. Although again, the problem with elections is that you only have one election every four years. So even if I'm very confident that I have a better model, one year of data doesn't really prove very much. Even five or 10 years doesn't really prove very much.
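(Editor's note: the correlation point is easy to see in a toy Monte Carlo. The sketch below simulates three states whose polling errors share a common national component, versus fully independent errors; the 3-point lead and the error sizes are illustrative assumptions, not FiveThirtyEight's actual model.)

```python
# Toy simulation: correlated state errors make a sweep upset far more
# likely than independent errors, even with the same per-state variance.
import random

def sweep_upset_prob(n_sims=100_000, lead=3.0, shared_sd=3.0, state_sd=2.0):
    upsets = 0
    for _ in range(n_sims):
        national = random.gauss(0, shared_sd)  # error that hits every state
        margins = [lead + national + random.gauss(0, state_sd)
                   for _ in range(3)]          # e.g. WI, MI, PA
        if all(m < 0 for m in margins):        # the poll leader loses all three
            upsets += 1
    return upsets / n_sims

print(f"correlated errors:  {sweep_upset_prob():.3f}")
# Same total per-state spread (sqrt(3^2 + 2^2) ~= 3.6), but independent:
print(f"independent errors: {sweep_upset_prob(shared_sd=0.0, state_sd=3.6):.3f}")
```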
And so, being aware of the limitations to some extent intrinsically in elections when you only get one kind of new training example every four years, there's not really any way around that. There are ways to be more robust to sparse data environments. But if you're identifying different types of business problems to solve, figuring out what's a solvable problem where I can add value with data science is a really key part of what you're doing. >> You're such a leader in this space. In data and analysis. It would be interesting to kind of peek behind the curtain, understand how you operate but also how large is your team? How you're putting together information. How quickly you're putting it out. Cause I think in this right now world where everybody wants things instantly-- >> Yeah. >> There's also, you want to be first too in the world of journalism. But you don't want to be inaccurate because that's your credibility. >> We talked about this before, right? I think on average, speed is a little bit overrated in journalism. >> [Katie] I think it's a big problem in journalism. >> Yeah. >> Especially in the tech world. You have to be first. You have to be first. And it's just pumping out, pumping out. And there's got to be more time spent on stories if I can speak subjectively. >> Yeah, for sure. But at the same time, we are reacting to the news. And so we have people that come in, we hire most of our people actually from journalism. >> [Katie] How many people do you have on your team? >> About 35. But, if you get someone who comes in from an academic track for example, they might be surprised at how fast journalism is. That even though we might be slower than the average website, the fact that there's a tragic event in New York, are there things we have to say about that? A candidate drops out of the presidential race, are there things we have to say about that? In periods ranging from minutes to days as opposed to kind of weeks to months to years in the academic world. The corporate world moves faster. What is a little different about journalism is that you are expected to have more precision where people notice when you make a mistake. In corporations, you have maybe less transparency. If you make 10 investments and seven of them turn out well, then you'll get a lot of profit from that, right? In journalism, it's a little different. If you make kind of 10 predictions or say 10 things, and seven of them are very accurate and three of them aren't, you'll still get criticized a lot for the three. Just because that's kind of the way that journalism is. And so the kind of combination of needing, not having that much tolerance for mistakes, but also needing to be fast. That is tricky. And I criticize other journalists sometimes including for not being data driven enough, but the best excuse any journalist has, this is happening really fast and it's my job to kind of figure out in real time what's going on and provide useful information to the readers. And that's really difficult. Especially in a world where literally, I'll probably get off the stage and check my phone and who knows what President Trump will have tweeted or what things will have happened. But it really is a kind of 24/7. >> Well because it's 24/7 with FiveThirtyEight, one of the most well known sites for data, are you feeling micromanagey on your people? Because you do have to hit this balance. You can't have something come out four or five days later. >> Yeah, I'm not -- >> Are you overseeing everything?
>> I'm not by nature a micromanager. And so you try to hire well. You try and let people make mistakes. And the flip side of this is that if a news organization never had any mistakes, never had any corrections, that's wrong, right? You have to have some tolerance for error because you are trying to decide things in real time. And figure things out. I think transparency's a big part of that. Say here's what we think, and here's why we think it. If we have a model, to say it's not just the final number, here's a lot of detail about how that's calculated. In some cases we release the code and the raw data. Sometimes we don't because there's a proprietary advantage. But quite often we're saying we want you to trust us and it's so important that you trust us, here's the model. Go play around with it yourself. Here's the data. And that's also I think an important value. >> That speaks to open source. And your perspective on that in general. >> Yeah, I mean, look, I'm a big fan of open source. I worry that I think sometimes the trends are a little bit away from open source. But by the way, one thing that happens when you share your data or you share your thinking at least in lieu of the data, and you can definitely do both, is that readers will catch embarrassing mistakes that you made. By the way, even having open sourceness within your team, I mean we have editors and copy editors who often save you from really embarrassing mistakes. And by the way, it's not necessarily people who have a training in data science. I would guess that of our 35 people, maybe only five to 10 have a kind of formal background in what you would call data science. >> [Katie] I think that speaks to the theme here. >> Yeah. >> [Katie] That everybody's kind of got to be data literate. >> But yeah, it is like you have a good intuition. You have a good BS detector basically. And you have a good intuition for hey, this looks a little bit out of line to me. And sometimes that can be based on domain knowledge, right? We have one of our copy editors, she's a big college football fan. And we had an algorithm we released that tries to predict what the human being selection committee will do, and she was like, why is LSU rated so high? Cause I know that LSU sucks this year. And we looked at it, and she was right. There was a bug where it had forgotten to account for their last game where they lost to Troy or something and so -- >> That also speaks to the human element as well. >> It does. As a general rule, if you're designing a kind of regression based model, it's different in machine learning, where you kind of build in the tolerance for error. But if you're trying to do something more precise, then so much of it is just debugging. It's saying that looks wrong to me. And I'm going to investigate that. And sometimes it's not wrong. Sometimes your model actually has an insight that you didn't have yourself. But fairly often, it is. And I think kind of what you learn is like, hey if there's something that bothers me, I want to go investigate that now and debug that now. Because the last thing you want is where all of a sudden, the answer you're putting out there in the world hinges on a mistake that you made. Cause you never know if you have, so to speak, 1,000 lines of code and they all perform something differently. You never know when you get in a weird edge case where this one decision you made winds up being the difference between your having a good forecast and a bad one.
In a defensible position and an indefensible one. So we definitely are quite diligent and careful. But it's also kind of knowing like, hey, where is an approximation good enough and where do I need more precision? Cause you could also drive yourself crazy in the other direction where you know, it doesn't matter if the answer is 91.2 versus 90. And so you can kind of go 91.2, three, four and it's like kind of A) false precision and B) not a good use of your time. So that's where I do still spend a lot of time is thinking about which problems are "solvable" or approachable with data and which ones aren't. And when they're not by the way, you're still allowed to report on them. We are a news organization so we do traditional reporting as well. And then kind of figuring out when do you need precision versus when is being pointed in the right direction good enough? >> I would love to get inside your brain and see how you operate on just like an everyday walking to Walgreens movement. It's like oh, if I cross the street in .2-- >> It's not, I mean-- >> Is it like maddening in there? >> No, not really. I mean, I'm like-- >> This is an honest question. >> If I'm looking for airfares, I'm a little more careful. But no, part of it's like you don't want to waste time on unimportant decisions, right? I will sometimes, if I can't decide what to eat at a restaurant, I'll flip a coin. If the chicken and the pasta both sound really good-- >> That's not high tech Nate. We want better. >> But that's the point, right? It's like both the chicken and the pasta are going to be really darn good, right? So I'm not going to waste my time trying to figure it out. I'm just going to have an arbitrary way to decide. >> Seriously though, in business, how organizations in the last three to five years have just evolved with this data boom. How are you seeing it from a consultant's point of view? Do you think it's an exciting time? Do you think it's a you must act now time? >> I mean, we do know that you definitely see a lot of talent among the younger generation now. So FiveThirtyEight has been at ESPN for four years now. And man, the quality of the interns we get has improved so much in four years. The quality of the kind of young hires that we make straight out of college has improved so much in four years. So you definitely do see a younger generation for which this is just part of their bloodstream and part of their DNA. And also, particular fields that we're interested in. So we're interested in people who have both a data and a journalism background. We're interested in people who have a visualization and a coding background. A lot of what we do is very much interactive graphics and so forth. And so we do see those skill sets coming into play a lot more. And so the kind of shortage of talent that had I think frankly been a problem for a long time, I'm optimistic based on the young people in our office, it's a little anecdotal but you can tell that there are so many more programs that are kind of teaching students the right set of skills that maybe weren't taught as much a few years ago. >> But when you're seeing these big organizations, ESPN as a perfect example, moving more towards data and analytics than ever before. >> Yeah. >> You would say that's obviously true. >> Oh for sure. >> If you're not moving that direction, you're going to fall behind quickly. >> Yeah and the thing is, if you read my book or I guess people have a copy of the book.
In some ways it's saying hey, there are a lot of ways to screw up when you're using data. And we've built bad models. We've had models that were bad and got good results. Good models that got bad results and everything else. But the point is that the reason to be out in front of the problem is so you give yourself more runway to make errors and mistakes. And to learn kind of what works and what doesn't and which people to put on the problem. I sometimes do worry that a company says oh we need data. And everyone kind of agrees on that now. We need data science. Then they have some big test case. And they have a failure. And they maybe have a failure because they didn't know really how to use it well enough. But learning from that and iterating on that. And so by the time that you're on the third generation of kind of a problem that you're trying to solve, and you're watching everyone else make the mistake that you made five years ago, I mean, that's really powerful. But that doesn't mean that getting invested in it now, getting invested both in technology and the human capital side is important. >> Final question for you as we run out of time. 2018 beyond, what is your biggest project in terms of data gathering that you're working on? >> There's a midterm election coming up. That's a big thing for us. We're also doing a lot of work with NBA data. So for four years now, the NBA has been collecting player tracking data. So they have 3D cameras in every arena. So they can actually kind of quantify, for example, how fast a fast break is. Or literally where a player is and where the ball is. For every NBA game now for the past four or five years. And there hasn't really been an overall metric of player value that's taken advantage of that. The teams do it. But in the NBA, the teams are a little bit ahead of journalists and analysts. So we're trying to have a really truly next generation stat. It's a lot of data. Sometimes I now oversee things more than I once did them myself. And so you're parsing through many, many, many lines of code. But yeah, so we hope to have that out at some point in the next few months. >> Anything you've personally been passionate about that you've wanted to work on and kind of solve? >> I mean, the NBA thing, I am a pretty big basketball fan. >> You can do better than that. Come on, I want something real personal that you're like I got to crunch the numbers. >> You know, we tried to figure out where the best burrito in America was a few years ago. >> I'm going to end it there. >> Okay. >> Nate, thank you so much for joining us. It's been an absolute pleasure. Thank you. >> Cool, thank you. >> I thought we were going to chat World Series, you know. Burritos, important. I want to thank everybody here in our audience. Let's give him a big round of applause. >> [Nate] Thank you everyone. >> Perfect way to end the day. And for a replay of today's program, just head on over to ibm.com/dsforall. I'm Katie Linendoll. And this has been Data Science for All: It's a Whole New Game. Test one, two. One, two, three. Hi guys, I just want to quickly let you know as you're exiting. A few heads up. Downstairs right now there's going to be a meet and greet with Nate. And we're going to be doing that with clients and customers who are interested. So I would recommend before the game starts, and you lose Nate, head on downstairs. And also the gallery is open until eight p.m. with demos and activations. And tomorrow, make sure to come back too. Because we have exciting stuff.
I'll be joining you as your host. And we're kicking off at nine a.m. So bye everybody, thank you so much. >> [Announcer] Ladies and gentlemen, thank you for attending this evening's webcast. If you are not attending our Cloud and Cognitive Summit tomorrow, we ask that you recycle your name badge at the registration desk. Thank you. Also, please note there are two exits at the back of the room, on either side of the room. Have a good evening. Ladies and gentlemen, the meet and greet will be on stage. Thank you.
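A quick footnote on Nate's false-precision point above (91.2 versus 90): the usual check is to compare the gap between two estimates against their margin of error. The sketch below is ours, not Nate's or FiveThirtyEight's, and every number in it is invented; it only illustrates the arithmetic.

```python
import math

# Hypothetical: two scores, 91.2 and 90.0, each an average over
# n noisy observations with standard deviation sigma (made-up numbers).
n = 300
sigma = 12.0

se = sigma / math.sqrt(n)   # standard error of the mean, ~0.69
moe = 1.96 * se             # 95% margin of error, ~1.36

diff = 91.2 - 90.0          # 1.2
print(f"standard error {se:.2f}, 95% margin {moe:.2f}, difference {diff:.2f}")

# The 1.2-point gap sits inside the margin of error, so quoting
# 91.2 rather than "about 90" is false precision.
print("real difference" if diff > moe else "false precision")
```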

Published Date : Nov 1 2017


SENTIMENT ANALYSIS :

ENTITIES

Entity                         Category       Confidence
Tricia Wang                    PERSON         0.99+
Katie                          PERSON         0.99+
Katie Linendoll                PERSON         0.99+
Rob                            PERSON         0.99+
Google                         ORGANIZATION   0.99+
Joane                          PERSON         0.99+
Daniel                         PERSON         0.99+
Michael Li                     PERSON         0.99+
Nate Silver                    PERSON         0.99+
Apple                          ORGANIZATION   0.99+
Hortonworks                    ORGANIZATION   0.99+
Trump                          PERSON         0.99+
Nate                           PERSON         0.99+
Honda                          ORGANIZATION   0.99+
Siva                           PERSON         0.99+
McKinsey                       ORGANIZATION   0.99+
Amazon                         ORGANIZATION   0.99+
Larry Bird                     PERSON         0.99+
2017                           DATE           0.99+
Rob Thomas                     PERSON         0.99+
Michigan                       LOCATION       0.99+
Yankees                        ORGANIZATION   0.99+
New York                       LOCATION       0.99+
Clinton                        PERSON         0.99+
IBM                            ORGANIZATION   0.99+
Tesco                          ORGANIZATION   0.99+
Michael                        PERSON         0.99+
America                        LOCATION       0.99+
Leo                            PERSON         0.99+
four years                     QUANTITY       0.99+
five                           QUANTITY       0.99+
30%                            QUANTITY       0.99+
Astros                         ORGANIZATION   0.99+
Trish                          PERSON         0.99+
Sudden Compass                 ORGANIZATION   0.99+
Leo Messi                      PERSON         0.99+
two teams                      QUANTITY       0.99+
1,000 lines                    QUANTITY       0.99+
one year                       QUANTITY       0.99+
10 investments                 QUANTITY       0.99+
NASDAQ                         ORGANIZATION   0.99+
The Signal and the Noise       TITLE          0.99+
Tricia                         PERSON         0.99+
Nir Kaldero                    PERSON         0.99+
80%                            QUANTITY       0.99+
BCG                            ORGANIZATION   0.99+
Daniel Hernandez               PERSON         0.99+
ESPN                           ORGANIZATION   0.99+
H2O                            ORGANIZATION   0.99+
Ferrari                        ORGANIZATION   0.99+
last year                      DATE           0.99+
18                             QUANTITY       0.99+
three                          QUANTITY       0.99+
Data Incubator                 ORGANIZATION   0.99+
Patriots                       ORGANIZATION   0.99+

Rob Thomas, IBM | Big Data NYC 2017


 

>> Voiceover: Live from midtown Manhattan, it's theCUBE! Covering Big Data New York City 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsors. >> Okay, welcome back everyone, live in New York City, this is theCUBE's coverage of, eighth year doing Hadoop World now, evolved into Strata Hadoop, now called Strata Data, it's had many incarnations, but O'Reilly Media running their event in conjunction with Cloudera, mainly an O'Reilly Media show. We do our own show called Big Data NYC here with our community, with theCUBE bringing you the best interviews, the best people, entrepreneurs, thought leaders, experts, to get the data and try to project the future and help users find the value in data. My next guest is Rob Thomas, who is the General Manager of IBM Analytics, a theCUBE alumni, been on multiple times, successfully executing in the San Francisco Bay Area. Great to see you again. >> Yeah John, great to see you, thanks for having me. >> You know, IBM has really been interesting through its own transformation, and a lot of people will throw IBM in that category, but you guys have been transforming okay, and the scoreboard has yet to show, in my mind, what's truly happening, because if you still look at this industry, we're only eight years into what Hadoop has now evolved into as large data sets, but the analytics game just seems to be getting started, with the cloud now coming over the top; you're starting to see a lot of cloud conversations in the air. Certainly there's a lot of AI washing, you know, AI this, but it's machine learning and deep learning at the heart of it as innovation, but a lot more work on the analytics side is coming. You guys are at the center of that. What's the update? What's your view of this analytics market? >> Most enterprises struggle with complexity. That's the number one problem when it comes to analytics. It's not imagination, it's not willpower, in many cases, it's not even investment, it's just complexity. We are trying to make data really simple to use, and the way I would describe it is we're moving from a world of products to platforms. Today, if you want to go solve a data governance problem, you're typically integrating 10, 15 different products. And the burden then is on the client. So, we're trying to make analytics a platform game. And my view is an enterprise has to have three platforms if they're serious about analytics. They need a data management platform for managing all types of data, public, private cloud. They need unified governance, so governance of all types of data, and they need a data science platform for machine learning. If a client has those three platforms, they will be successful with data. And what I see now is really mixed. We've got 10 products that do that, five products that do this, but it has to be integrated in a platform. >> You as in IBM, or the customer has these tools? >> Yeah, when I go see clients that's what I see is data... >> John: Disparate data log. >> Yeah, they have disparate tools, and so we are unifying what we deliver from a product perspective to this platform concept. >> You guys announced an integrated analytic system, got to see my notes here, I want to get into that in a second, but interesting you bring up the word platform because, you know, platforms have always been kind of reserved for the big supplier, but you're talking about customers having a platform, not a supplier delivering a platform per se, 'cause this is where the integration thing becomes interesting.
We were joking yesterday on theCUBE here, kind of ad hoc, conceptually, like the world has turned into a tool shed. I mean, everyone has a tool shed or knows someone that has a tool shed where you have the tools in the back and they're rusty. And so this brings up the tool conversation: there are too many tools out there that try to be platforms. >> Rob: Yes. >> And if you have too many tools, you're not really doing the platform game right. And complexity also shows up when the hammer you bought turns into a lawn mower. Right, so a lot of these companies have been groping around, trying to iterate what their tool was into something else it wasn't built for. So, as the industry evolves, that's natural Darwinism if you will, they will fall by the wayside. So talk about that dynamic, because you still need tooling >> Rob: Yes. >> but tooling will be a function of the work, as Peter Burris would say, so talk about how does a customer really get that platform out there without sacrificing the tooling that they may have bought or want to get rid of. >> Well, so think about, in an enterprise today, what the data architecture looks like: I've got this box that has this software on it, use your terms, has these types of tools on it, and it's isolated, and if you want a different set of tooling, okay, move that data to this other box where we have the other tooling. So it's very isolated in terms of how platforms, or technology platforms, have evolved today. When I talk about an integrated platform, we are big contributors to Kubernetes. We're making that foundational in terms of what we're doing on private cloud and public cloud. If you move to that model, suddenly what was a bunch of disparate tools are now microservices against a common architecture. And so it totally changes the nature of the data platform in an enterprise. It's a much more fluid data layer. The term I use sometimes is you have data as a service now, available to all your employees. That's totally different than: I want to do this project, so step one, make room in the data center, step two, bring in a server. It's a much more flexible approach, so that's what I mean when I say platform. >> So operationalizing it is a lot easier than just going down the linear path of provisioning. All right, so let's bring up the complexity issue, because integrated and unified are two different concepts that kind of mean the same thing depending on how you look at it. When you look at the data integration problem, you've got all this complexity around governance; it's a lot of moving parts with data. How does a customer actually execute without compromising the integrity of the policies that they need to have in place? So in other words, what are the baby steps that someone can take, that customers can take with what you guys are doing with them? How do they get into the game, how do they take steps towards the outcome? They might not have the big money to push it all at once; they might want to take a risk management approach. >> I think there's a clear recipe for doing this right, and we have experience of doing it well and doing it not so well, so over time we've gotten some, I'd say, a pretty good perspective on that. My view is very simple: data governance has to start with a catalog. And the analogy I use is, you have to do for data what libraries do for books. And think about a library: the first thing you do with books, card catalog. You know where, you basically itemize everything, you know exactly where it sits.
If you've got multiple copies of the same book, you can distinguish between which one is which. As books get older they go to archives, to microfilm or something like that. That's what you have to do with your data. >> On the front end. >> On the front end. And it starts with a catalog. And the reason I say that is, I see some organizations that start with, hey, let's go start ETL, I'll create a new warehouse, create a new Hadoop environment. That might be the right thing to do, but without having a basis of what you have, which is the catalog, that's where I think clients need to start. >> Well, I would just add one more level of complexity, just to kind of reinforce, first of all I agree with you, but here's another example that would reinforce this step. Let's just say you write some machine learning and some algorithms, and a new policy from the government comes down. Hey, you know, we're dealing with Bitcoin differently or whatever, some GDPR kind of thing happens where someone gets hacked and a new law comes out. How do you inject that policy? You've got to rewrite the code, so I'm thinking that if you do this right, you don't have to do a lot of rewriting of applications; the library or the catalog will handle it. Is that right, am I getting that right? >> That's right, 'cause then you have a baseline, is what I would describe it as. It's codified in the form of a data model, or in the form of an ontology for how you're looking at unstructured data. You have a baseline, so then as changes come, you can easily adjust to those changes. Where I see clients struggle is, if you don't have that baseline, then you're constantly trying to change things on the fly, and that makes it really hard to get to this... >> Well, really hard, expensive, they have to rewrite apps. >> Exactly. >> Rewrite algorithms and machine learning things that were built, probably by people that maybe left the company, who knows, right? So the consequences are pretty grave, I mean, pretty big. >> Yes. >> Okay, so let's get back to something that you said yesterday. You were on theCUBE yesterday with Hortonworks CEO Rob Bearden, and you were commenting about AI, or AI washing. You said, quote, "You can't have AI without IA." A play on letters there, a sequence of letters, which was really an interesting comment; we kind of referenced it pretty much all day yesterday. Information architecture is the IA, and AI is the artificial intelligence, basically saying if you don't have some sort of architecture, AI really can't work. Which really means models have to be understood, with the learning machine kind of approach. Expand more on that, 'cause that was, I think, a fundamental thing that we're seeing at the show this week, here in New York: a model for the models. Who trains the machine learning? Machines have got to learn somewhere too, so there's learning for the learning machines. This is a real complex data problem and a half. If you don't set up the architecture it may not work, explain. >> So, there are two big problems enterprises have today. One is trying to operationalize data science and machine learning at scale, the other one is getting to the cloud, but let's focus on the first one for a minute. The reason clients struggle to operationalize this at scale is because they start a data science project and they build a model for one discrete data set.
Problem is, that only applies to that data set; you can't pick it up and move it somewhere else. So this idea of data architecture, just to kind of follow through, whether it's the catalog or how you're managing your data across multiple clouds, becomes fundamental, because ultimately you want to be able to provide machine learning across all your data, because machine learning is about predictions, and it's hard to do really good predictions on a subset. But that pre-req is the need for an information architecture that comprehends the fact that you're going to build models and you want to train those models. As new data comes in, you want to keep the training process going. And that's the biggest challenge I see clients struggling with. So they'll have success with their first ML project, but then the next one becomes progressively harder, because now they're trying to use more data and they haven't prepared their architecture for that. >> Great point. Now, switching to data science. You've spoken many times with us on theCUBE about data science; we know you're passionate about it, you guys are doing a lot of work on that. We've observed, and Jim Kobielus and I were talking yesterday, there's still too much work on the data scientist's plate. They're still doing a lot of what I call sys admin-like work, not the right word, but like administrative building and wrangling. They're not doing enough data science, and there are enough proof points now to show that data science actually impacts business, whether it's the military having data intelligence to execute something, or selling something at the right time, or even for work or play or consumption; the proof is all out there. So why aren't we going faster, why aren't the data scientists more effective, what is it going to take for data science to have a seamless environment that works for them? They're still doing a lot of wrangling and they're still getting down in the weeds. Is that just the role they have, or how does it get easier for them? That's the big catch. >> That's not the role. So they're a victim of their architecture to some extent, and that's why they end up spending 80% of their time on data prep, data cleansing, that type of thing. Look, I think we've solved that. That's why, when we introduced the integrated analytic system this week, the whole idea was to get rid of all the data prep that you need, because you land the data in one place, and machine learning and data science are built into that. So everything that the data scientist struggles with today goes away. We can federate to data on cloud, on any cloud, we can federate to data that's sitting inside Hortonworks, so it looks like one system, but machine learning is built into it from the start. So we've eliminated the need for all of that data movement, for all that data wrangling, 'cause we've organized the data, we've built the catalog, and we've made it really simple. And so if you go back to the point I made, one issue is clients can't apply machine learning at scale, the other one is they're struggling to get to the cloud. I think we've nailed those problems, 'cause now, with a click of a button, you can scale this out to the cloud. >> All right, so how does the customer get their hands on this? Sounds like it's a great tool, you're saying it's leading edge. We'll take a look at it, certainly I'll do a review on it with the team, but how do I get it, how do I get a hold of this?
What do I do, download it, you guys supply it to me, is it some open source? How do your customers and potential customers engage with this product? >> However they want to, but I'll give you some examples. So, we have an analytic system built on Spark; you can bring the whole box into your data center and right away you're ready for data science. That's one way. Somebody like you, you're going to want to go get the containerized version; you go download it on the web and you'll be up and running instantly with a high-performing warehouse integrated with machine learning and data science, built on Spark, using Jupyter. Any developer can go use that and get value out of it. You can also say I want to run it on my desktop. >> And that's free? >> Yes. >> Okay. >> There's a trial version out there. >> That's the open source, yeah, that's the free version. >> There's also a version on public cloud, so if you don't want to download it, you want to run it outside your firewall, you can go run it on IBM Cloud, on the public cloud so... >> Just your cloud, Amazon? >> No, not today. >> John: Just IBM Cloud, okay, I got it. >> So there's a variety of ways that you can go use this, and I think what you'll find... >> But you have a freemium model that people can get started with, so they'll download it to their data center, is that also free? >> Yeah, absolutely. >> Okay, so all the base stuff is free. >> We also have a desktop version too so you can download... >> What URL can people look at this? >> Go to datascience.ibm.com, that's the best place to start a data science journey. >> Okay, multi-cloud, Common Cloud is what people are calling it, you guys have the Common SQL engine. What is this product, how does it relate to the whole multi-cloud trend? Customers are looking for multiple clouds. >> Yeah, so Common SQL is the idea of integrating data wherever it is, whatever form it's in, ANSI SQL compliant, so what you would expect for a SQL query and the type of response you get back, you get that back with Common SQL no matter where the data is. Now when you start thinking multi-cloud, you introduce a whole other bunch of factors. Network, latency, all those types of things, so what we talked about yesterday with the announcement of Hortonworks Dataplane, which is kind of extending the YARN environment across multi-clouds, that's something we can plug in to. So, I think, let's be honest, the multi-cloud world is still pretty early. >> John: Oh, really early. >> Our focus is delivery... >> I don't think it really exists actually. >> I think... >> It's multiple clouds, but no one's actually moving workloads across all the clouds, I haven't found any. >> Yeah, I think it's hard for latency reasons today. We're trying to deliver an outstanding... >> But people are saying, I mean, this is headroom, I get it, but people are saying, I'd love to have a preferred future of multi-cloud, even though they're kind of getting their own shops in order, retrenching, and re-platforming, but that's not a bad ask. I mean, I'm a user, I want to move: if I don't like IBM's cloud or I've got a better service, I can move around here. If Amazon is too expensive I want to move to IBM; you've got product differentiation, I might want to be in your cloud. So again, this is the customer's mindset, right? If you have something really compelling on your cloud, do I have to go all in on IBM Cloud to run my data? You shouldn't have to, right? >> I agree, yeah, I don't think any enterprise will go all in on one cloud.
I think it's delusional for people to think that, so you're going to have this world. So the reason, when we built IBM Cloud Private, we did it on Kubernetes, was we said that can be a substrate, if you will, that provides a level of standards across multiple cloud-type environments. >> John: And it's got some traction too, so it's a good bet there. >> Absolutely. >> Rob, final word, just talk about the personas who you now engage with from IBM's standpoint. I know you have a lot of great developer stuff going on, you've done some great work, you've got a free product out there, but you've still got to make money, you've got to provide value to IBM. Who are you selling to, what's the main thing? You've got multiple stakeholders, could you just clarify the stakeholders that you're serving in the marketplace? >> Yeah, I mean, the emerging stakeholders that we speak with more and more than we used to are chief marketing officers, who have real budgets for data and data science and are trying to change how they're performing their job. That's a major stakeholder, CTOs, CIOs, any C level, >> Chief data officer. >> Chief data officer. You know, chief data officers, honestly, it's a mixed bag. Some organizations, they're incredibly empowered and they're driving the strategy. Others, they're figureheads, and so you've got to know how the organization does it. >> A puppet for the CFO or something. >> Yeah, exactly. >> Or ops. >> A puppet? (chuckles) So, you got to, you know. >> Well, they're not really driving it, they're not changing it. It's not like they're mandated to go do something; they're maybe governance police or something. >> Yeah, and in some cases that's true. In other cases, they drive the data architecture, the data strategy, and that's somebody that we can engage with right away and help them out, so... >> Any events you've got going on? Things happening in the marketplace that people might want to participate in? I know you guys do a lot of stuff out in the open, events they can connect with IBM, things going on? >> So we do, so we're doing a big event here in New York on November first and second, where we're rolling out a lot of our new data products and cloud products, so that's one coming up pretty soon. The biggest thing we've changed this year is, there's such a craving from clients for education, as we've started doing what we're calling Analytics University, where we actually go to clients and we'll spend a day or two days, go really deep on open languages, open source. That's become kind of a new focus for us. >> A lot of re-skilling going on too with the transformation, right? >> Rob: Yes, absolutely. >> All right, Rob Thomas here, General Manager of IBM Analytics, inside theCUBE. CUBE alumni, breaking it down, giving his perspective. He's got two books out there, The Data Revolution was the first one. >> Big Data Revolution. >> Big Data Revolution, and the new one is Every Company is a Tech Company. Love that title, which is true; check it out on Amazon. Rob Thomas, Big Data Revolution, first book, and then second book is Every Company is a Tech Company. It's theCUBE live from New York. More coverage after the short break. (theCUBE jingle) (theCUBE jingle) (calm soothing music)
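A footnote on Rob's card-catalog analogy above: it translates almost directly into a data structure. The sketch below is our illustration, not IBM's catalog schema; every field, value, and dataset name is hypothetical. The point is that once entries like this exist, a policy change (the GDPR example in the interview) becomes a catalog query rather than an application rewrite.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogEntry:
    """One 'card' in a data catalog: what the data is, where it sits,
    who owns it, and how it may be treated (fields are illustrative)."""
    name: str
    location: str                 # where the data physically sits
    owner: str                    # accountable party, for governance
    classification: str           # e.g. "public" / "internal" / "restricted"
    retention_until: date         # when it goes to the archives/"microfilm"
    copies: list = field(default_factory=list)  # distinguish duplicate copies

catalog = {}

def register(entry: CatalogEntry) -> None:
    # Catalog before you ETL: you can't govern what you haven't itemized.
    catalog[entry.name] = entry

register(CatalogEntry(
    name="customer_transactions",
    location="hdfs://lake/raw/transactions",
    owner="finance-data-team",
    classification="restricted",
    retention_until=date(2020, 12, 31),
    copies=["warehouse.sales.transactions_v2"],
))

# When a new policy arrives, query the catalog for affected data
# instead of rewriting every application that touches it.
affected = [e.name for e in catalog.values() if e.classification == "restricted"]
print(affected)
```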

Published Date : Oct 2 2017


SENTIMENT ANALYSIS :

ENTITIES

Entity                            Category       Confidence
Jim Kobielus                      PERSON         0.99+
Peter Burris                      PERSON         0.99+
Amazon                            ORGANIZATION   0.99+
IBM                               ORGANIZATION   0.99+
John                              PERSON         0.99+
Rob Bearden                       PERSON         0.99+
Rob Thomas                        PERSON         0.99+
O'Reilly Media                    ORGANIZATION   0.99+
80%                               QUANTITY       0.99+
10                                QUANTITY       0.99+
New York                          LOCATION       0.99+
10 products                       QUANTITY       0.99+
O'Reilly                          ORGANIZATION   0.99+
two days                          QUANTITY       0.99+
first book                        QUANTITY       0.99+
two books                         QUANTITY       0.99+
a day                             QUANTITY       0.99+
Rob                               PERSON         0.99+
Today                             DATE           0.99+
yesterday                         DATE           0.99+
New York City                     LOCATION       0.99+
Hortonworks                       ORGANIZATION   0.99+
San Francisco Bay                 LOCATION       0.99+
five products                     QUANTITY       0.99+
second book                       QUANTITY       0.99+
IBM Analytics                     ORGANIZATION   0.99+
this week                         DATE           0.99+
SiliconANGLE Media                ORGANIZATION   0.99+
first                             QUANTITY       0.99+
first one                         QUANTITY       0.99+
theCUBE                           ORGANIZATION   0.99+
eight years                       QUANTITY       0.99+
Spark                             TITLE          0.99+
SQL                               TITLE          0.99+
Common SQL                        TITLE          0.98+
datascience.ibm.com               OTHER          0.98+
eighth year                       QUANTITY       0.98+
One                               QUANTITY       0.98+
one issue                         QUANTITY       0.97+
Hortonworks Dataplane             ORGANIZATION   0.97+
three platforms                   QUANTITY       0.97+
Strata Hadoop                     TITLE          0.97+
today                             DATE           0.97+
The Data Revolution               TITLE          0.97+
Cloudera                          ORGANIZATION   0.97+
second                            QUANTITY       0.96+
NYC                               LOCATION       0.96+
two big problems                  QUANTITY       0.96+
Analytics University              ORGANIZATION   0.96+
step two                          QUANTITY       0.96+
one way                           QUANTITY       0.96+
November first                    DATE           0.96+
Big Data Revolution               TITLE          0.95+
one                               QUANTITY       0.94+
Every Company is a Tech Company   TITLE          0.94+
CUBE                              ORGANIZATION   0.93+
this year                         DATE           0.93+
two different concepts            QUANTITY       0.92+
one system                        QUANTITY       0.92+
step one                          QUANTITY       0.92+

Panel Discussion | IBM Fast Track Your Data 2017


 

>> Narrator: Live, from Munich, Germany, it's theCUBE! Covering IBM Fast Track Your Data. Brought to you by IBM. >> Welcome to Munich everybody. This is a special presentation of theCUBE, Fast Track Your Data, brought to you by IBM. My name is Dave Vellante. And I'm here with my cohost, Jim Kobielus. Jim, good to see you. Really good to see you in Munich. >> Jim: I'm glad I made it. >> Thanks for being here. So last year Jim and I hosted a panel in New York City on theCUBE. And it was quite an experience. We had, I think it was nine or 10 data scientists, and we felt like that was a lot of people to organize and talk about data science. Well today, we're going to do a repeat of that, with a little bit of a twist on topics. And we've got five data scientists. We're here live, in Munich. And we're going to kick off the Fast Track Your Data event with this data science panel. So I'm going to now introduce some of the panelists, or all of the panelists. Then we'll get into the discussions. I'm going to start with Lillian Pierson. Lillian, thanks very much for being on the panel. You are in data science. You focus on training executives, students, and you're really a coach, but with a lot of data science expertise, based in Thailand, so welcome. >> Thank you, thank you so much for having me. >> Dave: You're very welcome. And so, I want to start with, sort of, when you focus on training people, data science, where do you start? >> Well it depends on the course that I'm teaching. But I try and start at the beginning, so for my Big Data course, I actually start back at the fundamental concepts and definitions they would even need to understand in order to understand the basics of what Big Data is, data engineering. So, terms like data governance. Going into the vocabulary that makes up the very introduction of the course, so that later on the students can really grasp the concepts I present to them. You know, I'm teaching a deep learning course as well, so in that case I start at a lot more advanced concepts. So it just really depends on the level of the course. >> Great, and we're going to come back to this topic of women in tech. But you know, we looked at some CUBE data the other day. Women make up about 17% of the technology industry. And so we're a little bit over that on our data science panel, we're about 20% today. So we'll come back to that topic. But I don't know if there's anything you would add? >> I'm really passionate about women in tech and women who code, in particular. And I'm connected with a lot of female programmers through Instagram. And we're supporting each other. So I'd love to take any questions you have on what we're doing in that space. At least as far as what's happening across the Instagram platform. >> Great, we'll circle back to that. All right, let me introduce Chris Penn. Chris, Boston based, all right, SMI. Chris is a marketing expert, really trying to help people understand how to turn data into value from a marketing perspective. It's a very important topic. Not only because we get people to buy stuff, but also understanding some of the risks associated with things like GDPR, which is coming up. So Chris, tell us a little bit about your background and your practice. >> So I actually started in IT and worked at a startup. And that's where I made the transition to marketing. Because marketing has much better parties. But what's really interesting about the way data science is infiltrating marketing is the technology came in first.
You know, everything went digital. And now we're at a point where there's so much data. And most marketers, they kind of got into marketing as sort of the arts and crafts field. And are realizing now they need a very strong mathematical, statistical background. So one of the things, Adam, the reason why we're here and IBM is helping out tremendously is, making a lot of the data more accessible to people who do not have a data science background and probably never will. >> Great, okay, thank you. I'm going to introduce Ronald Van Loon. Ronald, your practice is really all about helping people extract value out of data, driving competitive advantage, business advantage, or organizational excellence. Tell us a little bit about yourself, your background, and your practice. >> Basically, I have three different backgrounds. On one hand, I'm a director at a data consultancy firm called Adversitement, where we help companies to become data driven. Mainly large companies. I'm an advisory board member at Simply Learn, which is an e-learning platform, especially also for big data analytics. And on the other hand I'm a blogger and I host a series of webinars. >> Okay, great. Now Dez, Dez Blanchfield, I met you on Twitter, you know, probably a couple of years ago. We first really started to collaborate last year. We've spent a fair amount of time together. You are a data scientist, but you're also a jack of all trades. You've got a technology background. You sit on a number of boards. You work very actively with public policy. So tell us a little bit more about what you're doing these days, a little bit more about your background. >> Sure, I think my primary challenge these days is communication. Trying to join the dots between my technical background and deeply technical pedigree, to just plain English, everyday language, and business speak. So bridging that technical world with what's happening in the boardroom. Toe to toe with the geeks, to plain English, to execs in boards. And just hand-hold them and steward them through the journey of the challenges they're facing. Whether it's the enormous rate of change and the pace of change, that's just almost exhausting and causing them to sprint. But not just sprint in one race, but in multiple lanes at the same time. As well as some of the really big things that are coming up, that we've seen, like GDPR. So it's that communication challenge, and just hand-holding people through that journey, and that mix of technical and commercial experience. >> Great, thank you. And finally Joe Caserta, founder and president of Caserta Concepts. Joe, you're a practitioner. You're on the front lines, helping organizations, similar to Ronald, extracting value from data, translating that into competitive advantage. Tell us a little bit about what you're doing these days in Caserta Concepts. >> Thanks Dave, thanks for having me. Yeah, so Caserta's been around. I've been doing this for 30 years now. And the natural progression has been just getting more from application development, to data warehousing, to big data analytics, to data science. Very, very organically; that's just because it's where businesses need the help the most, over the years. And right now, the big focus is governance. At least in my world. Trying to govern when you have a bunch of disparate data coming from a bunch of systems that you have no control over, right? Like social media, and third party data systems. Bringing it in, and how do you organize it? How do you ingest it? How do you govern it?
How do you keep it safe? And also help to define ownership of the data within an organization, within an enterprise? That's also a very hot topic. Which ties back into GDPR. >> Great, okay, so we're going to be unpacking a lot of topics associated with the expertise that these individuals have. I'm going to bring Jim Kobielus into the conversation. Jim, the newest Wikibon analyst. And newest member of the SiliconANGLE Media team. Jim, get us started off. >> Yeah, so we're at an event, at an IBM event, where machine learning and data science are at the heart of it. There are really three core themes here: machine learning and data science on the one hand, unified governance on the other, and hybrid data management. I want to circle back or focus on machine learning. Machine learning is the coin of the realm right now in all things data. Machine learning is the heart of AI. Machine learning, everybody is hiring data scientists to do machine learning. I want to get a sense from our panel, who are experts in this area, what are the chief innovations and trends right now in machine learning. Not deep learning, the core of machine learning. What's super hot? What's hot in terms of new techniques, new technologies, new ways of organizing teams to build and to train machine learning models? I'd like to open it up. Let's just start with Lillian. What are your thoughts about trends in machine learning? What's really hot? >> It's funny that you excluded deep learning from the response for this, because I think the hottest space in machine learning is deep learning. And deep learning is machine learning. I see a lot of collaborative platforms coming out, where people, data scientists, are able to work together with other sorts of data professionals to reduce redundancies in workflows and create more efficient data science systems. >> Is there much uptake of these crowdsourcing environments for training machine learning models? Like CrowdFlower, or Amazon Mechanical Turk, or Mighty AI? Is that a huge trend in terms of the workflow of data science or machine learning, a lot of that? >> I don't see that crowdsourcing is like, okay, maybe I've been out of the crowdsourcing space for a while. But I was working with Standby Task Force back in 2013. And we were doing a lot of crowdsourcing. And I haven't seen that the industry has been increasing, but I could be wrong. I mean, because there's no, if you're building automation models, most of the, a lot of the work that's being crowdsourced could actually be automated if someone took the time to just build the scripts and build the models. And so I don't imagine that that's going to be a trend that's increasing. >> Well, automating the machine learning pipeline is fairly hot, in terms of, I'm seeing more and more research. Google's doing a fair amount of automated machine learning. The panel, what do you think about automation, in terms of the core modeling tasks involved in machine learning? Is that coming along? Are data scientists in danger of automating themselves out of a job? >> I don't think there's a risk of data scientists being put out of a job. Let's just put that out there. I do think we need to get a bit clearer about this meme of the mythical unicorn. But to your point about machine learning, I think what you'll see, we saw the cloud become baked into products, just as a given. I think machine learning has already crossed this threshold. We just haven't necessarily noticed or caught up.
And if we look at, we're at an IBM event, so let's just do a call out for them: the Data Science Experience platform, for example. Machine learning's built into a whole range of things around algorithms and data classification. And there's an assisted, guided model for how you get to certain steps, where you don't actually have to understand how machine learning works. You don't have to understand how the algorithms work. It shows you the different options you've got and you can choose them. So you might choose regression. And it'll give you different options on how to do that. So I think we've already crossed this threshold of baking in machine learning and baking in the data science tools. And we've seen that with cloud and other technologies where, you know, Office 365 is, you can't get a non-cloud Office 365 account, right? I think that's already happened in machine learning. What we're seeing, though, is organizations even as large as the Googles still in catch-up mode, in my view, on some of the shift that's taken place. So we've seen them write little games and apps where people do doodles and then it runs through the ML library and says, "Well that's a cow, or a unicorn, or a duck." And you get awards, and gold coins, and whatnot. But you know, as far back as 12 years ago I was working on a project where we had full-size airplanes acting as drones. And we mapped with 2-D and 3-D imagery, with 2-D high-res imagery and LiDAR for 3-D point clouds. We were finding poles and wires for utility companies, using ML before it even became a trend, and baking it right into the tools we used, stored on our web page and clicked and pointed on. >> To counter Lillian's point, it's not crowdsourcing but crowd sharing that's really powering a lot of the rapid leaps forward. If you look at, you know, DSX from IBM, or you look at Node-RED, there's a huge number of free workflows where someone has probably already done the thing that you are trying to do. Go out and find it in the libraries; through Jupyter and R notebooks, there's an ability-- >> Chris, can you define, before you go-- >> Chris: Sure. >> This is great, crowdsourcing versus crowd sharing. What's the distinction? >> Well, so crowdsourcing, kind of, in the context of the question you asked, is getting people to do stuff for me. It's like asking people to mine classifieds. Whereas crowd sharing, someone has done the thing already, it already exists. You're not purpose-building, saying, "Jim, help me build this thing." It's like, "Oh Jim, you already "built this thing, cool. "So can I fork it and make my own from it?" >> Okay, I see what you mean, keep going. >> And then, again, going back to earlier, in terms of the advancements, really deep learning, it probably is a good idea to just sort of define these things. Machine learning is how machines do things without being explicitly programmed to do them. Deep learning's like, if you can imagine a stack of pancakes, right? Each pancake is a type of machine learning algorithm. And your data is the syrup. You pour the data on it. It goes from layer, to layer, to layer, to layer, and what you end up with at the end is breakfast. That's the easiest analogy for what deep learning is. Now imagine a stack of pancakes, 500 or 1,000 high; that's where deep learning's going now. >> Sure, multi-layered machine learning models, essentially, that have the ability to do higher levels of abstraction, like image analysis. Lillian?
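Chris's pancake analogy maps one-to-one onto how deep networks are actually written: each layer transforms its input and hands the result to the next. The toy sketch below is ours, not anything shown by the panel; the layer sizes and random weights are invented, and it only demonstrates the mechanics of data "pouring" through the stack.

```python
import numpy as np

# Chris's pancake stack, literally: each "pancake" (layer) transforms
# its input and passes it to the next one.
rng = np.random.default_rng(0)

def pancake(x, w, b):
    # one layer: linear transform followed by a nonlinearity (ReLU)
    return np.maximum(0.0, x @ w + b)

x = rng.normal(size=(1, 8))                  # the syrup: input data
stack = [(0.1 * rng.normal(size=(8, 8)), np.zeros(8)) for _ in range(5)]

for w, b in stack:                           # a 5-pancake stack; deep
    x = pancake(x, w, b)                     # learning stacks hundreds

print(x)                                     # breakfast: the output
```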
>> I had a comment to add about automation and data science. Because there are a lot of tools that are able to, or applications that are able to, use data science algorithms and output results. But the reason that data scientists aren't at risk of losing their jobs is because just because you can get the result, you also have to be able to interpret it. Which means you have to understand it. And that involves deep math and statistical understanding, plus domain expertise. So, okay, great, you took out the coding element, but that doesn't mean you can codify a person's ability to understand and apply that insight. >> Dave: Joe, you have something to add? >> I could just add that I see the trend. Really, the reason we're talking about it today is machine learning is not necessarily, it's not new, like Dez was saying. But what's different is the accessibility of it now. It's just so easily accessible. All of the tools that are coming out for data have machine learning built into them. So the machine learning algorithms, which used to be a black art, you know, years ago, are now just very easily accessible. That you can get, it's part of everyone's toolbox. And the other reason that we're talking about it more is that data science is starting to become a core curriculum in higher education. Which is something that's new, right? That didn't exist 10 years ago. But over the past five years, I'd say, you know, it's becoming more and more easily accessible for education. So now, people understand it. And now we have it accessible in our tool sets. So now we can apply it. And I think those two things coming together is really making it become part of the standard of doing analytics. And I guess the last part is, once we can train the machines to start doing the analytics, right? And get smarter as they ingest more data. And then we can actually take that and embed it in our applications. That's the part where you still need data scientists, to create that. But once we can have standalone appliances that are intelligent, that's when we're going to start seeing, really, machine learning and artificial intelligence really start to take off even more. >> Dave: So I'd like to switch gears a little bit and bring Ronald on. >> Okay, yes. >> Here you go, there. >> Ronald, the bromide in this sort of big data world we live in is, the data is the new oil. You've got to be a data-driven company, and many other cliches. But when you talk to organizations and you start to peel the onion, you find that most companies really don't have a good way to connect data with business impact and business value. What are you seeing with your clients, and just generally in the community, with how companies are doing that? How should they do that? I mean, is that something that is a viable approach? You don't see accountants, for example, quantifying the value of data on a balance sheet. There are no standards for doing that. And so it's sort of this fuzzy concept. How are, and how should, organizations taking advantage of data and turning it into value? >> So, I think in general, if you look at how companies look at data, they have departments, and within the departments they have tools specific for that department. And what you see is that there's no central, let's say, data collection. There's no central management of governance. There's no central management of quality. There's no central management of security. Each department manages its data on its own. So if you then ask, on one hand, "Okay, how should they do it?"
It's basically: go back to the drawing board and say, "Okay, how should we do it?" We should collect the data centrally. And we should take care of central governance. We should take care of central data quality. We should take care of centrally managing this data. And look from a company perspective, and not from a department perspective, at what the value of data is. So, look at the perspective from your whole company. And this means that it has to be brought, on one end, to the C level, where most of them still fail to understand what it really means and what the impact can be for that company. >> It's a hard problem. Because data by its very nature is now so decentralized. But Chris, you have a-- >> The thing I want to add to that is, think about it in terms of valuing data. Look at what a data breach would cost you. Like, what is the expense of having your data compromised? If you don't have governance. If you don't have policy in place. Look at the major breaches of the last couple years. And how many billions of dollars those companies lost in market value, and trust, and all that stuff. That's one way you can value data very easily: "What will it cost us if we mess this up?" >> So a lot of CEOs will hear that and say, "Okay, I get it. "I have to spend to protect myself, "but I'd like to make a little money off of this data thing. "How do I do that?" >> Well, I like to think of it, you know, I think data's definitely an asset within an organization. And it is becoming more and more of an asset as the years go by. But data is still a raw material. And that's the way I think about it. In order to actually get the value, just like if you're creating any product, you start with raw materials and then you refine it. And then it becomes a product. For data, data is a raw material. You need to refine it. And then the insight is the product. And that's really where the value is. And the insight is absolutely, you can monetize your insight. >> So data is abundant, insights are scarce. >> Well, you know, actually you could say that intermediate between the insights and the data are the models themselves. The statistical, predictive, machine learning models, that are a crystallization of insights that have been gained by people called data scientists. What are your thoughts on that? Are statistical, predictive, machine learning models something, an asset, that companies, organizations, should manage governance of on a centralized basis or not? >> Well, the models are essentially the refinery system, right? So as you're refining your data, you need to have a process around exactly how you do that. Just like refining anything else. It needs to be controlled and it needs to be governed. And I think that data is no different from that. And I think that it's very undisciplined right now, in the market or in the industry. And I think maturing that discipline around data science is something that's going to be a very high focus this year and next. >> You were mentioning, "How do you make money from data?" Because there's all this risk associated with security breaches. But at the risk of sounding simplistic, you can generate revenue from system optimization, or from developing products and services. Using data to develop products and services that better meet the demands and requirements of your markets, so that you can sell more. So either you are using data to earn more money, or you're using data to optimize your system so you have less cost.
And that's a simple answer for how you're going to be making money from the data. But yes, there is always the counter to that, which is the security risk. >> Well, and my question really relates to, you know, when you think of talking to C-level executives, they kind of think about running the business, growing the business, and transforming the business. And a lot of times they can't fund these transformations. And so I would agree, there are many, many opportunities to monetize data, cut costs, increase revenue. But organizations seem to struggle to make a business case and actually implement that transformation. >> Dave, I'd love to have a crack at that. I think this conversation epitomizes the type of things that are happening in boardrooms and C-suites already. So we've really quickly dived into the detail of data. And the detail of machine learning. And the detail of data science, without actually stopping and taking a breath and saying, "Well, we've got lots of it, but what have we got? Where is it? What's the value of it? Is there any value in it at all? And how much time and money should we invest in it?" For example, we talk about data being a resource. I look at data as a utility. When I turn the tap on to get a drink of water, it's there as a utility. I count on it being there, but I don't always sample the quality of the water, and I probably should. It could have Giardia in it, right? But what's interesting is I trust the water at home, in Sydney. Because we have a fairly good experience with good quality water. If I were to go to some other nation, I probably wouldn't trust that water. And I think, when you think about what's happening in organizations, it's almost the same as what we're seeing here today. We're having a lot of fun, diving into the detail. But what we've forgotten to do is ask the question, "Well, why is data even important? What's the relevance to the business? Why are we in business? What are we doing as an organization? And where does data fit into that?" As opposed to becoming so fixated on data because it's a media-hyped topic. I think once you can wind that back a bit and say, "Well, we have lots of data, but is it good data? Is it quality data? Where's it coming from? Is it ours? Are we allowed to have it? What treatment are we allowed to give that data?" As you said, "Are we controlling it? And where are we controlling it? Who owns it?" There are so many questions to be asked. But the first question I like to ask people in plain English is, "Well, is there any value in data in the first place? What decisions are you making that data can help drive? What things in your organization's KPIs and milestones are you trying to meet that data might support?" So then instead of becoming fixated with data as a thing in itself, it becomes part of your DNA. Does that make sense? >> Think about what money means. The economists' rhyme: "Money is a matter of functions four, a medium, a measure, a standard, a store." So it's a medium of exchange. A measure of value. A standard. And a way to store value. Data, good clean data, well governed, fits all four of those. So if you're trying to figure out, "How do we make money out of stuff?" Figure out how money works. And then figure out how you map data to it. >> So when we approach and we start with a company, we always start with the business case, which is quite clear.
And a defined use case. Basically, start with a team: on one hand, marketing people, sales people, operational people, and also the whole data science team. So start with this case. It's basically like defining a movie. If you want to create the movie, you know where you're going. You know what you want to achieve to create the customer experience. And this is basically the same with a business case. Where you define, "This is the case. And this is how we're going to derive value: start with it and deliver value within a month." And after the month, you check, "Okay, where are we and how can we move forward? And what's the value that we've brought?" >> Now I, as well, start with the business case. I've done thousands of business cases in my life, with organizations. And unless that organization was kind of a data broker, the business case rarely had a discrete component around data. Is that changing, in your experience? >> Yes, so we guide companies to be data driven. So initially, indeed, they don't like to use the data. They don't like to use the analysis. So that's where we help. And is it changing? Yes, they understand that they need to change. But changing people is not always easy. So, you see, it's hard: if you're not involved and you're not guiding it, they fall back into doing the daily tasks. So it's changing, but it's a hard change. >> Well, and that's where this common parlance comes in. And Lillian, this is what you do for a living, helping people understand these things, as you've been sort of evangelizing that common parlance. But do you have anything to add? >> I wanted to add that for organizational implementations, another key component of success is to start small. Start in one small line of business. And then when you've mastered that area and made it successful, then try and deploy it in more areas of the business. And as far as initiating a big data implementation, that's generally how to do it successfully. >> There's the whole issue of putting a value on data as a discrete asset. Then there's the issue, how do you put a value on a data lake? Because a data lake is essentially an asset you build on spec. It's an exploratory archive, essentially, of all kinds of data that might yield some insights, but you have to have a team of data scientists doing exploration and modeling. But it's all on spec. How do you put a value on a data lake? And at what point does the data lake itself become a burden? Because you've got to store that data and manage it. At what point do you drain that lake? At what point do the costs of maintaining that lake outweigh the opportunity costs of not holding onto it? >> So each Hadoop node costs approximately $20,000 per year for storage. So I think that there needs to be a test and a diagnostic before even inputting, ingesting the data and storing it. "Is this actually going to be useful? What value do we plan to create from this?" Because really, you can't store all the data. And it's a lot cheaper to store data in Hadoop than it was in traditional systems, but it's definitely not free. So people need to be applying this test before even ingesting the data. Why do we need this? What business value? >> I think the question we need to also ask around this is, "Why are we building data lakes in the first place? What's the function it's going to perform for you?" There's been a huge drive to this idea. "We need a data lake. We need to put it all somewhere." But invariably they become data swamps.
And we only half-jokingly say that, because I've seen 90-day projects turn from a great idea into a really bad nightmare. And as Lillian said, it is cheaper in some ways to put it into an HDFS platform, in a technical sense. But when we look at all the fully burdened components, it's actually more expensive to find Hadoop specialists and Spark specialists to maintain that cluster. And invariably I'm finding that big data, quote unquote, is not actually so much lots of data, it's complex data. And as Lillian said, you don't always need to store it all. So I think if we go back to the question of, "What's the function of a data lake in the first place? Why are we building one?" and then start to build some fully burdened cost components around that, we'll quickly find that we don't actually need a data lake, per se. We just need an interim data store. So we might take last year's data and tokenize it, and analyze it, and do some analytics on it, and just keep the metadata. So I think there is this rush, for a whole range of reasons, particularly vendor-driven, to build data lakes because we think they're a necessity, when in reality they may just be an interim requirement, and we don't need to keep them for the long term. >> I'm going to attempt to take the last few questions and put them all together. And I think they all belong together, because one of the reasons why there's such hesitation about progress within the data world is because there's just so much accumulated tech debt already. Where there's a new idea. We go out and we build it. And six months, three years, it really depends on how big the idea is, millions of dollars is spent. And then by the time things are built, the idea is pretty much obsolete; no one really cares anymore. And I think what's exciting now is that the speed to value is just so much faster than it's ever been before. And I think, you know, what makes that possible is this concept of, I don't think of a data lake as a thing. I think of a data lake as an ecosystem. And that ecosystem has evolved so much more, probably in the last three years than it has in the past 30 years. And it's exciting times, because now once we have this ecosystem in place, if we have a new idea, we can actually do it in minutes, not years. And that's really the exciting part. And I think, you know, data lake versus data swamp comes back to just traditional data architecture. And if you architect your data lake right, you're going to have something that's substantial, that you're going to be able to harness and grow. If you don't do it right. If you just throw data. If you buy a Hadoop cluster or a Cloud platform and just throw your data out there and say, "We have a lake now," yeah, you're going to create a mess. And I think taking the time to really understand, you know, the new paradigm of data architecture and modern data engineering, and actually doing it in a very disciplined way. If you think about it, what we're doing is we're building laboratories. And if you have a shabby, poorly built laboratory, the best scientist in the world isn't going to be able to prove his theories. So if you have a well-built laboratory and a clean room, then, you know, a scientist can get what he needs done very, very, very efficiently. And that's the goal, I think, of data management today. >> I'd like to just quickly add that I totally agree with the challenge between on-premise and Cloud models. And I think one of the strong themes of today is going to be the hybrid data management challenge.
And I think organizations, some organizations, have rushed to adopt Cloud, thinking it's a really good place to dump the data and someone else has to manage the problem. And then they've ended up with a very expensive death by 1,000 cuts, in some senses. And then others have been very reluctant, and as a result haven't gotten access to rapidly moving and disruptive technology. So I think there's a really big challenge to get a basic conversation going around what's the value of using and adopting Cloud technology, versus what are the risks, and when's the right time to move. For example, should we cloud-burst for workloads? Do we move whole data sets in there? You know, moving half a petabyte of data into a Cloud platform and back is a non-trivial exercise. But moving a terabyte isn't actually that big a deal anymore. So, you know, should we keep stuff behind the firewalls? I'd be interested in seeing this week where 80% of the data supposedly is. And just push out for Cloud tools, machine learning, data science tools, whatever they might be, cognitive analytics, et cetera. And keep the bulk of the data on premise. Or should we just move whole spools into the Cloud? There is no one-size-fits-all. There's no silver bullet. Every organization has its own quirks and its own nuances it needs to think through and make a decision itself. >> Very often, Dez, organizations have zonal architectures, so you'll have a data lake that consists of a NoSQL platform that might be used for, say, mobile applications. A Hadoop platform that might be used for unstructured data refinement, so forth. A streaming platform, so forth and so on. And then you'll have machine learning models that are built and optimized for those different platforms. So, you know, think of it in terms, then, of your data lake as a set of zones that-- >> It gets even more complex, just playing on that theme, when you think about what Cisco started, called Fog Computing. I don't really like that term. But edge analytics, or computing at the edge. We've seen with the internet coming along where we couldn't deliver everything with a central data center. So we started creating this concept of content delivery networks, right? I think the same thing, I know the same thing, has happened in data analysis and data processing. Where we've been pulling social media out of the Cloud, per se, and bringing it back to a central source. And doing analytics on it. But think of something like, say for example, when the Dreamliner 787 from Boeing came out, this airplane created 1/2 a terabyte of data per flight. Now let's just do some quick, back-of-the-envelope math. There are 87,400 flights a day, just in the domestic airspace of the USA alone. Now 87,400 times 1/2 a terabyte, that's roughly 43.7 petabytes a day. You physically can't copy that from quote unquote in the Cloud, if you'll pardon the pun, back to the data center. So now we've got the challenge: a lot of our Enterprise data's behind a firewall, supposedly 80% of it. But what about what's out at the edge of the network? Where's the value in that data? So there are zonal challenges. Now what do I do with my Enterprise data versus the open data, the mobile data, the machine data? >> Yeah, we've seen some recent data from IDC that says, "About 43% of the data is going to stay at the edge." We think that that's way understated, just given the examples. We think it's closer to 90% that's going to stay at the edge. >> Just on the airplane topic, right? So Airbus wasn't going to be outdone.
Boeing put 4,000 sensors or something in their 787 Dreamliner six years ago. Airbus just announced an A380-1000, with 10,000 sensors in it. Do the same math. Now the FAA in the US said that all aircraft and all carriers have to be, by early next year, I think it's like March or April next year, at the same level of capability of data collection and so forth. It's kind of like a mini GDPR for airlines. So with those 10,000 sensors, that becomes 2.5 terabytes per flight. If you do the math, it's about 220 petabytes of data in just one day's traffic, domestically in the US. Now, it's just so mind-boggling that we're going to have to completely turn our thinking on its head: what do we do behind the firewall? What do we do in the Cloud versus what we might have to do in the airplane? I mean, think about edge analytics in the airplane processing data, as you said, Jim, streaming analytics in flight. >> Yeah, that's a big topic within Wikibon, within the team, me and David Floyer and my other colleagues. We're talking about the whole notion of edge architecture. Not only will most of the data be persisted at the edge, most of the deep learning models, like TensorFlow, will be executed at the edge. To some degree, the training of those models will happen in the Cloud. But much of that will be pushed in a federated fashion to the edge, or at least that's what I'm predicting. We're already seeing some industry moves in that direction, in terms of architectures. Google has a federated training project, or initiative. >> Chris: Look at TensorFlow Lite. >> Which is really fascinating, because it's geared to IoT. I'm sorry, go ahead. >> Look at TensorFlow Lite. I mean, the announcement of having every Android device have ML capabilities is Google's essential acknowledgment: "We can't do it all." So they need, essentially, sort of like SETI@home, everyone's smartphone and set-top TV box, just to help with the processing. >> Now we're talking about this; this sort of leads to the IoT discussion, but I want to underscore the operating model. As you were saying, you can't just lift and shift to the Cloud. CEOs aren't going to get the billion-dollar hit by just doing that. So you've got to change the operating model. And that leads to this discussion of IoT. And an entirely new operating model. >> Well, there are companies like Sisense who have worked with Intel. And they've taken this concept. They've taken the business logic and are not just putting it in the chip, but actually putting it in memory, in the chip. So as data's going through the chip, it's not just being processed but it's actually being baked in memory. So level one, two, and three cache. Now this is a game changer. Because as Chris was saying, even if we were to get the data back to a central location, there's the compute load. I saw a real interesting thing from, I think it was Google, the other day; one of the guys was doing a talk. And he spoke about what it meant to add cognitive and voice processing into just the Android platform. And they used some number, like it would double the amount of compute they had, just to add voice, for free, to the Android platform. Now even for Google, that's a non-trivial exercise. So as Chris was saying, I think we have to, again, flip it on its head and say, "How much can we put at the edge of the network?" Because think about these phones. I mean, even your fridge and microwave, right?
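The aircraft numbers quoted in this exchange can be sanity-checked in a few lines. A quick sketch follows; the flight count and per-flight volumes are the panelists' own back-of-the-envelope figures, not verified statistics, and the decimal unit conversion simply mirrors their math.

```python
# Quick check of the panel's back-of-the-envelope flight-data arithmetic.
# All inputs are the speakers' quoted figures, not verified statistics.
FLIGHTS_PER_DAY = 87_400      # quoted US domestic flights per day
TB_PER_PB = 1_000             # decimal units, matching the panel's math

b787_pb_per_day = FLIGHTS_PER_DAY * 0.5 / TB_PER_PB   # 0.5 TB per 787 flight
a380_pb_per_day = FLIGHTS_PER_DAY * 2.5 / TB_PER_PB   # 2.5 TB per flight, newer airframe

print(f"787-class fleet:   {b787_pb_per_day:.1f} PB/day")   # ~43.7 PB/day
print(f"Newer airframes:   {a380_pb_per_day:.1f} PB/day")   # ~218.5 PB/day, the "220 petabytes" quoted
```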
We put a man on the moon with something that, these days, we make for $89, at home, on a Raspberry Pi computer, right? And even that is 1,000 times more powerful. When we start looking at what's going into the chips, we've seen people build new, not even GPUs, but deep learning and stream analytics capable chips. Like Google, for example. That's going to make its way into consumer products. So the compute capacity in phones is going to, I think, transmogrify in some ways, because there is some magic in there. To the point where, as Chris was saying, we're going to have the smarts in our phone. And a lot of that workload is going to move closer to us. And only the metadata that we need to move is going to go centrally. >> Well, here's the thing. The edge isn't the technology. The edge is actually the people. Look at, for example, the MIT language Scratch. This is a kids' programming language. It's drag and drop. You know, kids can assemble really fun animations and make little movies. We're training them to build for IoT. Because if you look at a system like Node-RED, it's a drag-and-drop IBM interface for building IoT workflows. And you can push that to a device. Scratch has a converter for doing those. So the edge is those thousands and millions of kids who are learning how to code, learning how to think architecturally and algorithmically. What they're going to create is beyond what any of us can possibly imagine. >> I'd like to add one other thing, as well. I think there's a topic we've got to start tabling. And that is what I refer to as the gravity of data. So think about how planets are formed, right? Particles of dust accrete. They form into planets. Planets develop gravity. And the reason we're not flying into space right now is that there's gravitational force. Even though it's one of the weakest forces, it keeps us on our feet. Oftentimes in organizations, I ask them to start thinking about, "Where is the center of your universe with regard to the gravity of data?" Because if you can find the center of your universe and the gravity of your data, you can often, as Chris is saying, find where the business logic needs to be. And it could be that you've got to think about a storage problem. You can think about a compute problem. You can think about a streaming analytics problem. But if you can find where the center of your universe and the center of gravity for your data is, often you can get a really good insight into where to start focusing: where the workloads are going to be, where the smarts are going to be. Whether it's small, medium, or large. >> So this brings up the topic of data governance. One of the themes here at Fast Track Your Data is GDPR. What it means. It's one of the reasons, I think, IBM selected Europe generally, and Munich specifically. So let's talk about GDPR. We had a really interesting discussion last night. So let's kind of recreate some of that. I'd like somebody on the panel to start with: what is GDPR? And why does it matter? Ronald? >> Yeah, maybe I can start. Maybe a little bit more in general about unified governance. So if I talk to companies and I need to explain to them what governance is, I basically compare it with a crime scene. So in a crime scene, if something happens, they start with securing all the evidence. So they start sealing the environment. And take care that all the evidence is collected. And on the other hand, you see that they need to protect this evidence.
There are all kinds of policies. There are all kinds of procedures. There are all kinds of rules that need to be followed, to take care that the whole body of evidence is well secured. And once you start, basically, investigating. So you have the crime scene investigators. You have the research lab. You have all different kinds of people. They need to have consent before they can use all this evidence. And the whole reason why they're doing this is, on the one hand, to catch the villain, the crook, and on the other hand, once he's caught, to convict him. And we do this to have trust in the materials. Or trust in, basically, the analytics. And on the other hand, so the public has trust in everything that happens with the data. So if you look at a company, where data is basically the evidence, this is the value of your data. It's similar to the evidence within a crime scene. But most companies don't treat it like this. So if we then look at GDPR, GDPR basically shifts the power and the ownership of the data from the company to the person that created it. Which is often, let's say, the consumer. And there's a lot of paradox in this. Because all the companies say, "We need to have this customer data, because we need to improve the customer experience." So let's make it concrete: say it's the 1st of June 2018, so GDPR is active. And I use iTunes. So I go to iTunes and say, "Okay, Apple, please give me access to my data. I want to see what kind of personal information you have stored for me." On the other hand, I have the right to rectify all this data. I want to be able to change it and give them a different level of how they can use my data. So I ask this of iTunes. And then I say to them, "Okay, I basically don't like you anymore. I want to go to Spotify. So please transfer all my personal data to Spotify." So that's possible once it's June 2018. Then I go back to iTunes and say, "Okay, I don't like it anymore. Please revoke my consent. I withdraw my consent. And I want you to remove all my personal data from everything that you use." And I go to Spotify and I give them, let's say, consent for using my data. So this is a shift where you can, as a person, be the owner of the data. And this has a lot of consequences, of course, for organizations and how they manage this. So it's quite simple for the consumer. They get the power; it's maturing the whole legal system. But it's a big consequence, of course, for organizations. >> This is going to be a nightmare for marketers. But fill in some of the gaps there. >> Let's go back. So GDPR, the General Data Protection Regulation, was passed by the EU in May of 2016. It is, as Ronald was saying, four basic things. The right to privacy. The right to be forgotten. Privacy built into systems by default. And the right to data transfer. >> Joe: It takes effect next year. >> It is already in effect. GDPR took effect in May of 2016. The enforcement penalties take effect the 25th of May 2018. Now here's where there are two things on the penalty side that are important for everyone to know. Number one, GDPR is extraterritorial. Which means that for an EU citizen, anywhere on the planet, GDPR goes with them. So say you're a pizza shop in Nebraska. And an EU citizen walks in, orders a pizza. Gives you her credit card and stuff like that. If you, for some reason, store that data, GDPR now applies to you, Mr. Pizza Shop, whether or not you do business in the EU.
Because an EU citizen's data is with you. Two, the penalties are much stiffer than they ever have been. In the old days companies could simply write off penalties, saying, "That's the cost of doing business." With GDPR the penalties are up to 4% of your annual revenue or 20 million Euros, whichever is greater. And there may be criminal sanctions, charges, against key company executives. So there are a lot of questions about how this is going to be implemented. But one of the first impacts you'll see, from a marketing perspective, is on all the advertising we do targeting people by their age, by their personally identifiable information, by their demographics. Between now and May 25th 2018, a good chunk of that may have to go away, because there's no way for you to say, "Well, this person's an EU citizen, this person's not." People give false information all the time online. So how do you differentiate it? Every company, regardless of whether they're in the EU or not, will have to adapt to it, or deal with the penalties. >> So Lillian, as a consumer, this is designed to protect you. But you had a very negative perception of this regulation. >> I've looked over the GDPR, and to me it actually looks like a socialist agenda. It looks like (panel laughs) no, it looks like a full assault on free enterprise and capitalism. And on its face, from a legal perspective, it's completely and wholly unenforceable. Because they're assigning jurisdictional rights to the citizen. But what are they going to do? They're going to go to Nebraska and they're going to call in the guy from the pizza shop? And call him into what court? The EU court? It's unenforceable from a legal perspective. And if you write a law, it's got to be enforceable in every element. It can't be just, "Oh, we're only going to enforce it for Facebook and for Google, but it's not enforceable for others." It needs to be written so that it's a complete and actionable law. And it's not written in that way. And from a technological perspective, it's not implementable. I think you said something like 652 EU regulators or political people voted for this and 10 voted against it. But what do they know about actually implementing it? Is it possible? There are all sorts of regulations out there that aren't possible to implement. I come from an environmental engineering background. And it's absolutely ridiculous, because these agencies will pass laws that it's actually not possible to implement in practice. The cost would be too great. And it's not even needed. So I don't know, I just saw this and I thought, "You know, if the EU wants to..." What they're essentially trying to do is regulate what the rest of the world does on the internet. If they want to, they can build their own internet like China has and police it the way that they want to. But Ronald here made an analogy between data, and free enterprise, and a crime scene. Now to me, that's absolutely ridiculous. What do data and someone signing up for an email list have to do with a crime scene? If the EU wants to make it that way, they can police their own internet. But they can't go across the world. They can't go to Singapore and tell Singapore, or go to the pizza shop in Nebraska and tell them, how to run their business. >> You know, EU overreach in the post-Brexit era; what you're saying has a lot of validity. How far can the tentacles of the EU reach into other sovereign nations? >> What court are they going to call them into? >> Yeah. >> I'd like to weigh in on this.
There are lots of unknowns, right? So I'd like us to focus on the things we do know. We've already dealt with similar situations before. In Australia, we introduced a goods and services tax. Completely foreign concept. Everything you bought had 10% on it. No one knew how to deal with this. It was a completely new practice in accounting. There was a whole bunch of new software that had to be written. MYOB had to have new capability, but we coped. No one has actually gone to jail yet, decades later, for not complying with GST. So what it was, was a framework for how to shift from non-sales-tax-related revenue collection to sales-tax-related revenue collection. I agree that there are some egregious things built into this. I don't disagree with that at all. But I think if I put my slightly broader view-of-the-world hat on, we have well and truly gone past the point, in my mind, where data was respected, where data was treated in a sensible way. I mean, I get emails from companies I've never done business with. And when I follow it up, it's because I did business with a credit card company, that gave it to a service provider, that thought, when I bought a holiday to come to Europe, that I might want travel insurance. Now some might say there's value in that. And others say there's not; there's the debate. But let's just focus on what we're talking about. We're talking about a framework for governance of the treatment of data. If we remove all the emotive components, what we are talking about is a series of guidelines, backed by laws, that say, "We would like you to do this," in an ideal world. But I don't think anyone's going to go to jail on day one. They may go to jail on day 180, if they continue to do nothing about it. So they're asking you to sort of sit up and pay attention. Do something about it. There's a whole bunch of relief around how you approach it. The big thing for me is, there's no get-out-of-jail card, right? There is no get-out-of-jail card for not complying. But there's plenty of support. I mean, we're going to have ambulance chasers everywhere. We're going to have class actions. We're going to have individual suits. The greatest thing to do right now is get into GDPR law. You think data scientists are unicorns? >> What kind of life is that, if there are ambulance chasers everywhere? You want to live like that? >> Well, I think we've seen ad blocking. I use ad blocking as an example, right? A lot of organizations with advertising broke the internet by just throwing too much content on pages, to the point where they're just unusable. And so we had this response with ad blocking. I think in many ways, GDPR is a regional response to a situation. I don't think it's the exact right answer. But it's the next evolutionary step. We'll see things evolve over time. >> It's funny you mention it, because in the United States one of the things that has happened is that, with the change in political administrations, the regulations on what companies can do with your data have actually been relaxed, to the point where, for example, your internet service provider can resell your browsing history, with or without your consent. Or your consent's probably buried in there, on page 47. And so GDPR is kind of a response saying, "You know what? You guys over there across the Atlantic are kind of doing some fairly irresponsible things with what you allow companies to do."
Now, to Lillian's point, no one's probably going to go after the pizza shop in Nebraska, because they don't do business in the EU. They don't have an EU presence. And it's unlikely that an EU regulator's going to get on a plane from Brussels and fly to Topeka, or Omaha, sorry, and say, "Come on, Joe, let's get the pizza shop in order here." But companies, particularly Cloud companies, that have offices and operations within the EU have to sit up and pay attention. So if you have any kind of EU operations, or any kind of fiscal presence in the EU, you need to get on board. >> But to Lillian's point, it becomes a boondoggle for lawyers in the EU who want to go after deep-pocketed companies like Facebook and Google. >> What's the value in that? It seems like regulators are just trying to create work for themselves. >> What about the things that, say, advertisers can do, not so much with the data that they have, but with the data that they don't have? In other words, they have people called data scientists who build models that can do inference on sparse data. And do amazing things in terms of personalization. What do you do about all those gray areas, where you've got machine learning models and so forth? >> But it applies-- >> It applies to personally identifiable information. But if you have a talented enough data scientist, you don't need the PII or even the inferred characteristics. If a certain type of behavior happens on your website, for example, and this path of 17 pages almost always leads to a conversion, it doesn't matter who you are or where you're coming from. If you're a good enough data scientist, you can build a model that will track that. >> Like, you know, Target inferred some young woman was pregnant. And they inferred correctly, even though that was never divulged. I mean, there are all those gray areas; how can you stop that slippery slope? >> Well, I'm going to weigh in really quickly. A really interesting experiment for people to do: when people get very emotional about it, I say to them, "Go to Google.com, view source, put it in seven-point Courier font in Word, and count how many pages it is." I guess you can't guess how many pages? It's 52 pages of seven-point Courier font HTML to render one logo, and a search field, and a click button. Now why do we need 52 pages of HTML source code and JavaScript just to take a search query? Think about what's being done in that. It's effectively a mini operating system, to figure out who you are, and what you're doing, and where you've been. Now is that a good or a bad thing? I don't know, I'm not going to make a judgment call. But what I'm saying is, we need to stop and take a deep breath and say, "Does anybody need a 52-page home page to take a search query?" Because that's just the tip of the iceberg. >> To that point, I like the results that Google gives me. That's why I use Google and not Bing. Because I get better search results. So, yeah, I don't mind if you mine my personal data and give me, ah, Facebook ads. I saw in your article that GDPR is going to take out targeted advertising. The only ads in the entire world that I like are Facebook ads. Because I actually see products I'm interested in. And I'm happy to learn about that. I think, "Oh, I want to research that. I want to see this new line of products, and what are their competitors?" And I like the targeted advertising. I like the targeted search results, because it's giving me more of the information that I'm actually interested in.
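Chris's claim a moment earlier, that on-site behavior alone can predict a conversion with no PII involved, is easy to sketch. The toy model below uses only anonymous session features; the feature names, the tiny data set, and the "17-page path" signal are illustrative assumptions, not a production recipe.

```python
# Minimal sketch of a conversion model built purely from on-site behavior,
# with no personally identifiable information. Data and feature names are
# illustrative assumptions.
from sklearn.linear_model import LogisticRegression

# Each session: [pages_viewed, seconds_on_site, saw_pricing_page]
sessions = [
    [3, 40, 0],
    [17, 600, 1],   # the "path of 17 pages" style of session
    [2, 15, 0],
    [15, 480, 1],
    [5, 90, 0],
    [16, 520, 1],
]
converted = [0, 1, 0, 1, 0, 1]   # did the session end in a conversion?

model = LogisticRegression().fit(sessions, converted)

# Score a new anonymous session; nothing here identifies the visitor.
print(model.predict_proba([[14, 450, 1]])[0][1])   # estimated conversion probability
```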
>> And that's exactly what it's about. You can still decide, yourself, if you want to have this targeted advertising. If not, then you don't give consent. If you like it, you give consent. So if a company gives you value, you give consent back. So it's not that it's restricting everything. It's about consent. And I think it's similar, the same type of response, to what happened with Mad Cow Disease here in Europe, where you had the whole food chain that needed to be tracked. And everybody said, "No, it's not required." But now it's implemented. Everybody in Europe does it. So it's the same thing that's probably going to happen over here as well. >> So what does GDPR mean for data scientists? >> I think GDPR is, I think it is needed. I think one of the things that may be slowing data science down is fear. People are afraid to share their data. Because they don't know what's going to be done with it. There are some guidelines around it that should be enforced, and I think, as has been said, as long as a company can prove that it's doing due diligence to protect your data, no one is going to go to jail. I think, you know, we referenced a crime scene; if there's a heinous crime being committed, all right, then it's going to become obvious. And then you do go directly to jail. But I think having guidelines and even laws around privacy and protection of data is not necessarily a bad thing. You can do a lot of really meaningful data science without knowing that it's Joe Caserta. All of the demographics about me, all of the characteristics about me as a human being, I think are still on the table. All they're saying is that you can't go after Joe, himself, directly. And I think that's okay. You know, there are still a lot of things. We could still cure diseases without knowing that I'm Joe Caserta, right? As long as you know everything else about me. And I think that's really at the core; that's what we're trying to do. We're trying to protect the individual and the individual's data about themselves. But as far as how it affects data science, you know, a lot of our clients are afraid to implement things because they don't exactly understand what the guideline is. And they don't want to go to jail. So they wind up doing nothing. So now we have something in writing that, at least, is something that we can work towards, and I think that's a good thing. >> In many ways, organizations are suffering from the deer-in-the-headlights problem. They don't understand it. And so they just end up frozen in the headlights. But I just want to go back one step, if I could. We could get really excited about what it is and is not. But for me, the most critical thing to remember there is that data breaches are happening. There are over 1,400 data breaches, on average, per day. And most of them are not trivial. And we saw 1/2 a billion from Yahoo. And then one point one billion, and then one point five billion. I mean, think about what that actually means. There were 47,500 MongoDBs breached in an 18-hour window, after an automated attack. And they were airlines, they were banks, they were police stations. They were hospitals. So when I think about frameworks like GDPR, I'm less worried about whether I'm going to see ads and be sold stuff. I'm more worried about, and I'll give you one example, my 12-year-old son has an account at a platform called Edmodo.
Now I'm not going to pick on that brand for any reason, but it's a current issue. Something like, I think it was, 19 million children in the world had their username, password, email address, home address, and all their social interaction on this Facebook-for-kids platform called Edmodo breached in one night. Now I got my hands on a copy. And everything about my son is there. Now I have a major issue with that. Because I can't do anything to undo that, nothing. The fact that I was able to get a copy, within hours, on a dark website, for free. The fact that his first name, last name, email, mobile phone number, all these personal messages from friends, are out there. Nobody has the right to allow that breach of my son's data. Or your children's, or our children's. For me, GDPR is a framework for us to try and behave better about really big issues. Whether it's a socialist issue, whether someone's got an issue with advertising, I'm actually not interested in that at all. What I'm interested in is that companies need to behave much better about the treatment of data when it's this type of data that's being breached. And I get really emotional when it's my son, or someone else's child. Because I don't care if my bank account gets hacked. Because they hedge that. They underwrite and insure themselves, and the money arrives back in my bank. But when it's my wife who donated blood, and a blood donor website got breached and her details got lost. Even things like sexual preferences, that they ask questions on, are out there. My 12-year-old son is out there. Nobody has the right to allow that to happen. For me, GDPR is the framework for us to focus on that. >> Dave: Lillian, is there a comment you have? >> Yeah, I think that security concerns are 100% real and definitely a serious issue. Security needs to be addressed. And I think a lot of the stuff that's happening is due to, I think, we need better security personnel. I think we need better people working in the security area, where they're actually looking at and securing things. Because I don't think you can regulate it. I just wanted to take the microphone back when you were talking about taking someone to jail. Okay, I have a background in law. And if you look at this, you guys are calling it a framework. But it's not a framework. What they're trying to do is take 4% of your business revenues per infraction. They want to say, "If a person signs up on your email list and you didn't, like, necessarily give whatever disclaimer the EU said you need to give, then per infraction, we're going to take 4% of your business revenue." That's a law that they're trying to put into place. And you guys are talking about taking people to jail. What jail? The EU is not a country. What jurisdiction do they have? Like, you're going to take pizza man Joe and put him in the EU jail? Is there an EU jail? Are you going to take them to a UN jail? I mean, just on its face, it doesn't hold up to legal tests. I don't understand how they could enforce this. >> I'd like to just answer the question on-- >> Security is a serious issue. I would be extremely upset if I were you. >> I personally know people who work for companies who've had data breaches. And I respect them all. They're really smart people. They've got 25-plus years in security. And they are shocked that they've allowed a breach to take place. What they've invariably all agreed on is that a whole range of drivers caused them to get to a bad practice. So then, for example, the donate-blood website.
The young person who was a sysadmin, with all the right skills and all the right experience, just made a basic mistake. They took a DB dump of a MySQL database before they upgraded their WordPress website for the business. And they happened to leave it in a folder that was indexable by Google. And so somebody wrote a regular expression to search in Google to find SQL backups. Now this person, I personally respect them. I think they're an amazing practitioner. They just made a mistake. So what does that bring us back to? It brings us back to the point that we need a safety net, or a framework, or whatever you want to call it, where organizations have checks and balances no matter what they do. Whether it's an upgrade, a backup, a modification, you know. And they all think they do, but invariably, as we've seen from the hundreds of thousands of breaches, they don't. Now on the point of law, we could debate that all day. I mean, the EU does have a remit. If I was caught speeding in Germany, as an Australian, I would be thrown into a German jail. If I got caught as an organization in France, breaching GDPR, I would be held accountable to the law in that region, by the organization pursuing me. So I think it's a bit of a misnomer saying I can't go to an EU jail. I don't disagree with you totally, but I think it's regional. If I get a speeding fine and break the law by driving fast in the EU, it's in the country, in the region, that I'm caught. And I think GDPR is going to be enforced in that same way. >> All right, folks, unfortunately the 60 minutes flew right by. And it does when you have great guests like yourselves. So thank you very much for joining this panel today. And we have an action-packed day here. So we're going to cut over. The CUBE is going to have its interview format starting in about 1/2 hour. And then we cut over to the main tent. Who's on the main tent? Dez, you're doing a main stage presentation today: Data Science is a Team Sport. Hilary Mason has a breakout session. We also have a breakout session on GDPR and what it means for you. Are you ready for GDPR? Check out ibmgo.com. It's all free content, it's all open. You do have to sign in to see the Hilary Mason and the GDPR sessions. And we'll be back in about 1/2 hour with the CUBE. We'll be running replays all day on SiliconANGLE.tv and also ibmgo.com. So thanks for watching, everybody. Keep it right there; we'll be back in about 1/2 hour with the CUBE interviews. We're live from Munich, Germany, at Fast Track Your Data. This is Dave Vellante with Jim Kobielus; we'll see you shortly. (electronic music)

Published Date : Jun 24 2017

Jerry Chen, Greylock - DockerCon 2017 - #theCUBE - #DockerCon


 

>> Announcer: From Austin, Texas, it's theCUBE covering DockerCon 2017. Brought to you by Docker and support from its ecosystem partners. (techno music) >> Welcome back. Hi, I'm Stu Miniman, joined with Jim Kobielus. You're watching theCUBE, SiliconANGLE Media's production of DockerCon 2017. We're the worldwide leader in live enterprise tech coverage. And we can't finish any DockerCon without having Jerry Chen on. So, Jerry, partner with Greylock, always a pleasure to interview you. We've had you on the Amazon shows a lot, Docker, other ecosystem shows, so, great to see ya. >> Stu, Jim. Hey, thanks for having me, as always. It's great to be here. >> Alright, so first of all, I mean, you invested back in the dotCloud days. Could you imagine, when you were meeting with Solomon and those guys, that we'd be here with 5,500 people, as far as where they'd go? What's your take on the growth? >> Every year just blows my mind, both in open-source community developers, ecosystem partners, and more recently, in the past year and a half, the enterprise customers that take Docker seriously, that replatform applications on Docker. It amazes me. I think I did an investment in 2013, and there were a few hundred thousand downloads of Docker; now there are billions and billions of containers being pulled. When I talked to the CIOs that I deal with frequently, they were like, "Docker containers, what is this thing, pants?" And then, (laughter) three and a half, four years later, I can't have a conversation with a Fortune 500 CIO without talking about their Docker container strategy. >> By the way, I hear if you do send back a belt or something that's broken to the Docker people, they'll fix it for you, and maybe send some whale stickers. >> It's like the old-school Nordstrom where they take any return. There's the urban legend of the four tires returned to Nordstrom; return some pants, you'll be fine. >> You know, we work on container strategy, but we're also your repair shop for, you know, men's apparel. So, it's always interesting to look at-- >> Jim: Integration fabric. >> Brilliant. You know, the maturation of technology, of ecosystem, of monetization. You talked about the growth of the containers. We've seen the ecosystem. It's gone through some fits and spurts and changes over the last couple of years. I think it was really well received this week. And then there's the money maturation and how they mature that. What do you see? How does open source fit into your investment strategy, and any commentary on Docker and beyond? >> I was thinking about this on the flight over here today. Open source today is very different than open source five years ago, 10 years ago, or 15. So what Red Hat did 20 years ago is very different than what Xen tried to do 10 years ago, when I was at VMware, and very different from what Docker is doing today. And it's different in a couple of ways. I think the way you monetize is different. Because you have cloud, and cloud changes things. The ecosystem's very different, because all of a sudden the developers, the contributors, are not just kind of your misfits and rebels working on the weekends. They are Fortune 100, Fortune 500 companies. Their jobs are now dedicated to this. And then the business models, the developer ecosystem, how you work with them, are very different. So before, you had maybe one or two models to make money in open source. Or one or two ways to develop a community. We did that at Red Hat, which Greylock was lucky enough to be investors in years ago.
I was at VMware around Cloud Foundry; we built that. We had a model like that; we had SpringSource as well. And what you've seen from Docker in the past three or four years is that they're really pioneering a way to bring open source and a community ecosystem into the next 10-20 years. So I think it's one to watch. I think Solomon's probably as good as anybody at understanding what developers need. >> So a little broader, what are your thoughts on developers today? You actually made the comment coming over, there are two big developer shows this week. You've got F8 and you've got DockerCon, two very different communities. >> Right, it's kind of funny. There's always this sense of, do you consider yourself a developer? So if I write a line of JavaScript, am I a developer? My two cents is yes. If I'm a developer, from JavaScript to Swift to Docker to cURL hacking, it's all great. But if you look at those two conferences, you have F8 going on right now, and the announcements there around augmented reality and messaging, and it's trying to be a platform, but they're doing many of the same things. You have a distribution platform, be it Messenger or Facebook, and they're open sourcing technologies around the camera, the lens, the filters, to have developers a) go through the channel, b) add apps or widgets. It's really beyond my ability to comprehend these filters, but Docker today announced a couple of great projects, Moby and LinuxKit, in much the same way: trying to give tools to the ecosystem developers to build what they want. I think what you've learned is, if you give developers the building blocks, the "Legos" as they call it today, they're going to build some awesome structures. >> Jim and I talked, coming in here, about the role of data science and how it fits in with developers, and developer is such a broad term, as to what we have here. >> One of the core themes I have is that the data scientist is the nucleus of the next generation developer, because much of the IP that's being built into the applications now is statistical models and machine learning and so forth, driving recommendations, and much of that development is being containerized using new toolkits and so forth. But it needs to be more containerized, so you can deploy statistical predictive models and machine learning, deep learning, porting them through the streaming ecosystem into a hybrid cloud to perform various functions. >> Right now, in most companies, there's a data engineer, there's a data scientist, and the two typically work hand in hand. >> Jim: One manages Hadoop, the other one does the modeling. >> Does the modeling. So one speaks R and Python and works in Jupyter Notebook; the other person runs Hadoop or a database or Redis. The two need to work together, and so what you're seeing now, and obviously we're investors in Cloudera, that's another great open source company, what you see now is either a) a set of tools and technologies to blend the two together, to either enable engineers to be more like data scientists or enable data scientists to be more like engineers, but you also see a bunch of technology tools that say, no, these are two different roles: I'm going to create tools purpose-built for the data scientist, and create tools purpose-built for the data engineer. And I think there's space for both, to the extent that you have applications running from news feeds or ads to predicting how my self-driving car should make a left turn; you're going to need tools that are used by both types of populations.
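Jim's point about containerizing models can be made concrete with a minimal sketch: a toy model exposed over HTTP using only the standard library plus scikit-learn, so the same script runs unchanged on a laptop or as the entry point of a Docker image. The model, the port, and the JSON contract here are illustrative assumptions, not anyone's actual stack.

```python
# Minimal sketch of a containerizable model service: train a toy model,
# then serve predictions over HTTP. The model, port 8080, and the JSON
# contract are illustrative assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from sklearn.linear_model import LogisticRegression

# Stand-in for the data scientist's work: a trivially trained model.
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))["features"]  # e.g. {"features": [[1.5]]}
        body = json.dumps({"predictions": model.predict(features).tolist()}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # In a Dockerfile, this script would be the container's entry point.
    HTTPServer(("", 8080), PredictHandler).serve_forever()
```

Packaged this way, the data scientist's model and the data engineer's runtime meet in a single artifact, which is one way the two roles described above can work hand in hand.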
>> I think Cloudera now has a collaborative environment for data science. IBM has something very similar with what they're doing, so it's a team that has specialties such as coders, data modelers, and data engineers. Point well taken. Cloudera's made a major entrance into that space of collaborative development of these rich stacks of IP, essentially, that include deterministic program code but also probabilistic models, in a deepening stack. >> I think you've seen Cloudera definitely follow that path from Hadoop and the low-level file system, HDFS, to these high-level tools for data scientists; it's becoming a platform for machine learning for these next generation applications. And I think you see Docker, in the infrastructure analogy, doing low-level tools like Project Moby and LinuxKit, up to high-level services around Docker Datacenter. So you can either have the basic tools for your low-level developer, or, for the system admin or administrator who wants to operate or run the cloud, you have tools for him or her, too. >> It's interesting, you look at some of these projects and some of the maturation and pivots you see. We talked about how dotCloud went over to Docker. You see a bunch of OpenStack companies that are now Kubernetes companies. I see companies that were big data, and they're now, "Oh, I'm an AI or ML company." It's usually not the tool, it's the wave. What is the driver? Is data the driver of our next wave there? Is it the application? Is it some combination of the two? Those are the two that I usually look at. Follow the data, follow the application. >> I would say it's data driving it. It's really data and applications; the applications make use of the data. Algorithms, I think, are a component. They're important, but they're a component. So what you see now, to be on the right side of history: data is outstripping compute and storage. The amount of video and sensor data that we're generating from our phones, our cars, our homes is outstripping most of the other curves in compute, networking, whatever. That's definitely kind of a rising tide or a wave, as Stu was saying. Now how do we extract value from this data? And historically, because you didn't have the infrastructure, that cloud, or the compute capacity to make use of this data, it was kind of stranded. So what you've seen with a generation of technologies, like Hadoop for big data, or cloud technologies like Docker, is they distribute your applications across a cloud. That's actually enabling you to now build applications to get value out of this data. And that value can be something like forecasting your sales this quarter. It can be about figuring out which shade of brown belt you should wear with your pants, going back to our clothing analogy. Or it could be, let me build a model around how this car or this drone should drive or fly itself. So you combine the vast amount of data, and a nearly infinite resource of compute, with these machine learning or AI techniques; machine learning is one AI technique, but there are all these other techniques; and you can build another generation of applications, these new intelligent applications, to power everything from your home, your car, your watch, to your enterprise app, as wonderful as that is. >> Much of the sea change is that less and less coding or programming is actually being done, or needs to be done, because more of the application logic is being distilled directly from the data in the form of machine learning.
>> I would say application logic from the insight, right. So in my mind, application logic, an application, is reflecting a business process: hire to fire, order to cash. You still need a program that does the logic. Data in itself, or AI in itself, without that context, without that business process, is meaningless, right? Just having a model around Jim or Stu doesn't matter unless you're trying to buy something. Google pioneered machine learning in a workflow perfectly. You're searching for something, they knew who you were based upon your history, they served you the right ad and said, "Oh, you really want to buy a car, you want to buy a house." So in the workflow, or in the application logic of a search, they used ML to serve you timely information. Now if you're an enterprise, you're looking at help desk tickets, be it ITSM like ServiceNow, or support tickets like Zendesk handling B2C support tickets. That's a workflow; there's application logic. They take information on a user or a grumpy customer, and they do things like automatically respond to a help ticket, reset your password, provision a server (a toy sketch of this kind of ticket triage appears a little further down). So I think when you have AI, or have applications using this data in the context of a business process, that's magic. And I think we're seeing some core technologies like TensorFlow out there that are super compelling. But we're seeing a generation of developers and founders take that technology and apply it to a problem; it could be HR or CRM, ITSM, or a true vertical: construction, finance, health care. >> Jim: Streaming media analytics is a core area where that's coming in. >> Media analytics, because there's a ton of data. Understanding what you watch and what you want to see. And so you apply things to a vertical, like health care, or apply the technology to a problem space, like media analytics, and you have a wonderful application and hopefully a great company. >> Jerry, we've talked a lot at the cloud shows about how the startups stay relevant and get involved when there are all of these platforms. We talked about what Google does, Amazon of course is eating the entire world in everything, Microsoft is making a lot of moves here. How do companies, what do you look for? Has your investment strategy changed at all in the last couple of years? >> It is daunting. I think about this a lot in terms of business models and defensibility, and the question is, what are the sustainable moats you can build around your business as a startup anymore? 'Cause you feel like economies of scale and ecosystems, network effects, those were historically big defensive moats for a Windows operating system. Now those apply to Facebook's platform, Apple's platform, or AWS. They have scale and they have network effects from the ecosystem, so now your startup is saying, okay, how can I either a) overcome those moats, or b) develop my own IP or my own moats around myself so that I can actually sustain and thrive in this generation. I think you've got to play a different game. As a startup, you're not going to try to out-scale Google or Microsoft; leave that to Amazon and those three or four players. But you can get scale in a domain, so either a problem space, like autonomous vehicles or security is a great one, or a vertical, like construction or health care. You redefine the market so that you can dominate it and build your own moat around that IP.
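As a rough aside on Jerry's help-desk example above (automatically responding to tickets, resetting passwords, provisioning servers), the sketch below is a minimal, hypothetical Python illustration of ML applied inside a business process: classifying incoming tickets so routine ones can be routed to an automated workflow. The tickets, labels, and model choice are all invented for illustration; a real system would train on historical tickets from a tool like ServiceNow or Zendesk.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical historical tickets, each labeled with the action taken.
tickets = [
    "I forgot my password and can't log in",
    "Please reset my password",
    "We need another VM for the staging environment",
    "Can you provision a new server for the demo?",
    "The invoice total looks wrong this month",
    "I was double-billed on my last statement",
]
actions = [
    "reset_password", "reset_password",
    "provision_server", "provision_server",
    "billing_review", "billing_review",
]

# TF-IDF features plus a simple classifier: enough to route a new
# ticket to an automated workflow instead of a human queue.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tickets, actions)

print(model.predict(["locked out of my account, need a password reset"]))
# -> ['reset_password']; with training data this small, treat the
#    output as illustrative only.

In production the interesting work is exactly the workflow integration Jerry describes: the prediction only creates value once it triggers the password reset or the server provisioning step.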
>> It's interesting. Did you hear Adrian Cockcroft, who went from Battery Ventures over to AWS? He's like, "Well, rather than go start up that business, come build that next thing at Amazon and we'll do it there." Is that a viable way for people with the entrepreneurial spirit to go be part of that two-pizza team doing something cool inside a large platform? >> I think Adrian probably has motivation to get more developers on Amazon now, but I would say most of our companies, not all, but a lot of them, started at Amazon. Some start in Azure, some start in Google, some start with their own data centers. I think what they believe is they'll get started in one of these clouds, but, as we talked about earlier, I don't believe it's a one-cloud-rules-all world. I think there'll be three or four, if not more, clouds; every different geography, from Europe to Asia to Russia to the US, will have different clouds, different players. So I think it's fine to get started in Amazon and be a two-pizza team alongside the other two-pizza teams, but over time I see these applications being cross-cloud, and that's where something like Docker comes into play. Docker wants to be cross-cloud, better than any other technology out there. >> On some level, actually, the moat could be, or increasingly is, the training data that drives the refinement of your AI; Tesla is a perfect example. For the self-driving capabilities they built into the vehicle, they now have a few years' worth of rich test data, training data I should say, that is a core moat in terms of continuing refinement of those algorithms. So that gives you an example: some startup might come along with some very specialized application that takes the consumer world by storm, and then they build up some deep well of training data in some very specialized area that becomes their core asset, one their next competitor down the pipe doesn't have. >> It has to be a set of data that's unique or proprietary. You're not going to out-train your model on cat photos from Google, right? So it has to be a combination of either proprietary data or a combination of data sources that you can stitch together. It's not just one data source; I believe you have to combine multiple data sources together. >> So Jerry, sitting over Jim's shoulder is VMware's booth. I haven't talked about VMware at all this week. You worked at VMware; I've worked with VMware since pretty early days. What advice would you give VMware in the containerized cloud future? What should they be doing to be part of more conversations? >> I think it's amazing that they have a presence here of this size and scale. The past couple of years they've really done a lot to embrace containers and Docker, so I think that's first and foremost. They've made a couple of great moves lately. Embracing Amazon last year, with VMware on Amazon, was a big move. Embracing containers with some of their cloud and data technologies I think was an aggressive move, too. So I think they're moving in the right direction. I think what they need to understand is, are they going to reinvent themselves and push these new technologies aggressively, or are they going to keep hanging onto some of their old businesses? For any company of their size and scale, there are multiple motivations, but I think they're making the right steps. Five years ago, or four years ago, I don't think they would have taken this DockerCon seriously. I don't think they were exhibitors at the first DockerCon.
But in the past 24 months they've made some amazing moves, so I would say it makes me smile to see them take these great steps forward. >> Jerry, I want to give you the last word. Any cool companies we should be looking at, or things that are exciting to you, without giving away trade secrets? >> I can't broadcast the companies I want, because everyone else is going to chase those investments. I don't know, I think I'm going to enjoy spending time, actually, less with the companies here and more with the developers and customers, because I think by the time they have a booth here, everybody knows the company; the investment is probably too far along for me to invest, maybe, maybe not. But talking to developers to hear, what are their friction points? I think when you hear enough friction, either in this ecosystem or another ecosystem, or at AWS or VMware, then there's something there; you've just got to scratch. >> I was talking to some of the people working the booths, and they just said the quality of the attendees here is great; you learn something from every single person you talk to, and there are only a few shows you can say that about. Amazon re:Invent is one where the quality of the attendees is always real good, and this one and a few others. >> I think people who come here by definition are learners, both the companies and the individuals, and you want to surround yourself with learners, people who are open and honest and always learning. >> Jerry, I think that's a perfect note to end it on. We are always learners here, helping our audience try to understand these technologies. So Jerry Chen, always a pleasure. And we'll be back with the wrap-up of day one of DockerCon 2017. You're watching theCUBE. (techno music)

Published Date : Apr 18 2017


Robbie Strickland, IBM - Spark Summit East 2017 - #SparkSummit - #theCUBE


 

>> Announcer: Live from Boston, Massachusetts, this is theCube, covering Spark Summit East 2017, brought to you by Databricks. Now here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to theCube, everybody, we're here in Boston. theCube is the worldwide leader in live tech coverage. This is Spark Summit, hashtag #SparkSummit. And Robbie Strickland is here. He's the Vice President of Engines & Pipelines, I love that title, for the Watson Data Platform at IBM Analytics, formerly with The Weather Company, which was acquired by IBM. Welcome to theCube, good to see you. >> Thank you, good to be here. >> So, my standing tongue-in-cheek line is the industry's changing: Dell buys EMC, IBM buys The Weather Company. >> Robbie: That's right. >> Wow! That sort of says it all, right? But it was kind of a really interesting blockbuster acquisition. Great for the folks at The Weather Company, great for IBM, so give us the update. Where are we at today? >> So, it's been an interesting first year. Actually, we just hit our first anniversary of the acquisition, and a lot has changed. Part of my role, my new role at IBM, having come from The Weather Company, is a byproduct of the two companies bringing together our best analytics work and kind of pulling those together. I don't know if we have some water, but that would be great. So, (coughs) excuse me. >> Dave: So, let me chat for a bit. >> Thanks. >> Feel free to clear your throat. So, you were at IBM, the conference at the time was called IBM Insight. It was the day before the acquisition was announced, and we had David Kenny on. David Kenny was the CEO of The Weather Company. And I remember we were talking, and I was like, wow, you have such an interesting business model. Off camera, I was like, what do you want to do with this company, you guys are like prime. Are you going public, are you going to sell this thing, I know you have an MBA background. And he goes, "Oh, yeah, we're having fun." The next day was the announcement that IBM bought The Weather Company. I saw him later and I was like, "Aha!" >> And now he's the leader of the Watson Group. >> That's right. >> Which is part of our, The Weather Company joined The Watson Group. >> And The Cloud and analytics groups have come together in recognition that analytics and The Cloud are peanut butter and jelly. >> Robbie: That's absolutely right. >> And David's running that organization, right? >> That is absolutely right. So, it's been an exciting year, it's been an interesting year, a lot of challenges. But I think where we are now with the Watson Data Platform is a real recognition that the use case, where we want to make data and analytics and machine learning and the operationalizing of all of those easy, that that's not easy for people today. And we need to make it easy. And our experience doing that at The Weather Company, and all the challenges we ran into, have informed the organization, have informed the road map and the technologies that we're using to kind of move forward on that path. >> And the Watson Data Platform was announced in, I believe, October. >> Robbie: That's right. >> You guys had a big announcement in New York City. And you took many sorts of components that were viewed as individual discrete functions-- >> Robbie: That's right. >> And brought them together in a single data pipeline. Is that right? >> Robbie: That's right. >> So, maybe describe that a little bit for our audience.
>> So, the vision is, you know, one of the things that's missing in the market today is the ability to easily grab data from some source, whether it's a database or a Kafka stream, or some sort of streaming data feed, which is actually something that's often overlooked. Usually you have platforms that are oriented around streaming data, data feeds, or oriented around data at rest, batch data. One of the things that we really wanted to do was combine those two together, because we think that's really important. So, to be able to easily acquire data at scale, bring it into a platform, and orchestrate complex workflows around it, with the objective, of course, of data enrichment. Ultimately, what you want to be able to do is take those raw signals, whatever they are, and turn that into some sort of enriched data for your organization. And so, for example, we may take signals in from a mobile app, things like beacons, usage beacons on a mobile app, and turn that into a recommendation engine so we can feed real time content decisions back into a mobile platform. Well, that's really hard right now. It requires lots of custom development. It requires you to essentially stitch together your pipeline end to end. It might involve a machine learning pipeline that runs a training pipeline. It's all batch oriented, so you land your data somewhere, and you run this machine learning pipeline, maybe in Spark or Hadoop or whatever you've got. And then the results of that get fed back into some data store that gets merged with your online application. And then you need to have a RESTful API or something for your application to consume that and make decisions. So, our objective was to take all of the manual work of standing up those individual pieces and build a platform where that is just what it's designed to do. It's designed to orchestrate those multiple combinations of real time and batch flows. And then, with a click of a button and a few configuration options, stand up a RESTful service on top of whatever the results are, you know, either at an interim stage or at the end of the line.
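To make the batch-train, REST-serve loop Robbie describes concrete, here is a minimal, hand-rolled Python sketch. To be clear, this is not the Watson Data Platform API; it is a hypothetical toy built with scikit-learn and Flask, with invented feature names and an invented endpoint, showing the pieces the platform is meant to stand up for you with "a click of a button."

from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression
import numpy as np

# --- Batch stage: train an enrichment model from landed signals. ---
# Pretend these rows were aggregated from mobile-app usage beacons:
# [sessions_this_week, minutes_in_app]; label = clicked a promotion.
X = np.array([[1, 5], [2, 9], [8, 40], [12, 55], [3, 12], [10, 60]])
y = np.array([0, 0, 1, 1, 0, 1])
model = LogisticRegression().fit(X, y)

# --- Serving stage: a RESTful API over the model's output. ---
app = Flask(__name__)

@app.route("/recommend", methods=["POST"])
def recommend():
    body = request.get_json()
    features = [[body["sessions"], body["minutes"]]]
    score = float(model.predict_proba(features)[0][1])
    # Feed a real-time content decision back to the mobile app.
    return jsonify({"show_promo": score > 0.5, "score": score})

if __name__ == "__main__":
    app.run(port=8080)

Everything Robbie lists, the scheduled retraining, the merge into an online store, and the generated REST layer, is exactly the glue this toy leaves out and the platform aims to automate.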
>> And you guys gave an example. You actually showed a demo at the announcement. And I think it was a retail example; you showed a lot of what would traditionally be batch processes, and then, in real time, a recommendation came up and completed the purchase. The inference was this is an out-of-the-box software solution. >> Robbie: That's right. >> And that's really what you're saying you've developed. A lot of people would say, oh, it's IBM, they've cobbled together a bunch of their old products, stuck them together, put an abstraction layer on, and wrapped a bunch of services around it. I'm hearing from you-- >> That's exactly, that's just WebSphere. It's WebSphere repackaged. >> (laughing) Yeah, yeah, yeah. >> No, it's not that. So, one of the things that we're trying to do is, if you look at our cloud strategy, I mean, this is really part and parcel, the nexus of the cloud strategy is the Watson Data Platform. What we could have done is we could have said, let's build a fantastic cloud and compete with Amazon or Google or Microsoft. But what we realized is that there is a certain niche there of people who want to take individual services and compose them together and build an application, mostly on top of just raw VMs, with some additional, you know, let's stitch together something with Lambda or stitch together something with SQS, or whatever it may be. Our objective was to elevate that a bit, not to try to compete on that level, and say, how do we bring Enterprise grade capabilities to that space: Enterprise grade data management capabilities, end-to-end application development, machine learning as a first-class citizen, in a cohesive experience. So that, you know, the collaboration is key. We want to be able to collaborate with business users, data scientists, data engineers, developers, API developers, and the consumers of the end results of that, whether they be mobile developers or whatever. One of the things that is sort of key, I think, to the vision is these roles that we've traditionally looked at. If you look at the way that tool sets are built, they're very targeted to specific roles. The data engineer has a tool, the data scientist has a tool. And what's been the difficult part is the boundaries between those have been very firm and the collaboration has been difficult. And so, we draw the personas as a Venn diagram, because it's very difficult: especially if you look at a smaller company, and even sometimes larger companies, the data engineer is the data scientist. The developer who builds the mobile application is the data scientist. And then in some larger organizations, you have very large teams of data scientists that have these artificial barriers between the data scientist and the data engineer. So, how do we solve both cases? And I think the answer, for us, was a platform that allows for seamless collaboration, where there aren't these clean lines between the personas, and where the tool sets easily move from one to the other. And if you're one of those hybrid people that works across lines, the tool feels like one tool for you. But if you're two different teams working together, you can easily hand off. So, that was one of the key objectives we were trying to answer. >> Definitely an innovative component of the announcement, for sure. Go ahead, George. >> So, help us sort of bracket how mature this end-to-end tool suite is in terms of how much of the pipeline it addresses. You know, from the data origin all the way to a trained model and deploying that model. Sort of what's there now, what's left to do. >> So, there are a few things we've brought to market. Probably the most significant is the Data Science Experience. The Data Science Experience is oriented around data science and has, as its central interface, Jupyter Notebooks. As well, we've brought in RStudio and those sorts of things. The idea there being that we'll start with the collaboration around data scientists. So, data scientists can use their language of choice, collaborate around data sets, save out the results of their work, and have it consumed either publicly or by some other group of data scientists. But the collaboration among data scientists, that was sort of step one. There's a lot of work going on that's sort of ongoing, not ready to bring to market, around how do we simplify machine learning pipelines specifically, how do we bring governance and lineage and catalog services and those sorts of things. And then the ingest: one of the things we're working on that we have brought to market is our product called Lift, which connects, as well. And that's bringing large amounts of data easily into the platform. There are a few components that have sort of been brought to market. dashDB, of course, is a key source of data in the cloud.
So, one of the things that we're working on is taking some of these existing technologies that actually play well into the ecosystem, tying them together well, and then adding the additional glue pieces. >> And some of your information management and governance components, as well. Now, maybe that is a little bit more legacy, but they're proven. And I don't know if the exits and entries into those systems are as open, but there are some capabilities there. >> Speaking of openness, that's actually a great point. If you look at the IIG suite, it's a great On-Premise suite. And one of the challenges that we've had with past IBM cloud offerings is that a lot of what has been the M.O. in the past is to take a great On-Prem solution and just try to stand it up as a service in the cloud. Which in some cases has been successful; in other cases, less so. One of the things we're trying to look at with this platform is how do we leverage open source. So that whatever open source you may already be running, on-prem or in some other provider, it's very easy to move your workloads. So, we want to be able to say, if you've got 10,000 lines of fraud detection code in MapReduce, you don't need to rewrite that in anything. You can just move it. And the other thing is, where our existing legacy tech doesn't necessarily translate well to the cloud, our first strategy is to see if there's any traction around an existing open source project that satisfies that need, and to see if we can build on that. Where there's not, we go cloud first and build something tailor-made for the cloud. >> So, who are the first one or two customers for this platform? Is it like IBM Global Business Services, where they're building the semi-custom industry apps? Or is it the very, very big and sophisticated, like banks and Telcos, who are doing the same? Or have you gotten to the point where you can push it out to a much wider audience? >> That's a great question, and it's actually one that is a source of lots of conversation internally for us. If you look at where the Data Science Experience is right now, it's a lot of individual data scientists, you know, small companies, those sorts of things coming together. And a lot of that is because some of the sophistication that we expect for Enterprise customers is not quite there yet. So, we wouldn't expect Enterprise customers to necessarily be onboarded as quickly at the moment. But if we look at, I guess there's maybe a medium term answer and a long term answer. I think the long term answer is definitely the Enterprise customers; leveraging IBM's huge entry point into all of those customers today, there's definitely a play to be made there. And one of the things where we're differentiating, we think, over an AWS or Google is that we're trying to answer that use case in a way that they really aren't even trying to answer right now. And so, that's one thing. The other is, you know, going beta with a launch customer that's a healthcare provider or a bank, where they have all sorts of regulatory requirements, that's more complicated. And so, in some cases we're looking at those banks or healthcare providers and trying to carve off a small niche use case that doesn't fall into the category of all those regulatory requirements, so that we can get our feet wet, get the tires kicked, those sorts of things. And in some cases we're looking for less traditional Enterprise customers to try to launch with.
So, that's an active area of discussion. And one of the other key ones is The Weather Company: trying to take The Weather Company workloads and move them over. >> I want to come back to The Weather Company. When you did that deal, I was talking to one of your executives and he said, "Why do you think we did the deal?" I said, "Well, you've got 1500 data scientists, you've got all this data, you know, it's the future." He goes, "Yeah, it's also going to be a platform for IoT for IBM." >> Robbie: That's right. >> And I was like, "Hmmm." I get the IoT piece, but how does it become a platform for IBM's IoT strategy? Is that really the case? Is that transpiring, and how so? >> It's interesting, because that was definitely one of the key tenets behind the acquisition. And as we've been working on it so hard over the last year, as I'm sure you know, sometimes the boxes and arrows on an architecture diagram are more challenging in reality. >> Dave: (laughing) Don't do that. >> And so, what we've had to do is reconcile a lot of what we built at The Weather Company, existing IBM tech, and the new things that were in flight, and try to figure out how we can fit all those pieces together. And so, it's been complicated but also good. In some cases, it's just people and expertise, bringing those people and expertise over and leaving some of the software behind. In other cases, it's actually bringing the software. So, the story is, obviously, where the rubber meets the road, more complicated than what it sounds like in the press release. But the reality is we've combined those teams, and they are all moving in the same direction together, with various bits and pieces from the different teams. >> Okay, so, there's the vision and then the road map to execute on that, and it's going to unfold over several years. >> Robbie: That's right. >> Okay, good. Stuff at the event here, I mean, what are you seeing, what's hot, what's going on with Spark? >> I think one of the interesting things with what's going on with Spark right now is a lot of the optimizations, especially things around GPUs. And we're pretty excited about that; being a hardware manufacturer, that's something that is interesting to us. We run our own cloud. Where some people may not be able to immediately leverage those capabilities, we're pretty excited about that. And also, we're looking at taking Spark and running it on Power and those sorts of things, to try to leverage the hardware improvements. So, that's one of the things we're doing. >> Alright, we have to leave it there, Robbie. Thanks very much for coming on theCube, really appreciate it. >> Thank you. >> You're welcome. Alright, keep it right there, everybody. We'll be right back with our next guest. This is theCube. We're live from Spark Summit East, hashtag #SparkSummit. Be right back. >> Narrator: Since the dawn of The Cloud, theCube.

Published Date : Feb 9 2017
