Ali Ghodsi, Databricks | Informatica World 2019

>> Live from Las Vegas, it's theCUBE, covering Informatica World 2019. Brought to you by Informatica. >> Welcome back everyone to theCUBE's live coverage of Informatica World 2019. I'm your host Rebecca Knight, along with my co-host John Furrier. We're joined by Ali Ghodsi, he is the CEO of Databricks, thank you so much for coming on, for returning to theCUBE. You're a CUBE veteran. >> Yes, thank you for having me. >> So I want to pick up on something that you said up on the main stage, and that is that every enterprise on the planet wants to add AI capabilities, but the hardest part of AI is not AI, it's the data. >> Yeah. >> Can you riff on that a little bit for our viewers? Elaborate? >> Yeah, actually, the interesting part is that, if you look at the companies that succeeded with AI, the actual AI algorithms they're using, are actually algorithms from the 70s, you know, they're actually developed in the 70s, that's 50 years ago. So then how come they're succeeding now? When actually the same algorithms weren't working in the 70s, so people gave up on them. Like, these things called neural nets, right? Now they're en vogue and they're, you know, super successful. The reason is you have to apply orders of magnitude more data. If you feed those algorithms that we thought were broken orders of magnitude more data, you actually get great results, but that's actually hard. You know, dealing with petabyte scale data and cleaning it, making sure that it's actually the right data for the task at hand is not easy. So that's the part that people are struggling with. >> I saw you up on stage, I'm like ah, Ali's here, Databricks is here, that's awesome. Psyched that you stopped by theCUBE. Been a while. I wanted to get a quick update, 'cause you guys have been on a tear, doing some great work at Cal, we were just told before we came on camera. But what are you doing here? What's the, is there any announcements or news with Informatica? What's the story?
>> Yeah, it's, we're doing a partnership around Delta Lake, which is our next generation engine that we built, so we're super excited about that. It integrates with all of the Informatica platform. So their ingestion tools, their transformation tools, and the catalog that they also have. So we think together, this can actually really help enterprises make that transition into the AI era. >> So you know, we've been followers, our 10th year, so remember when we were in the Cloudera office of Mike Olson and Amr Awadallah when we first started and now, Hadoop movement started, and then the cloud came along. Right when you guys started your company, the cloud growth took off. You guys were instrumental in changing the equation in dealing with data, data lakes, whatever they're calling it back then. So now, data, holistically, is a systems architecture. On premise it's a huge challenge, cloud native, well no real challenge, people love that. Data feeds AI, lot of risk taking, lot of reward. We're seeing the SaaS business explode, Zoom communications. The list goes on and on. You know, an enterprise that's trying to be SaaS is hard. So you can't just take data from an enterprise and make it SaaS-ified. You really got to think differently. What are you guys doing? How have you guys evolved and vectored into that challenge, because this is where your core value proposition initially started change. Take us through that Databricks story and how you're solving that problem today. >> Yeah, it's a great question. Really what happened is that people started collecting a lot of data about a decade ago. And the promise was, you can do great things with this. There are all these aspirational use cases around machine learning, real time, it's going to be amazing. Right? So people started collecting it. They started storing one petabyte, two petabytes, and they kept going back to their boss and saying this project is real successful I now have five petabytes in it.
But at some point the business said, okay that's great but what can you do with it? What business problems are you actually addressing? What are you solving? And so, in the last couple years there's been a push towards let's prove the value of these data lakes. And actually, many of these projects are falling short. Many are failing. And the reason is, people have just been dumping this data into data lakes without thinking about the structure, the quality, how it's going to be used. The use cases have been an afterthought. So the number one thing at the top of mind for everyone right now is how do we make these data lakes that we have successful so we can prove some business value to our management? Towards this, this is the main problem that we're focusing on. Towards this, we built something called Delta Lake. It's something you situate on top of your data lake. And what it does is it increases the quality, the reliability, the performance, and the scale of your data lake. >> (John) So it's like a filter. >> Yeah. >> The cream rises to the top. >> (Ali) Exactly. >> Lets the sludge, the data swamp stay below the clean water, if you will. >> Exactly, actually, you nailed it. So basically, we look at the data as it comes in, filter as you said, and then look at, if there's any quality issues we then put it back in the data lake. It's fine, it can stay there. We'll figure out how to get value out of it later. But if it makes it into the Delta Lake, it will have high quality. Right? So that's great. And since we're anyway already looking at all the data as it's coming in, we might as well also store a lot of indices and a lot of things that let us performance optimize it later on. So that, later, when people are actually trying to use that data they get really high performance, they get really good quality. And we also added ACID transactions to it so that now you're also getting all those transactional use cases working on your existing data lake.
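(Editor's illustration, not part of the interview.) The filter idea described above, where data is checked as it arrives, clean records are promoted, problem records stay behind in the raw lake, and an index is built along the way, can be sketched in a few lines of plain Python. This is only a toy model under stated assumptions: `CuratedStore`, `ingest`, and `has_required_fields` are hypothetical names invented for this sketch, not Delta Lake or Databricks APIs, and ACID transactions are not modeled at all.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


@dataclass
class CuratedStore:
    """Hypothetical stand-in for a curated table layered over a raw data lake."""
    rows: List[Record] = field(default_factory=list)
    # Maps a key value to the positions of matching rows, built at ingest time.
    index: Dict[Any, List[int]] = field(default_factory=dict)

    def append(self, record: Record, key: str) -> None:
        self.index.setdefault(record[key], []).append(len(self.rows))
        self.rows.append(record)

    def lookup(self, value: Any) -> List[Record]:
        # Uses the index instead of scanning every row.
        return [self.rows[i] for i in self.index.get(value, [])]


def ingest(
    records: List[Record],
    is_valid: Callable[[Record], bool],
    key: str,
) -> Tuple[CuratedStore, List[Record]]:
    """Route each incoming record: valid ones are curated, the rest stay raw."""
    curated = CuratedStore()
    left_in_raw_lake: List[Record] = []
    for record in records:
        if is_valid(record):
            curated.append(record, key)
        else:
            left_in_raw_lake.append(record)
    return curated, left_in_raw_lake


def has_required_fields(record: Record) -> bool:
    """Example quality rule (an assumption): user_id and amount must be present and typed."""
    return isinstance(record.get("user_id"), int) and isinstance(
        record.get("amount"), (int, float)
    )
```

Calling `ingest(records, has_required_fields, key="user_id")` returns the curated store plus the list of records that failed the check; nothing is discarded, which mirrors the "it can stay there" remark in the interview.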
>> I saw, at my daughter's graduation at Cal Berkeley this weekend and yesterday, people around with Databricks backpacks. Very popular in academia. You guys got the young generation coming in. What's the update on the company? How many employees? What's the traction? Give us a quick business update. >> Yeah we're about 800 employees now. About 100 people in Europe, I would say, and maybe 40-50 people in APAC. We're expanding the EMEA and the Asia business. >> (John) Growth mode. >> Yeah, growth mode. So it's expanding as fast as possible. I mean, I actually, as a CEO, I try to always slow the hiring down to make sure that we keep the quality bars. So that's actually top of mind for me. But yeah we're-- >> (John) You did Delta Lake on that one. >> Yeah (laughing) >> Exactly. Yeah and we're super excited about working with these universities. We get a lot of graduate students from top universities-- >> And Cal had the first ever class in college of data analytics, what was that? Data analytics are the first inaugural class graduated. Shows how early it is. >> Yeah, yeah, yeah. And actually used Databricks, the community edition, for a class of over a thousand students at Cal used the platform. So they're going to be trained in data science as they come out. >> So I want to ask about that because as you said you're trying to slow down the hiring to make sure that you are maintaining a high bar for your new hires. But yet, I'm sure there's a huge demand because you are in growth mode. So what are you doing? You said you're working with universities to make sure that the next generation is trained up and is capable of performing at Databricks. So tell us more about those efforts. >> Yeah I mean, so, obviously university recruiting is big for us. At Cal, I think Databricks has the longest line of all the companies that come there on the career fair day. So, we work very closely with these universities.
I think, next generation, as they come out, this generation that's coming out today actually is data science trained. So it's a big difference. There is a huge skills gap out there. Every big enterprise you talk to tells you my biggest problem is actually, I don't have skilled people. Can you help me hire people? I say, hey we're not in the recruiting business. But, the good news is, if you look at the universities, they're all training thousands and thousands of data scientists every year now. I can tell you just at Cal, because I happen to be on the faculty there, almost every applicant now, to grad school, wants to do something AI related. Which has actually led to, if you look at all the programs in universities today, people used to do networking, professors used to do networking, say we do intelligent networks. People who do databases say, we do intelligent databases. People who do systems research say, hey we do intelligent systems, right? So what that means is, in a couple years you'll have lots of students coming out and these companies, that are now struggling hiring, then will be able to hire this talent and will actually succeed better with these AI projects. >> As they say in Berkeley, nothing like a good revolution once in a while. AI is kind of changing everyone over. I got to ask you for the young kids out there, and parents who have kids either in elementary school or high school, everyone is trying to figure out, and there's no yet clear playbook, we're starting to see first generation training, but is there a skill set, because there's a range in surface area, you got hardcore coding to ethics, and everything in between from visualization, multiple dimensions of opportunities.
What skills do you think people could hone or tweak that may not be on a curriculum that they could get, or pieces of different curriculums in school that would be a good foundation for folks learning and wanting to jump in to data and data value, whether it's coding to ethics? >> Yeah, just looking at my own background and seeing how, what I got to learn in school, the thing that was lacking, compared to what's needed today, is statistics. Understanding of statistics, statistical knowledge. That I think, it's going to be pervasive. So I think, 10, 15 years from now, no matter which field you're in, actually whatever job you have, you have to have some basic level of statistical understanding 'cause the systems you're working with will be, they'll be spitting out statistics and numbers and you need to understand what is false positives, what is this, what is the sample, what is that? What do these things mean? So that's one thing that's definitely missing and actually it's coming, that's one. The second is computing will continue being important. So, in the intersection of those two is, I think a lot of those jobs. >> In all fields, we were talking about earlier, biology, everything's intersecting, biochemistry to whatever right? >> (Ali) Yeah. >> I got to ask you about, well I'm a little old school, I'm 53 years old but I remember when I broke into the business coding, I used to walk into departments, they were called DP, data processing. So we're getting into the data processing world now, you've got statistics, you've got pipeline, these are data concepts. So I got to ask you as companies that are in the enterprise may be slower to move to the cutting edge like you guys are, they got to figure out where to store the data. So can you share your opinion or view on how customers are thinking and how they maybe should be architecting data on premise, in the cloud.
Certainly cloud's great, if you're getting cloud native for pure SaaS, and born in the cloud like a start-up. But if you're a large enterprise, and you want to be SaaS-like, to have all that benefit, take the risk with the reward of being agile, you got to have data because if you don't feed the data into the machine learning or AI, you're not going to have good AI. So you need to get that data feeding in fast. And if it's constrained with regulation compliance you're screwed. So what's your view on this? Where should it be stored? What's your opinion? >> Yeah, we've had the same opinion for five, six years, right? Which is the data belongs in the cloud. Don't try to do this yourself. Don't try to do this on prem. Don't store it in Hadoop, it's not built for this. Store it in the cloud. In the cloud, first of all, you get a lot of security benefits that the cloud vendors are already working on. So that's one good thing about it. Second, you get it, it's reliable. You get the 10, 11 nines of availability, so that's great, you get that. Start collecting data there. Another reason you want to do it in the cloud is that a lot of the data sets that you need to actually get good quality results are available in the cloud. Often times what happens with AI is, you build a predictive model, but actually, it's terrible. It didn't work well. So you go back, and then the main trick, the first tricks you use to increase the quality is actually augmenting that data with other data sets. You might purchase those data sets from other vendors. You don't want to be shipping hard drives around or, you know, getting that into your data center. Those will be available in the cloud, so you can augment that data. So we're big fans of storing your data in data lakes, in the cloud. We obviously believe that you need to make that data high quality and reliable. With that we believe the Delta Lake platform, open-source project that we created is a great vehicle for that.
But I think moving to the cloud is the number one thing. >> (John) And hybrid works with that if you need to have something on premise? >> In my opinion the two worlds are so different, that it's hard. You hear a lot of vendors that say we're the hybrid solution that works on both and so on. But the two models are so different, fundamentally, that it's hard to actually make them work well. I have not yet seen a customer yet or enterprise. You see a lot of offerings, where people say hybrid is the way. Of course, a lot of on prem vendors are now saying, hey, we're the hybrid solution. I haven't actually seen that be successful to be frank. Maybe someone will crack that nut but-- >> I think it's an operational question to see who can make it work. Ali, congratulations on all your success. Great to see you. >> Yeah it's been great having you on the show. >> Thank you so much for having me. >> You are watching theCUBE, Informatica 2019. I'm Rebecca Knight, for John Furrier, stay tuned.

Published Date : May 21 2019

ENTITIES

Entity	Category	Confidence
Rebecca Knight	PERSON	0.99+
Ali Ghodsi	PERSON	0.99+
10	QUANTITY	0.99+
Databricks	ORGANIZATION	0.99+
Europe	LOCATION	0.99+
John Furrier	PERSON	0.99+
Informatica	ORGANIZATION	0.99+
first	QUANTITY	0.99+
five	QUANTITY	0.99+
Cal	ORGANIZATION	0.99+
Ali	PERSON	0.99+
John	PERSON	0.99+
two	QUANTITY	0.99+
two models	QUANTITY	0.99+
thousands	QUANTITY	0.99+
one petabytes	QUANTITY	0.99+
10th year	QUANTITY	0.99+
Second	QUANTITY	0.99+
yesterday	DATE	0.99+
two petabytes	QUANTITY	0.99+
70s	DATE	0.99+
six years	QUANTITY	0.99+
Las Vegas	LOCATION	0.99+
Duke	ORGANIZATION	0.99+
five petabytes	QUANTITY	0.99+
Delta Lake	LOCATION	0.99+
both	QUANTITY	0.99+
Delta Lake	ORGANIZATION	0.99+
second	QUANTITY	0.98+
first tricks	QUANTITY	0.98+
Berkley	LOCATION	0.98+
40-50 people	QUANTITY	0.98+
two worlds	QUANTITY	0.98+
one good thing	QUANTITY	0.98+
one	QUANTITY	0.98+
Asia	LOCATION	0.98+
50 years ago	DATE	0.98+
CUBE	ORGANIZATION	0.97+
Cal Berkley	LOCATION	0.97+
over a thousand students	QUANTITY	0.97+
theCUBE	ORGANIZATION	0.96+
15 years	QUANTITY	0.96+
today	DATE	0.96+
Asiapac	LOCATION	0.96+
Mike Olsen	PERSON	0.96+
Amr Awadallah	PERSON	0.96+
About 100 people	QUANTITY	0.96+
53 years old	QUANTITY	0.95+
about 800 employees	QUANTITY	0.95+
first generation	QUANTITY	0.92+
11 lines	QUANTITY	0.92+
one thing	QUANTITY	0.91+
2019	DATE	0.89+
Informatica World 2019	EVENT	0.88+
SaaS	TITLE	0.86+
a decade ago	DATE	0.85+
thousands of data scientists	QUANTITY	0.84+
SAS	ORGANIZATION	0.84+
this weekend	DATE	0.82+
last couple years	DATE	0.81+
Informatica World	TITLE	0.62+
